Comments on Thai 101: RID online version 0.9: a closer look

Well I'd thought I'd better make sure it did pass ...

2008-02-09T21:02:00.000+07:00

Well, I thought I'd better make sure it did pass them all before posting; I didn't think it would look too impressive otherwise! I'd be interested to know if you managed to catch it out using any other words, though. There's a bit of a weakness in it for catching ร / ล mistakes, but that should be fixed by tomorrow.

The เหรียน typo is the most interesting example of the difficulties involved, I think. Both เรียน and เหรียญ are at a Levenshtein distance of one from it, and เรียน is far and away the more common word, yet there's no doubt someone typing เหรียน is much more likely to mean เหรียญ. The way I did it is by converting to phonetic form first and ranking phonetic matches higher than Levenshtein distance matches, but I'm sure there are other ways of going about it. Trying Google, I noticed they get it right too, and I wonder what their method is. I doubt they do a phonetic conversion; my guess would be that with their huge traffic they can track click counts on the 'Did you mean....?' suggestion links to get an idea of which is the most accurate when there are a couple of possibilities. That would have the additional advantage of working for every language rather than just Thai.
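A minimal sketch of that ranking idea, not the actual code behind the page. The phonetic keys and frequency counts below are hand-entered toy values, and the names `PHONETIC`, `FREQUENCY` and `suggest` are made up for illustration; a real system would need an actual Thai grapheme-to-phoneme step in place of the lookup table.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, counted over Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Toy data, assumed for illustration only: rough "sound/tone" keys
# and invented frequency counts.
PHONETIC = {"เรียน": "rian/mid", "เหรียญ": "rian/rising", "เหรียน": "rian/rising"}
FREQUENCY = {"เรียน": 1000, "เหรียญ": 100}

def suggest(typo: str) -> list[str]:
    """Rank dictionary words: phonetic match first, then distance, then frequency."""
    key = PHONETIC.get(typo)

    def rank(word: str):
        same_sound = key is not None and PHONETIC.get(word) == key
        return (not same_sound, levenshtein(typo, word), -FREQUENCY[word])

    return sorted(FREQUENCY, key=rank)

print(suggest("เหรียน"))
```

Both candidates tie on edit distance, so without the phonetic tier the more frequent เรียน would win; with it, เหรียญ comes first.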

Well, I'm embarrassed by my last comment. I was so...

2008-02-09T16:03:00.000+07:00

Well, I'm embarrassed by my last comment. I was so busy tripping over myself to play with your spelling checker that I didn't see the part where you'd already tested it against my examples. What can I say, I think I might be that guy they're always referring to in legal cases... :P

Mike, thanks for sharing the link. It scored 11/11...

2008-02-09T13:25:00.000+07:00

Mike, thanks for sharing the link. It scored 11/11 on my simple tests, compared to RID's 1/11. But then I have to assume you already knew it would. :)

As for Levenshtein distance, I'm familiar with the idea, though I'd forgotten the name. It comes in particularly handy for typos like ประเสิรฐ, where all the characters are right and there's just an error of placement, among other types of typo.
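One detail worth noting for placement errors like this: plain Levenshtein distance counts a swap of two adjacent characters as two edits. A common variant, the optimal string alignment distance (a restricted form of Damerau-Levenshtein), counts it as one. The sketch below is a generic implementation of that variant, not anything taken from the spelling checker discussed here; the correct spelling ประเสริฐ is supplied for the comparison.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete everything in a[:i]
    for j in range(cols):
        d[0][j] = j  # insert everything in b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Two adjacent characters swapped: count as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("ประเสิรฐ", "ประเสริฐ"))
```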

"I wonder how hard it would be to implement someth...

2008-02-09T13:08:00.000+07:00

"I wonder how hard it would be to implement something like Soundex for Thai."

I've actually been working on something similar to this myself; you can try it out if you like. You'll have to forgive the lack of design on the page though, as it was just a page I was using for testing how it works, not one I intended for public consumption.

It works by using a combination of phonetic matching and a fancy mathematical similarity rating :). I'm still testing it out at the moment, but in its current state it gets all your examples above correct, Rikker. With ปรอท, if you'd forgotten the exact spelling but knew the pronunciation was ปะหฺรอด, you could enter that (with or without the pinthu) and it comes up with ปรอท as the first match.
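Accepting the respelling with or without the pinthu can be as simple as dropping that mark before matching. This is only a guess at one way to do it, not the page's actual code; the function name `normalize_respelling` is made up, and the only fact relied on is that the pinthu is the Unicode character U+0E3A (THAI CHARACTER PHINTHU).

```python
PHINTHU = "\u0e3a"  # Thai pinthu, the dot written under a consonant

def normalize_respelling(text: str) -> str:
    """Remove every pinthu so both input variants compare equal."""
    return text.replace(PHINTHU, "")

with_mark = "ปะห\u0e3aรอด"
without_mark = "ปะหรอด"
print(normalize_respelling(with_mark) == normalize_respelling(without_mark))
```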

As Mike, or Glenn at Thai-Language.com, can attest...

2008-02-08T15:26:00.000+07:00

As Mike, or Glenn at Thai-Language.com, can attest, it's true what the adage says: the last 10% takes 90% of the effort.

I wonder how hard it would be to implement something like Soundex for Thai. A quick Google search shows I'm not the first person to think of this. See here and here for mentions of Thai Soundex, though you can't access the actual articles.

"Mechanically converting Thai words into phonetic ...

2008-02-08T15:16:00.000+07:00

"Mechanically converting Thai words into phonetic forms is not difficult. "

I'd really have to differ on that one. It's not that difficult for about 80% of Thai words, but once you get into the remainder, with the implicit vowels, silent letters and vowels, implicit 'linker' syllables and the like, it gets a lot more difficult very quickly!

Yep, I'm with you on that one. And yet I prefer t...

2008-02-08T14:27:00.000+07:00

Yep, I'm with you on that one. And yet I prefer the simplicity of this site to the equally clueless busy-ness of their main site.

The unfortunate situation we find ourselves in is that they've set the bar so low that this is a welcome improvement.

The state of things before this fix, when you still got 40 "not found" messages drowning out your found results, was a truly awful low. To think it stayed that way for half a year...

I have to admit, the site is still disappointing o...

2008-02-08T14:22:00.000+07:00

I have to admit, the site is still disappointing on the whole. Hello! 1995 called. They want their web developer back.

The webmasters can't even bother to set the content-type of the page, which encourages people to set their browser to use TIS-620 by default, which is the cause of so many "font" (actually character set) rendering problems in the greater Thai IT community.

Next, a simple search for "*" throws an error, indicating that the developer couldn't even be bothered to do any work at all. It's really awful and shameful. I would expect better from a high school student.

Not at all, not at all. It's just a good example o...

2008-02-08T14:17:00.000+07:00

Not at all, not at all. It's just a good example of words that one would want to find the full set of.

Also, phonetic conversion is one way of going about it, but it can be tricky. The old RID online contained the full phonetic for all words, not just words with ambiguous orthography, so they should have a good set of data to start from.

Another way of doing basically the same thing would be query expansion. Searching for, say, ประทาน would search for anything that matched ป[รล]ะ[ทธฒฑ]า[นณญลร]. You could define the set of characters based on whether the character was in initial or final position. So น would expand to [นณ] syllable-initially, and to [นณญลร] syllable-finally.
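Here is a rough sketch of that expansion, using only the character sets mentioned above. The names `INITIAL`, `FINAL` and `expand` are made up for illustration, and the position test is a deliberate simplification: only the last character of the whole word is treated as syllable-final, since real syllable segmentation is a much harder problem.

```python
import re

# Character sets taken from the example above; a full table would cover
# every group of consonants that sound alike in each position.
INITIAL = {"ร": "รล", "ล": "รล", "ท": "ทธฒฑ", "ธ": "ทธฒฑ",
           "ฒ": "ทธฒฑ", "ฑ": "ทธฒฑ", "น": "นณ", "ณ": "นณ"}
FINAL = {ch: "นณญลร" for ch in "นณญลร"}

def expand(word: str) -> str:
    """Turn a word into a regex that also matches sound-alike spellings."""
    parts = []
    last = len(word) - 1
    for i, ch in enumerate(word):
        # Simplifying assumption: only the word's last character is "final".
        table = FINAL if i == last else INITIAL
        alternatives = table.get(ch)
        parts.append(f"[{alternatives}]" if alternatives else re.escape(ch))
    return "".join(parts)

pattern = expand("ประทาน")
print(pattern)
print(bool(re.fullmatch(pattern, "ประธาน")))
```

The resulting pattern could then be run against the headword list, so a search for ประทาน would also turn up ประธาน.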

I'd like to know the specifics of what they're doing now. Can anyone reverse engineer it based on the results we get? Look at a search for ปรา, for example. It appears to do cluster simplification, add tone marks, and match all one-character-off words (e.g. ปราศ and เปรา).

Thanks for the update, Rikker. Is the ใจ* example...

2008-02-08T13:53:00.000+07:00

Thanks for the update, Rikker. Is the ใจ* example a subtle jab at me indicating that I should have done more thorough research for my ใจ post last week?

I agree that many misspellings should be caught. Mechanically converting Thai words into phonetic forms is not difficult. Then you just compare the word given with all other words that match the phonetic form. In fact, that would be a great resource for people to look up words they are unsure of. (In particular, foreigners and young students could benefit from that.)