February 8, 2008

RID online version 0.9: a closer look

As I surmised in last night's post would happen, the online Royal Institute Dictionary got more than just a makeover today. Some of my previous gripes have been dealt with, either in today's update or else recently. Some of the changes:
  • You no longer get 40 "not found" messages along with your results.
  • Wildcard searching now works like it should, both for queries like ใจ* and *ใจ.
  • Even though large chunks of the text are still missing from the browsable text, you can now find words within those sections, such as ใส่ and ตาม, with the search function. Better than nothing.
There's also a new suggestion feature when your query returns no results. I discovered it by searching โหร- and it returned โหร. Turns out there's no entry with a hyphen. Suggestion is a great idea, but my first reaction is that most of the time no suggestion is given. I tested it with various words to see what it suggested.

ประเสิรฐ is a simple misspelling of ประเสริฐ, but no suggestion was made.
สัญญา returned สัญญา, but สัญญาณ would have been what I was looking for.
เศษฐกิจ returned เศรษฐกิจ, as expected.
สัญชาตญา returned nothing. The correct spelling is สัญชาตญาณ.
เหรีย returned เหียน, but I was hoping for เหรียญ.

A few more simple misspellings:
โอกา returned nothing.
ประกา returned nothing.
อากา returned nothing.
โลกาภิวัน์ returned nothing.
ปราก returned nothing.
ประมา returned nothing.

This feature isn't doing so well in my tests--all of which should have been very easy for it. There's a lot they could do to easily improve this feature. Like swap out similar consonants for each other, for example. If it did that it would pass pretty much all of my tests. It's probably using a fancy mathematical similarity rating, when a common sense approach to detecting spelling errors would be effective in most real-world scenarios. Another thing: suggestions are only given when no result is found for your query. Sometimes you can't find the right spelling for the word you want, but your misspellings coincide with other real words (like the time I just couldn't seem to find the right spelling of ปรอท [ปะหฺรอด]). It would be nice to see similarity suggestions as an option for all queries.
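
A minimal sketch of the consonant-swap idea, in Python. The sound-alike groups, the function names, and the tiny word list are all assumptions made for illustration; nothing here reflects how the RID site actually works, and a real version would need position-aware groups (ญ only sounds like น at the end of a syllable, for instance).

```python
# Hypothetical groups of consonants that are easy to confuse in spelling.
SOUND_ALIKE = ["ศษส", "นณญ", "รล"]


def swap_variants(word):
    """Return every spelling made by swapping one consonant for a sound-alike."""
    variants = set()
    for i, ch in enumerate(word):
        for group in SOUND_ALIKE:
            if ch in group:
                for alt in group:
                    if alt != ch:
                        variants.add(word[:i] + alt + word[i + 1:])
    return variants


def suggest(query, dictionary):
    """Return the swapped spellings that are real headwords, sorted."""
    return sorted(v for v in swap_variants(query) if v in dictionary)


# Illustrative word list only, not the RID headword list.
words = {"เหรียญ", "เรียน", "โอกาส"}
print(suggest("เหรียน", words))
print(suggest("โอกาศ", words))
```

Because it only ever tries one-consonant swaps, this catches sound-alike substitutions but not dropped or transposed characters, which is where an edit-distance measure would still earn its keep.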

There are other new "features", too.
Beyond fixing the simple asterisk wildcard search, there are also plenty of unadvertised ways to search, if you're familiar with some basic regular expression notation. Some of this worked before, but I'll try to cover it in more detail later.

All said, this is a nice little upgrade for RID99 online.


  1. Thanks for the update, Rikker. Is the ใจ* example a subtle jab at me indicating that I should have done more thorough research for my ใจ post last week?

    I agree that many misspellings should be caught. Mechanically converting Thai words into phonetic forms is not difficult. Then you just compare the word given with all other words that match the phonetic form. In fact, that would be a great resource for people to look up words they are unsure of. (In particular, foreigners and young students could benefit from that.)

  2. Not at all, not at all. It's just a good example of words that one would want to find the full set of.

    Also, phonetic conversion is one way of going about it, but it can be tricky. The old RID online contained the full phonetic for all words, not just words with ambiguous orthography, so they should have a good set of data to start from.

    Another way of doing basically the same thing would be query expansion. Searching for, say, ประทาน would search for anything that matched ป[รล]ะ[ทธฒฑ]า[นณญลร]. You could define the set of characters based on whether it was in initial or final position. So น would expand to [นณ] syllable-initially, and to [นณญลร] syllable-finally.

    I'd like to know the specifics of what they're doing now. Can anyone reverse engineer it based on the results we get? Look at a search for ปรา, for example. It appears to do cluster simplification, add tone marks, and match all one-character-off words (e.g. ปราศ and เปรา).
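
The query expansion described in this comment can be sketched in Python as below. The character sets are the ones given in the comment; the function names are made up for illustration, and the position rule is a loud simplification: only the last character of the query is treated as syllable-final, where a real version would need proper Thai syllable segmentation.

```python
import re

# Character sets taken from the comment; a fuller table is left as an assumption.
INITIAL_SETS = ["รล", "ทธฒฑ", "นณ"]
FINAL_SETS = ["นณญลร"]


def char_class(ch, sets):
    """Return a regex character class for ch, or ch itself if it has no set."""
    for group in sets:
        if ch in group:
            return "[" + group + "]"
    return re.escape(ch)


def expand_query(word):
    """Build a regex that also matches sound-alike spellings of word."""
    parts = []
    for i, ch in enumerate(word):
        # Simplification: treat only the last character as syllable-final.
        is_final = i == len(word) - 1
        parts.append(char_class(ch, FINAL_SETS if is_final else INITIAL_SETS))
    return "".join(parts)


pattern = expand_query("ประทาน")
print(pattern)
print(bool(re.fullmatch(pattern, "ประธาน")))
```

One pattern can then be run against the whole headword list in a single pass, rather than generating and looking up each alternative spelling separately.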

  3. I have to admit, the site is still disappointing on the whole. Hello! 1995 called. They want their web developer back.

    The webmasters can't even bother to set the content-type of the page, which encourages people to set their browser to use TIS-620 by default, which is the cause of so many "font" (actually character set) rendering problems in the greater Thai IT community.

    Next, a simple search for "*" throws an error, indicating that the developer couldn't even be bothered to do any work at all. It's really awful and shameful. I would expect better from a high school student.

  4. Yep, I'm with you on that one. And yet I prefer the simplicity of this site to the equally clueless busy-ness of their main site.

    The unfortunate situation we find ourselves in is that they've set the bar so low that this is a welcome improvement.

    Before this fix, when you still got 40 "not found" messages drowning out your found results, was a truly awful low. To think it stayed that way for half of a year...

  5. "Mechanically converting Thai words into phonetic forms is not difficult. "

    I'd really have to differ on that one. It's not that difficult for about 80% of Thai words, but once you get into the remainder with the implicit vowels, silent letters and vowels, implicit 'linker' syllables and the like it gets a lot more difficult very quickly!

  6. As Mike, or Glenn at Thai-Language.com, can attest, it's true what the adage says: the last 10% takes 90% of the effort.

    I wonder how hard it would be to implement something like Soundex for Thai. A quick Google search shows I'm not the first person to think of this. See here and here for mentions of Thai Soundex, though you can't access the actual articles.

  7. "I wonder how hard it would be to implement something like Soundex for Thai."

    I've actually been working on something similar to this myself, you can try it out if you like. You'll have to forgive the lack of design on the page though, as it was just a page I was using for testing how it works, not one I intended for public consumption.

    It works by using a combination of phonetic matching and a fancy mathematical similarity rating :). I'm still testing it out at the moment, but in its current state it gets all your examples above correct Rikker. With ปรอท , if you'd forgotten the exact spelling but knew the pronunciation was ปะหฺรอด, you could enter that in (with or without the pinthu) and it comes up with ปรอท as the first match.

  8. Mike, thanks for sharing the link. It scored 11/11 on my simple tests, compared to RID's 1/11. But then I have to assume you already knew it would. :)

    As for Levenshtein distance, I'm familiar with the idea, though I'd forgotten the name. It comes in particularly handy for typos like ประเสิรฐ, where all the characters are right, there's just an error of placement. Among other typo types.

  9. Well, I'm embarrassed by my last comment. I was so busy tripping over myself to play with your spelling checker that I didn't notice the part where you'd already tested it against my examples. What can I say, I think I might be that guy they're always referring to in legal cases... :P

  10. Well, I thought I'd better make sure it did pass them all before posting, I didn't think it would look too impressive otherwise! I'd be interested to know if you managed to catch it out using any other words though. There's a bit of a weakness in it for catching ร / ล mistakes, but that should be fixed by tomorrow.

    The เหรียน typo is the most interesting example of the difficulties involved I think. Both เรียน and เหรียญ have a Levenshtein distance of one and เรียน is far and away the more common word, but yet there's no doubt someone typing เหรียน is much more likely to mean เหรียญ . The way I did it is by converting to phonetic form first and ranking that higher than Levenshtein distance matches, but I'm sure there are other ways of going about it. Trying Google I noticed they get it right too, and I wonder what their method is. I doubt they do a phonetic conversion, instead my guess would be with their huge traffic they can track click counts on the 'Did you mean....?' suggestion links to get an idea of which is the most accurate when there's a couple of possibilities. That would have the additional advantage of working for every language rather than just Thai too.
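
The ranking described in the last comment (phonetic match first, Levenshtein distance second) can be sketched roughly as follows. This is not the commenter's actual code: the phonetic key is a deliberately crude stand-in that only folds a few sound-alike consonants together, and the names `SAME_SOUND`, `phonetic_key`, and `rank_suggestions` are hypothetical.

```python
# Crude stand-in for a phonetic conversion: fold a few sound-alike
# consonants onto one representative. A real converter does far more.
SAME_SOUND = str.maketrans({"ญ": "น", "ณ": "น", "ศ": "ส", "ษ": "ส"})


def phonetic_key(word):
    """Return a rough sound key for word (illustrative assumption only)."""
    return word.translate(SAME_SOUND)


def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def rank_suggestions(query, candidates):
    """Sort candidates: phonetic matches first, then by edit distance."""
    key = phonetic_key(query)
    return sorted(candidates,
                  key=lambda w: (phonetic_key(w) != key, levenshtein(query, w)))


print(levenshtein("เหรียน", "เรียน"), levenshtein("เหรียน", "เหรียญ"))
print(rank_suggestions("เหรียน", ["เรียน", "เหรียญ"]))
```

Both candidates sit at distance one from เหรียน, so edit distance alone cannot separate them; the phonetic key is what puts เหรียญ ahead. A transposition such as ประเสิรฐ counts as two edits under plain Levenshtein, so a variant that scores swaps of adjacent characters as one edit (Damerau-Levenshtein) may suit that kind of typo better.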