June 27, 2007

Thai and Google

The internet is suddenly a richer place for the study of Thai. I'm not certain when this change took place, but it definitely happened within this calendar year.

What's the big change, you ask?

Google now automatically segments Thai text when it searches. That is to say, if you're searching for the word ความ, Google will now find it within a phrase like, say, ไม่มีความหมาย. Maybe this seems like a little thing, but it's not. This is huge.
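
For the curious, here is a rough sketch of what "segmenting" means in practice. It is a toy greedy longest-match segmenter, not Google's actual method (which I don't know); the function name `segment` and the four-word `LEXICON` are made up for illustration, and real segmenters rely on far larger dictionaries or statistical models.

```python
# Toy lexicon (an assumption for illustration, not a real Thai dictionary).
LEXICON = {"ไม่", "มี", "ความ", "หมาย"}


def segment(text, lexicon):
    """Greedy longest-match segmentation.

    At each position, take the longest lexicon entry that matches;
    if nothing matches, emit a single character and move on.
    """
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                break
        else:
            piece = text[i]  # unknown character: fall back to one char
        tokens.append(piece)
        i += len(piece)
    return tokens


phrase = "ไม่มีความหมาย"
tokens = segment(phrase, LEXICON)
print(tokens)
print("ความ" in tokens)  # True: the word is now a searchable unit
```

Once the phrase is broken into word-sized tokens, a search for ความ can match one of them even though the original text contains no spaces.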

This literally opens up the entire presence of the Thai language on the internet to study. It's amazing to think how limited Google searches were under the old system.

What was searching on Google like before?

Well, the word or phrase you were searching had to be bounded on both sides by either a space or some kind of punctuation--something to tell the search engine where the word breaks were. There was a workaround--segmenting the text yourself using the tag <wbr> between each word in the HTML, if you were determined to make your website more easily searchable. But what a pain.
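
To make the old behavior concrete, here is a small sketch. It is only my assumed model of a boundary-based search (split on whitespace and a handful of punctuation marks, then look for exact token matches), not any engine's real code; the helper names `boundary_tokens`, `old_style_match`, and `with_wbr`, and the hand segmentation of the phrase, are all hypothetical.

```python
import re


def boundary_tokens(text):
    """Split only on whitespace and common punctuation (assumed old behavior)."""
    return [t for t in re.split(r"[\s.,;:!?()-]+", text) if t]


def old_style_match(query, text):
    """True only if the query appears as a whole boundary-delimited token."""
    return query in boundary_tokens(text)


def with_wbr(words):
    """Join hand-segmented words with <wbr>, the workaround described above."""
    return "<wbr>".join(words)


phrase = "ไม่มีความหมาย"
print(old_style_match("ความ", phrase))  # False: no boundaries inside the phrase

# Hand segmentation (an assumption for illustration).
marked_up = with_wbr(["ไม่", "มี", "ความ", "หมาย"])
print(marked_up)

# Assuming an indexer treats <wbr> as a word boundary:
as_indexed = marked_up.replace("<wbr>", " ")
print(old_style_match("ความ", as_indexed))  # True
```

The catch, of course, is that someone has to supply that word list by hand for every sentence on the page.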

Now you can actually get useful stats on the frequency of words and phrases on the magical interwebs.

I don't use any other search engine besides Google regularly, but I went ahead and bounced around to some others, and Yahoo! also does segmenting. Don't know when that happened. But MSN and Ask.com don't. The numbers speak for themselves. Take การ and ใจ, two extremely high-frequency words in Thai:

Google: 103,000,000 hits
Yahoo: 30,600,000 hits
MSN: 118,288 hits
Ask.com: 19,200 hits

Google: 21,100,000 hits
Yahoo: 5,880,000 hits
MSN: 130,138 hits
Ask.com: 2,010 hits

Pretty striking, no?

