March 9, 2008

Google's new trick

Sometime last year, Google started segmenting Thai text in webpages it indexes, dramatically increasing the number of hits for pretty much every word out there.

Well, Google has a new trick.

I noticed this week (though I can't confirm exactly when it started) that when you search for a given word, Google also returns pages containing words that differ from your query only in their tone marks. This may be a good thing or a bad thing, depending on what you want to use Google for.

Fortunately, you can get around it by using quotes around your search term. The only exceptions I've noticed to this are non-dictionary words, for which Google still seems to think it knows better than the user.

A few test cases (feel free to replicate these at home):
  • I searched กึง (an onomatopoeic word for a loud noise), fishing for the tone-marked กึ่ง ('half'). Of the 100 results on the first page, only one contained the word I actually searched for. The rest were the much more common word กึ่ง.
  • I searched ดิ้น ('wriggle, struggle'), fishing for ดิน ('dirt'). Similar story to above.
  • I searched กิ๊ฟ (a non-dictionary word, seen in loans like 'gift shop'). It returns a lot of hits for things like Giffarine (spelled กิฟฟาริน). But this is where it gets weird. Putting quotes around it not only fails to restrict the search to my exact query, it actually increases the number of reported hits. I have no idea why.
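
For the curious, here is a rough sketch of what tone-mark-insensitive matching might look like. I don't know how Google actually implements this; the sketch simply assumes the crudest possible approach, which is to strip the four Thai tone marks (U+0E48 through U+0E4B) from both the query and the candidate word before comparing them. The function names are made up for illustration.

```python
# Assumption: tone-insensitive search is modeled as "delete the tone marks,
# then compare". This is an illustration, not Google's actual method.

# The four Thai tone marks: mai ek, mai tho, mai tri, mai chattawa.
THAI_TONE_MARKS = {"\u0e48", "\u0e49", "\u0e4a", "\u0e4b"}


def strip_tone_marks(text: str) -> str:
    """Return text with every Thai tone mark removed."""
    return "".join(ch for ch in text if ch not in THAI_TONE_MARKS)


def tone_insensitive_match(query: str, candidate: str) -> bool:
    """True if the two strings are identical once tone marks are ignored."""
    return strip_tone_marks(query) == strip_tone_marks(candidate)


if __name__ == "__main__":
    # Query/candidate pairs taken from the test cases above.
    pairs = [("กึง", "กึ่ง"), ("ดิ้น", "ดิน")]
    for query, candidate in pairs:
        print(query, candidate, tone_insensitive_match(query, candidate))
```

Under this assumption, กึง and กึ่ง collapse into the same search key, as do ดิ้น and ดิน, which is consistent with the results described in the test cases above.
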
Another thing I've meant to write about is Google's spelling suggestions. One of the great things about search engines these days is the "did you mean..." feature, where they harness the power of the gigantic language corpus that is the web to predict what you probably meant to search for if you misspell a word. I know Google does this for Thai, too (simple test search: ประมาน will return the suggestion ประมาณ). But it only offers suggestions for Thai search terms on the Thai version of their site, Google.co.th. And since I usually use Google.com, not co.th, I have no idea how long they've been doing this. For users of Google.co.th out there, is this a longstanding feature or something new?

10 comments:

  1. Any idea about the following Google phenomenon:

    เอลียาห์ reveals in Google 4.420 hits (only)
    เอเลียส increases hits to 564.000

    The record holder, the google world champion, is, however, เอลียาห revealing 2.190.000 hits....

  2. The spelling correction has been on google.co.th for a few months at least. I'm surprised you use the .com site though, I always get redirected back to co.th whenever I go there.

    Do you find the different tone mark thing useful? I find it very annoying, as it makes using Google as a measure for how common a word is much more difficult. It wouldn't be so bad if it could be bypassed by using quotes, but as you mentioned that often doesn't work. It's also quite simplistic in comparison to their spelling correction - if you search for ไหม่ for instance, the spelling correction correctly suggests ใหม่ whereas the search results are for ไหม.

    Hopefully they either improve or remove it soon, in its present state it's actually driving me to use Yahoo instead sometimes.

  3. @BloggStock: Be careful not to take Google's hit counts at their word. เอลียาห claims more than 2 million hits, but search "เอลียาห" (the same search term in quotes) and I only get a couple hundred reported hits, seemingly every one of which is really just a mis-segmentation of เอลียาห์, with the การันต์.

    Are you suggesting that Google should expand its searches to include, say เอเลียส in a search for เอลียาห์? As it happens, those are both names from the Thai translation of the bible, เอเลียส being Elias and เอลียาห์ being Elijah. Note also that the vowels aren't the same between the two names--ลียา vs. เลีย. I don't think it would be useful or particularly wise on Google's part to try to do this kind of expansion (at least not by default).

    @Mike: There's an easy way around getting rerouted to co.th. Just click the link on the bottom of the co.th page that says 'Google.com in English', or go directly to www.google.com/ncr (which I presume stands for "no country redirect"). On any system I've tried this on, once you've done it, it won't redirect you from then on. Google.com and Google.co.th thus become distinct webpages again.

    I find the tone mark thing annoying almost all the time. I can understand where they're coming from, kinda sorta maybe. But I think it's misguided. It should definitely not be a default option, or at least be easily disabled. It's driven me to use Yahoo instead, too, which incidentally in my experience often has more accurate hit counting (see the เอลียาห example above of 2 million supposed hits with almost no authentic hits).

  4. @rikker: No, I'm not suggesting that Google use a different search strategy, and specifically not that a search for เอเลียส should also include เอลียาห์, the latter with or without การันต์.

    The issue itself derived from a discussion "how to write the German name Elias in Thai letters" in a German discussion forum.

    The problem is that in German "Elias" and "Elija" (English "Elijah") are synonyms, yet they are used differently.

    Taking this quotation from Wikipedia (English)

    His name, Hebrew: אֵלִיָּהוּ / אֵלִיָּה, Standard Eliyáhu / Eliyáh Tiberian ʾĒliyyāhû / ʾĒliyyāh appears in the Greek of the Septuagint as Hλίού, Eliou but in the New Testament it is hellenized as Hλίας, Elias. In Arabic it is إلياس, Ilyaas. It has been variously translated as "Yah is God," "YHWH is my El", "whose God is Yah," "the strong Yah," "God of Yah," "Yah is my God," "my God is Yah," "My God is Jehovah", and according to Strong's Exhaustive Concordance of the Bible "God of Jehovah".

    the "accident" happened at the level of the ancient Greek translation or writing of either the New or the Old Testament.

    Luther, while translating the Bible into German, used Elias...

    Thus, at least in Germany, Elias is the "Protestant" spelling while Elija is the "Catholic" spelling ... of the same name with, ultimately, the same pronunciation. I do not know to what extent this phenomenon exists in English Bibles, and (highly likely) in their translations into Thai.

    In summary and in conclusion: No, I do not want Google to change its strategy. I also want to find the frequency of เอเลียส and เอลียาห์ separately. Perhaps เอลียาห is the nicest solution for a translation of an English Bible, although the ho hip is not actually pronounced (in either English or Thai :-).

  5. @Rikker,

    Actually I'd tried that before but still ended up getting redirected. I'd forgotten that ages ago I'd set Firefox to block the Google cookies, so after changing that it's working ok now. Thanks!

    Although I think personally I prefer the Thai-oriented results that .co.th gives. There is (or was, at least) another useful feature on Google.co.th where it'd find synonyms of your search term, and it seemed to work pretty well to me. For instance, if you searched for "(some word) หมายความว่า" as I do sometimes if I come across an unfamiliar word, it would also match หมายถึงว่า and แปลว่า in addition to หมายความว่า. Another one I noticed was searching for รัชดาภิเษก would match (ถนน)รัชดา , and so on. It doesn't seem to work that way nowadays though, which is a bit of a shame. I guess they're still experimenting to see what works best, so I think there's still hope this irritating tone mark thing will be removed or refined to be useful at some point.

  6. Man, the more I use Google to search Thai nowadays, the more annoying this "feature" gets.

    I hope they kill it, and soon.

  7. Yahoo seems to have been infected by this dubious feature too now - using your earlier example of "กึง" or mine of "ไหม่" neither returns useful results. I'm having to venture out to the relatively uncharted waters of search.live.com to get any decent Thai language search results for uncommon words now. Very frustrating! I haven't emailed Google or Yahoo to actually complain though, maybe that'd be worth a go.

  8. Egads, it's time to start complaining. This is a feature I'm sure now that I neither need nor want. It's an endless headache...

  9. Well I sent Google a message through their contact form (a challenge to find in itself!), who knows if they'll do anything about it but I tried...

    This is what I said:
    "...within the last year or so a change seems to have been made to Google searches in the Thai language, with the effect that a Google search nowadays returns results that seem to ignore tone marks altogether. In my opinion, this often seems to lead to more irrelevant results, as many completely different Thai words differ in their spelling only by the tone mark and so ignoring them brings up matches that are unlikely to be useful. I think it's comparable to what a search that ignored the vowels would be like in English - a search for "pen" is unlikely to be helped by matches for "pan" and "pin" also, but this seems to be roughly the situation for Thai language searches at present.

    The importance of tone marks in Thai spelling also tends to mean that spelling mistakes based on getting the wrong tone mark are relatively rare, and adding a tone mark where there should not be one or vice versa is rarer still, and so any benefit of searching by ignoring the tone marks seems to be outweighed by the disadvantages. Indeed, ignoring the tone marks often conflicts with Google's own spelling suggestions - a search for the misspelling "ไหม่" for instance correctly brings up a suggestion for ใหม่, whereas the results shown are for the much less likely ไหม, despite the fact that it also differs from the misspelling by only one character.

    I would like to request that this ignoring of tone marks be removed, or at least that a way to override it be provided (enclosing the search in quotes doesn't work).

    Many thanks in advance for your consideration."

  10. Have you had a look at this recently, Rikker? It seems to have changed quite a bit for the better along the lines of what we were asking for. For my example of "ไหม่" for instance, the results of the first page are now for both that and ใหม่, not irrelevant ones of ไหม any longer, and it's a similar story with your examples. Testing briefly with a few other searches, it still seems to show you results with other tone marks in some situations, but generally seems to do it in a much more intelligent and relevant way now.

    Definitely looks like a welcome improvement anyway - thanks Google!
