October 26, 2007

Classifiers! Get your classifiers!

The Royal Institute, your friendly neighborhood language regulating organization, is all about the dissemination of knowledge. Their website is, frankly, not very good (too much style over functionality--their fancy menus don't display correctly in any browser but IE, blech), but I'll be darned if it isn't packed with goodies. Difficult to access, often half-baked goodies, but goodies nonetheless.

Case in point: classifiers. In Thai we know them as ลักษณนาม [ลัก-ษะ-ณะ-นาม, in case the four consonants in a row were throwing you]. As the Thai word indicates, they are a type of นาม (noun, though this word also means name) that tells you a characteristic (ลักษณะ). So, ลักษณะ + นาม (the final vowel ะ disappears because of สมาส combining rules).

Now, classifiers can be quite a chore to remember. I mean, how often do you use กระบอก, the classifier for gun? Or remember that a pair of pants is only a ตัว, but a pair of socks is a คู่? And many a learner forgets, in the heat of the conversation, that a pencil is a แท่ง but a pen is a ด้าม.

There is some research that suggests that native speakers don't employ nearly so complex a set of classifiers as the textbooks would lead you to believe, and use a lot of generic and repeater classifiers (nouns which can be used as their own classifiers, like ร้านร้านหนึ่ง "a store"). Interestingly, it also shows that younger speakers are better at using the Royal Institute's prescribed classifiers (presumably because, statistically, younger Thais are better educated than their elders). In my uninformed opinion, it's a matter of situation. In a more formal situation, or especially in writing, it's more important to know the standard. In informal conversation, it's easy to get by using a lot of อัน and ตัว.

Which all brings me to this link. It is the full text of the Royal Institute's useful (if unimaginatively titled) booklet, ลักษณนาม ฉบับราชบัณฑิตยสถาน (roughly, The Royal Institute Book of Classifiers). It's the text of the sixth printing, to be precise. Kudos to them for putting it online. It is organized alphabetically by the word you want the classifier of. In other words, look for นาฬิกา to find เรือน. The most glaring shortcoming is the lack of searchability, followed closely by the decision to arbitrarily number the 22 pages of text, sometimes with multiple consonants' worth of classifiers on one numbered page. Even when you know the letter of the alphabet, you still have to hunt around to find exactly which page it's on. The next obvious shortcoming is the inability to see what words share a certain classifier. For example, you can't look up in one shot all the words the Royal Institute says it's okay to use ตัว as a classifier for.

In an effort to remedy this, and completely without the Royal Institute's knowledge or permission, I've taken their text and put it in a Google Spreadsheet, viewable here. I've divided it up into different "sheets" by letter. This doesn't solve all of the problems, but it's a start. This should serve to make the Royal Institute's data more valuable to language learners like you and me. (Feel free to contact me for the XLS or DOC or TXT files I made from the RI data.)
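For anyone working from the exported files, the "which words share a classifier" lookup amounts to inverting the noun-to-classifier list. Here is a minimal Python sketch of that idea. The handful of entries below are only the examples mentioned in this post (with the Thai nouns for gun, pants, socks, pencil and pen filled in for illustration), not the Royal Institute's data, and the names `NOUN_TO_CLASSIFIERS` and `invert` are made up for the example.

```python
from collections import defaultdict

# Illustrative sample only: the examples from this post, keyed by noun.
# The real booklet is far larger; this is NOT the Royal Institute's data.
NOUN_TO_CLASSIFIERS = {
    "นาฬิกา": ["เรือน"],          # clock/watch
    "ปืน": ["กระบอก"],           # gun
    "กางเกง": ["ตัว"],            # pants
    "ถุงเท้า": ["คู่"],             # socks
    "ดินสอ": ["แท่ง"],            # pencil
    "ปากกา": ["ด้าม"],            # pen
    "ช้าง": ["เชือก", "ตัว"],      # elephant (more than one classifier)
}


def invert(noun_to_classifiers):
    """Turn a noun -> classifiers mapping into classifier -> sorted nouns."""
    index = defaultdict(list)
    for noun, classifiers in noun_to_classifiers.items():
        for classifier in classifiers:
            index[classifier].append(noun)
    # Sort each noun list (by code point) so lookups give a stable order.
    return {classifier: sorted(nouns) for classifier, nouns in index.items()}


CLASSIFIER_TO_NOUNS = invert(NOUN_TO_CLASSIFIERS)
```

With this in hand, `CLASSIFIER_TO_NOUNS["ตัว"]` lists every sample noun that takes ตัว in a single lookup.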

And let me know in the comments if you find a particularly interesting classifier you didn't know existed before.

4 comments:

  1. It seems to me a comprehensive, searchable list of classifiers would be a very useful thing to have, but there doesn't seem to be one out there. Another thing that would be useful is the ability to know which classifiers are the most common, for nouns which can be classified by more than one (e.g. how common is เชือก vs ตัว for classifying ช้าง).

    I had this idea to create one for the upcoming t2e "v2.0" by:

    - Indexing the text of thousands of Thai web pages.
    - Spacing the words out, and analyzing the text for noun/classifier patterns like Noun + Classifier + นี้/นั้น/Adjective or Noun + หลาย/บาง/ทุก/etc + Classifier.
    - Storing all the stats in the database, and making it fully searchable.

    The problem I found was that, particularly for less commonly used nouns, the noun/classifier patterns actually don't occur that often, so the amount of text that needs to be processed in order to be confident in the result is very high. My initial version used the indexed text from about 80,000 webpages, but this is nowhere near enough; I'm thinking around a million is probably necessary. That's going to take a while to do, so it's still an ongoing project for now.

    BTW, is it possible to make this comment box any wider? It's a bit difficult to use when typing long comments.

  2. t2e 2.0, eh? I'm intrigued.. care to reveal what other new features we can look forward to? If you don't want to publish it in the comments here, you can always drop me an email at the address at the bottom of the page. :)

    That's a great idea, by the way, and I'm not surprised it's more challenging than it would at first seem. There are so many things that can complicate the sentence and get in the way of easy parsing.

    As for the comment box... I dunno. I'll look into it.

  3. Well, www.thai2english.com/new.html has some of the new features on it, and I'll have a full list of changes when it's ready.

    Given the amount of time it takes to do the classifier list though, that part of it will take a bit longer and probably won't be ready until sometime in December.

  4. Downloadable, eh? Interesting. Look forward to the full list of changes.

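The pattern search proposed in the first comment can be roughed out in a few lines of Python. This is only a sketch under some big assumptions, and not how t2e actually does it: the text has already been split into words (Thai is written without spaces, so that step needs a separate segmenter), the sets of nouns and classifiers are known in advance, and only two of the patterns are handled (Noun + Classifier + นี้/นั้น and Noun + หลาย/บาง/ทุก + Classifier). The Adjective variant is left out because it would need part-of-speech tagging. The name `count_noun_classifier_pairs` is invented for the example.

```python
from collections import Counter

DEMONSTRATIVES = {"นี้", "นั้น"}        # "this", "that"
QUANTIFIERS = {"หลาย", "บาง", "ทุก"}   # "several", "some", "every"


def count_noun_classifier_pairs(tokens, nouns, classifiers):
    """Count (noun, classifier) pairs in an already-segmented token list.

    Matches two three-word patterns:
      Noun + Classifier + demonstrative
      Noun + quantifier + Classifier
    """
    counts = Counter()
    for i in range(len(tokens) - 2):
        first, second, third = tokens[i:i + 3]
        if first not in nouns:
            continue
        if second in classifiers and third in DEMONSTRATIVES:
            counts[(first, second)] += 1
        elif second in QUANTIFIERS and third in classifiers:
            counts[(first, third)] += 1
    return counts
```

Dividing one pair's count by the total for that noun gives a rough answer to the "how common is เชือก vs ตัว for ช้าง" question, though as the comment points out, rare nouns need a very large amount of text before the numbers mean much.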