February 14, 2008

Wordcounts in RID

Commercial dictionaries often use their wordcount as a selling point. This was particularly true during the heyday of the "unabridged" English dictionaries, when publishers competed to claim the highest number of words. Those claims deserve skepticism. For one thing, no dictionary has ever been truly unabridged; the label is just a marketing gimmick. For another, what constitutes a "word" is itself problematic--do you count entries? Do you count senses? Some publishers certainly inflated and even fudged their counts on purpose. Unabridged dictionaries eventually boasted figures past the half-million mark. In 1961, Webster's Third New International Dictionary broke with that trend, following strict guidelines about which words were and weren't suitable for inclusion. It nixed all proper nouns, for instance. While its 1934 predecessor, Webster's Second, had claimed 600,000 words, the Third claimed only 450,000. (Ironically, its publishers still chose to add Unabridged to the title in later printings. Goes to show how meaningful that word is.)

Obviously, wordcount isn't a good measure of dictionary quality.

In Thai we don't see this as much. The notable exception that I've seen is Wit Thiengburanatham (วิทย์ เที่ยงบูรณธรรม), whose library-size Thai-English dictionary boasts on its front cover that it contains more than 80,000 words. Anyone who has used this dictionary knows what a large percentage of these are flora and fauna names, which I would guess make up most of the difference. These words are certainly valuable to record, but it goes to show why a huge count doesn't necessarily mean more words that will be useful to the average user. But even if wordcounts aren't a good gauge of overall quality, they can still be a (very rough) gauge of coverage.

In 2005, during the early days of the project that became my senior thesis, I wanted to know how RID stacked up. Believe it or not, in this computer age, I counted the entries by hand. I had no digital texts I could use, and since I watched a lot of TV anyway, whenever I found myself with downtime I'd count away, doing each letter of the alphabet separately, periodically jotting the number down on the corner of the page so I wouldn't lose my count. I kept a table on the flyleaf. I did one pass to count headwords (e.g. ใจ) and another pass to count subheads (e.g. ใจดี). Oh yeah, and did I mention I did this for the 1950, 1982 and 1999 editions? I know, it's embarrassing to admit, like saying I translated Harry Potter into Klingon. But lest you should mourn for my social life, I was engaged when I started and married by the time I finished. My wife thought I was crazy. Still does, most of the time. I also wonder.

Nowadays, I have digital texts of RID82 and RID99 (although the latter has lots of missing entries). And though I could check my counts, I haven't. The percentage difference is sure to be small. Or maybe I'm just scared I'll discover that I don't know how to count as well as I thought I did.

                RID50 to RID82    RID82 to RID99
Net Increase    9,789             5,288
% Increase      40.6%             15.6%

The curious can see letter-by-letter breakdowns in the spreadsheet I made at the time, which I've uploaded to Google Docs. (It also contains a partial count of Matichon.)

It's interesting to note that while there were significant gains in both headwords and subheads between RID50 and RID82, there were fewer than a thousand new headwords in RID99, but upwards of 5,000 subheads. The differences aren't just new words, though. There are lots of changes (large and small) to the definitions between editions, mostly in the direction of beefing them up and adding more senses. And even though the total number of heads and subheads in RID99 is less than half of what Dr. Wit's dictionary claims, this does not even begin to capture the number of senses in the dictionary, so it's difficult to compare the two.

In the end, though, it's clear RID isn't anywhere near a complete record of the Thai language. It's probably a nearly complete record of Standard Thai, that mythical creature that is taught to all and spoken by none. Actual Thai is much harder to put in a book, much harder even to define.

Contenders in the commercial sector like Matichon are helpful, because by taking the descriptive approach to lexicography, they're helping force the Royal Institute to come out of its prescriptive shell, as seen in last year's release of their Dictionary of New Words, Vol. 1 (พจนานุกรมคำใหม่ เล่ม ๑). That's a good thing, because unless they continue to adapt their methods, their dictionary will become obsolete. If they keep up the pace of their last dictionary, the next edition of RID will be released in 2024. And if there are only a few thousand new words and compounds in it, people will wonder if the decades of work were worth the trouble.


  1. Can anyone tell me why I'm getting that big blank space before the table? It's there in both FF and IE. It doesn't show up while I'm writing the post, nor when I go back to edit it. I've tried changing lots of things around to see if it might display normally. Help!

  2. I'm not sure if blogger lets you edit the HTML directly, but if it does you can get rid of the big blank space by removing the <br /> between the <tr> and <td> tags.

    That is, from <tr><br /><td> to <tr><td> and </td><br /></tr> to </td></tr>

  3. Brilliant. It does let you edit the HTML in a primitive way, but it was adding in the <br /> tags automatically. So I just had to go in and make the whole table one line of code. Thanks!

  4. "Unabridged" doesn't mean 'complete,' it just means 'not abridged' (in contrast to a smaller, abridged dictionary). My Harper-Collins German Dictionary is the Unabridged Edition in contrast to the College Dictionary (and whatever other smaller versions they put out); similarly the M-W Unabridged is called that in contrast to their Collegiate Dictionary. No reputable dictionary would claim to have every word in the language.

  5. If that were the common conception, that would be one thing.

    Of course no dictionary contains every word, but that's what the term implies to the user, and wordcount has long been a marketing strategy for "proving" coverage, even though there's no standard way to count.

    Just read the M-W entry for 'unabridged'--it reads like it was written by an apologist:

    1 : not abridged : complete <an unabridged reprint of a novel>

    2 : being the most complete of its class : not based on one larger <an unabridged dictionary>

    Sense 1 is complete, ahem, cough, cough. 'Unabridged' in the sense of 'the most inclusive dictionary produced by a given publisher' requires a new definition. And it goes against the common conception of "the dictionary" as this Platonic ideal of an authoritative book that defines every word. Believe me, I know that's a B.S. idea, but why do you think we use the definite article anyway? Most folks don't check "a dictionary", they check "the dictionary".

    During the heyday of the unabridged dictionary, they were marketed under definition 1, even if the lexicographers themselves subscribed to definition 2.

    As I alluded to in my post, I think this discrepancy is demonstrated by the difference between Webster's Second and Third. The Second had 600,000 words, and the Third had 450,000. Even though the Third is clearly "abridged" in comparison to its previous incarnation, they still market it as "unabridged" in comparison to College and other versions of the Dictionary (but only several years later). That's a marketing justification if I've ever heard one.

    So, yeah. That's where I stand. Calling it unabridged is meant to give the impression of completeness. If the only purpose were to distinguish the basic dictionary sizes, they could use different nomenclature.

    To get back to Thai, some publishers have a three-way distinction between a small "pocket" edition (ฉบับพกพา chabap phok phaa), a medium "desk" edition (ฉบับตั้งโต๊ะ chabap tang to), and a large "library" edition (ฉบับห้องสมุด chabap hong samut).

    Thai has no industry catchphrase comparable to the English "unabridged," though. You see the word สมบูรณ์ sombuun bandied about, but it's applied to such ridiculously small dictionaries as to be completely meaningless.

    BTW, language, do you write a blog? Your Blogger profile isn't viewable, but if you do I'd love to read it.

  6. Oh, sure, you're absolutely right about the common understanding, and probably about the marketing thing as well, but I just wanted to point out the literal meaning. I'm particularly aware of it because I myself, despite my linguistic training and wide reading, had the common idea in my head until relatively recently, when my mutterings about a word not being in my "unabridged" were interrupted by the thought that, gee, all that means is that it's not abridged from a larger version. (In your example of the 2nd and 3rd Internationals -- which sounds like a discussion of Marxism, doesn't it? -- the Third could be considered an abridgment of the second, but that's not how the term is used in that context.)

    Sorry, I usually sign myself "languagehat" when I remember that it's just going to show up as "language" in Blogger, but I forgot. I blogged your latest post here.

  7. I'm already a regular reader, as it turns out. Glad to have your comments, and thanks for blogging my blog (blog blog blog)!