January 25, 2008

Royal Institute to host conference on national language policy

It completely slipped my mind to post about this conference:
NEW DATE: July 4-5, 2008

The Royal Institute of Thailand, in cooperation with UNESCO-Bangkok, UNICEF-Thailand, the Southeast Asian Ministers of Education Organization (SEAMEO), the Embassy of Australia, and SIL International, invites scholars, government officials, and language practitioners from around the globe to submit abstracts for an international conference on “National Language Policy: Language Diversity for National Unity,” July 4-5, 2008, in Bangkok, Thailand. Keynote speakers will include Bernard Spolsky, Joseph Lo Bianco, David Bradley, and other noted language policy experts. Conference topics will include (but are not limited to):
• Case studies in language planning and language policy;
• English education in non-English speaking countries;
• Language policy and conflict resolution;
• Language policy and refugee/transnational migration situations;
• Bilingual and multilingual education policies and practice;
• Language policy and ethno-linguistic minority languages;
• Translation policies and practices on the international, national, and local levels;
• Corpus planning for national languages;
• Language policy and the UN Millennium Development Goals;
• Language policy and socio-economic theory (with special emphasis on the Self-Sufficiency Economic Theory of His Majesty King Bhumibol of Thailand).

Deadline for electronic submission of abstracts: March 15, 2008
Notification of acceptance: April 15, 2008
Deadline for submission of papers for publication in proceedings: August 31, 2008

Registration fees:
• Participant (Thai resident): Baht 2,000/person
• Student (Thai resident): Baht 1,000/person
• Non-Thai resident:
  - before April 30, 2008: US$150/person
  - after April 30, 2008: US$175/person
Accommodations: TBA

This was originally scheduled for February. I submitted an abstract back in December, and I have unofficial word that my talk has been accepted. Let's hope I don't get bumped by some bigger fish in the meantime. Here is my abstract, for anyone interested:

Thai in Transition, and the Thai Gigaword/Terabyte Web Corpus

The Thai language has always been in transition, the result of both external influence and internal innovation. Over the past 1,000 years, Chinese, Khmer, Mon, Sanskrit, Pali, and English have all left substantial marks on the language. Thai has changed internally as well, with new patterns of nominalization, usage of passive forms, and even new types of classifiers all appearing in the last century or two. These changes are sometimes prescriptive, as in the ‘linguistic’ edicts promulgated by King Mongkut and in the work of the Royal Institute’s word-coining committees. But they are also entirely spontaneous, created by the tongues and hands of more than 60 million Thai speakers and writers.

It is only within the past few years, however, that we have been able to observe the transition of Thai actively via the Internet. We estimate that by the end of 2007, roughly 100 million Thai-language Web pages had been indexed by the major search engines (based on counts of the extremely common word ที่). A conservative estimate of 1,000 Thai characters per page implies that this Web corpus includes some 100 billion characters, or more than 25 billion words. Thus, we can already consult a gigaword corpus 25 times over, and can anticipate access to a terabyte corpus within a few years.
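The back-of-the-envelope arithmetic above can be sketched in a few lines. Note that the page count and characters-per-page figures are the estimates quoted in the abstract, and the characters-per-word figure is an assumed rough average for Thai, not a measured value:

```python
# Rough estimate of the Thai Web corpus size, using the figures quoted above.
pages = 100_000_000        # Thai-language pages indexed by major search engines (estimate)
chars_per_page = 1_000     # conservative average of Thai characters per page (estimate)
chars_per_word = 4         # assumed rough average length of a Thai word in characters

total_chars = pages * chars_per_page          # 100 billion characters
total_words = total_chars // chars_per_word   # ~25 billion words

print(f"{total_chars:,} characters, ~{total_words:,} words")
```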

This paper discusses the Thai Web Corpus, part of the SEAlang Library’s suite of tools and lexical resources for Southeast Asian language study. We describe its design and operation, explain how it differs from other Thai corpus projects, and show how it is used to explore two new works by the Royal Institute: the Dictionary of New Words (2007), and Foreign Words that Can be Replaced with Thai Words (2006).

[Thanks to Jason for reminding me about this.]


  1. I'm a little confused. By "Thai Web Corpus" do you mean a corpus SEAlang is building by indexing webpages (à la Google)? Or do you mean using Google itself?

    I think a large Thai corpus made from the web would be very useful for computational linguistics research, as using Google itself seems pretty limited. For instance, using it to find classifiers for nouns, as I mentioned in a previous post, is pretty much a non-starter, since it doesn't give you the search flexibility needed (i.e. regular expressions).

    Maybe I should go to the talk to find out more about what's going on! In general, this 'web corpus' area is one I'm very interested in too; I've got a corpus made from the text of 150,000 webpages (growing every day as the indexer churns away!). I'm using it for work on t2e, and it seems to work really well for identifying new words (over 10,000 so far) and for generating millions of n-grams. I haven't done an average 'Thai characters per page' count on it yet, but from a brief look now I would say the average must be at least 3 or 4 thousand per page, so there's certainly a fair amount there.
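Generating n-grams from a corpus like the one described above is straightforward once the text has been tokenized (which for Thai, with no spaces between words, is itself the hard part). A minimal sketch, using English tokens standing in for segmented Thai words:

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-token tuples from a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

# Toy example: in practice the tokens would come from a Thai word segmenter.
tokens = "the cat sat on the mat".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts[("the", "cat")])  # 1
```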

    I'm interested in more details on the SEAlang project if you can post them. If we've got the same kind of goals then perhaps we could find a way to co-operate on it.


    Oh, by the way, I forgot to comment on the previous post to say congratulations on the baby! Hope you're managing to get some sleep by now ;)

  2. For those who might be interested in dictionaries I highly recommend two books by Simon Winchester regarding the Oxford English Dictionary:

    "The Meaning of Everything: The Story of the Oxford English Dictionary", and

    "The Professor and the Madman: A Tale of Murder, Insanity, and the Making of The Oxford English Dictionary".

    These are very readable books, and today's reader, steeped in the use of computers and databases, can hardly appreciate the difficulty of compiling such a voluminous work as the Oxford English Dictionary.

  3. @Mike: It's a web-as-corpus tool: an interface for using either Google or Yahoo to analyze the use of a certain word or phrase on the Internet. It has a few tricks beyond a basic Internet search. Plus, you can't take the raw number of hits Google gives you literally, so it provides some other tools for analysis (short of doing what you're doing, i.e. actually churning through the web and indexing everything).

    I'm not the one who actually created the tool; that would be Doug Cooper. I'm not a programmer. I'm a language guy with a strong interest in how we can use computers to enhance our study of language.

    (And thank you for the congrats. I manage to get an hour of sleep here and there.)

    @Bui: I've read the second of those, and it's a great read. On the general topic of interesting books about dictionaries, I'd also recommend Caught in the Web of Words: James Murray and the Oxford English Dictionary by James' granddaughter K.M. Elisabeth Murray, and The Story of Webster's Third: Philip Gove's Controversial Dictionary and its Critics by Herbert C. Morton.

  4. We met there. Your idea was great. Do you know whether the proceedings have been published or will be published? jeanpacquement@hotmail.com

  5. Jean, I've just sent you an email.