July 6, 2008

Slides from my conference talk

My talk yesterday, titled Thai in Transition, and the Thai Gigaword/Terabyte Web Corpus, went well and was well attended by those I wanted to hear it. For those just tuning in, the conference was National Language Policy: Language Diversity for National Unity, hosted by the Royal Institute and several other organizations in Bangkok this past Friday and Saturday.

I've posted a PDF of my PowerPoint slides here.

See my previous posts here and here (the talk abstract is at that last link).


  1. Rikker,

    Here is a sample from the "Corpus".

    (2 / 50%) ตามสัญชาตญาณ_
    อาชีพ จึงหยุดรถคว้ากล้องลงไปดู ตามสัญชาตญาณ_ > ไม่เพียงแต่คนขับสามล้อเท่านั้
    ว์นี่ไม่ได้เป็นวัฒนธรรม เขากิน ตามสัญชาตญาณ_ > ไม่มีว่านกจับหนอนแล้วไปโยนใส่

    How do I read this? Thanks for all your very hard work in this project.

  2. This example is from the "fixed" corpus, not the web-based corpus, by the way. That's why you're getting so few results. There are a number of different sub-corpora you can select from at the bottom of the left-hand control panel, the default being a corpus of news stories.

    So anyway, this shows the collocates (aka the "word neighbors"), that is, the other words that commonly appear with the phrase you searched. A leading collocate is a word that comes before your phrase, and a trailing collocate is one that comes after.

    Here the underscore _ means a space--so in this case, the "collocate" is really just a space, which is to say that in 2 instances, or 50% of the time (like I said, small corpus, which is why 2 hits is 50%), a space immediately follows the phrase you searched.

    Next it gives you the instances it found. It pulls out a predefined character-size sample, which you can control with the "context size" box on the left. It doesn't check to see where words/sentences begin or end, but just pulls out the word and collocates in their raw context.

    In this case it looks like it amounts to about 30 characters of text on either side of your search phrase.

    As with pretty much anything on SEAlang, there's way more functionality than there is polish.

    In the near future there will be screencasts on the site demonstrating how to use some of the basic tools and advanced features.

  3. Thanks for posting this and all your effort on the blog.
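
For anyone who wants to see the idea behind the output discussed in the second comment, here is a rough sketch in Python. It is not SEAlang's code, and the function names (`concordance`, `trailing_collocates`) and the `context_size` parameter are made up for illustration. It also assumes the trailing "collocate" is simply the single character right after the phrase, whereas the real tool presumably works with words. It shows the two behaviors described above: a fixed number of characters is cut out on either side of each hit with no regard for word boundaries, and a space is displayed as an underscore.

```python
from collections import Counter


def concordance(text, phrase, context_size=30):
    """Return (left, right) raw character contexts for each hit of phrase.

    The window is cut purely by character count, so it may start or end
    in the middle of a word.
    """
    hits = []
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        left = text[max(0, start - context_size):start]
        right = text[end:end + context_size]
        hits.append((left, right))
        start = text.find(phrase, end)
    return hits


def trailing_collocates(hits):
    """Count what immediately follows the phrase in each hit.

    Assumption for this sketch: the collocate is one character.
    A space is shown as "_", and a hit at the very end of the text
    is shown as "(end)".
    Returns a list of (collocate, count, percent) tuples.
    """
    counts = Counter()
    for _left, right in hits:
        token = right[:1]
        if token == " ":
            token = "_"
        elif token == "":
            token = "(end)"
        counts[token] += 1
    total = sum(counts.values())
    return [
        (token, count, round(100 * count / total))
        for token, count in counts.most_common()
    ]


if __name__ == "__main__":
    sample = "he stopped by instinct and she ate by instinct."
    found = concordance(sample, "by instinct", context_size=5)
    for token, count, percent in trailing_collocates(found):
        print(f"({count} / {percent}%) by instinct{token}")
    for left, right in found:
        print(f"{left}by instinct > {right}")
```

With only two hits, each distinct follower accounts for 50% of the results, which is the same small-corpus effect described in the reply above.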