January 8, 2014

Project Gutenberg Thailand: 2014 Update

[Read my old posts on Project Gutenberg Thailand here, here, and here.]

After lying dormant for far too long, Project Gutenberg Thailand rides again! The following public domain books about Thailand, digitized by PGT volunteers, are now live on the main Project Gutenberg site:
You can download them in a variety of formats, including Epub and Kindle. A few more books have been fully proofread and are going through post-processing. They will be online soon. Several others are still in progress in Unbindery, the software we use to crowdsource the proofreading. And hundreds more are just waiting to be digitized.

If you have 5 minutes to spare, and are interested in helping digitize more ebooks (in English or Thai), you can sign up at http://gutenbergthai.org. You only take on one page at a time, so you can spend as much or as little time as you want doing it. Guidance on proofing can be found on our mostly defunct Google Group. (Note: There is no main website for PGT at the moment, but Unbindery is still functional, generously hosted by Ben Crowder.)

In addition to many more books, I still plan to get a proper website made for PGT at some point. If you have any skills in that area and are interested in helping, drop me a line at gutenbergthai [at] gmail [dot] com.

April 2, 2011

Project Gutenberg Thailand: Beta Testers Wanted

[Read the PGT 2014 Update!]

[Update: Signup instructions and basic proofreading guidelines can now be found here. A big thank you to everyone who has shown interest in PGT.]

I'm looking for a few beta testers to help me work out kinks in a proofreading web application for Project Gutenberg Thailand.

I wrote a few months ago (see part 1 and part 2) about my long-held goal to set up a website dedicated to digitizing and disseminating public domain Thai ebooks. This new site would be called Project Gutenberg Thailand (PGT), an independent sister site to Project Gutenberg. The focus of PGT would be both books in Thai and books about Thailand whose copyrights have expired.

At the time I didn't have a web developer to help me get the ball rolling. Shortly after making those posts, my friend Ben Crowder (who has actually helped me since the earliest planning stages of the project back in 2007) stepped up to the plate and created a web application for proofreading or typing old books. The goal was to keep it simple.

Right now I need two types of beta testers:

1) Thai typists (native or advanced Thai reading/typing skill).

[EDIT: I've now uploaded PGT's first book for Thai OCR proofreading, using a newer printing of the epic poem ขุนช้างขุนแผน Khun Chang Khun Phaen. Volunteers working on Thai can now choose between a typing project or a proofreading project (or do both).]

2) English proofreaders (native or advanced English reading skill).

Due to the limits of Thai OCR, the text of older Thai books needs to be typed by hand. OCR for English is quite good, however, and needs only to be proofread and corrected against the original page image.

You do not need a large amount of free time. That's the beauty of distributed typing/proofreading. You do a page here, a page there, as time allows. So if you have some free time and are interested in supporting this project, contact me by email (rdockum at gmail) or on Twitter (@thai101). As a beta tester, you'll be expected to report bugs if you encounter them, suggest features you wish to see added, and so forth. There is no commitment; it's strictly voluntary. You know, sabai-sabai.

November 12, 2010

Better expat living through technology

I don't discuss expat life in Thailand much on this blog, probably because there are others who discuss it more insightfully and entertainingly, like my pal Greg (of Bangkok Podcast and Greg to Differ fame).

Lately, though, I've been thinking about the subject. Even when you speak the language, as I do, and get used to (most of) the quirks of your adoptive culture, life in a foreign country can still be difficult.

I've always liked technology, but I've never been an early adopter, nor had the budget for many gadgets. I bought my first iPod in 2007, and my first smart phone, an iPhone 4, not even two months ago. I also bought a Kindle 3 around the same time.

These new additions to my growing gadget collection made me stop and consider how technology has really improved my quality of life as an expat in small but remarkable ways.

Here are a few of the ways that technology improves my life in the big, foreign city:

The daily grind: I have a 30-40 minute commute in the morning, and often longer at night. The ease of the iPod for loading podcasts and audiobooks really makes that time fly. Before I owned an iPod I used to download mp3 files for podcasts and burn them to disc to play in my car stereo. I know, right?

Living outside the US also means, if the entertainment cartels have their way at least, missing out on excellent services like Pandora, Rdio, and Last.fm Radio (that last one you can get here, but it's not free like in the US). Thanks to VPNs, I can tunnel through to use these services when I really want to. I even use a VPN service on my iPhone, so I can use Pandora or Rdio on there. But it's still kind of a hassle, and the VPN connection cuts out sometimes. [Dear Thai government: I was just joking about using VPNs. Everybody knows those are illegal here and of course I would never, ever actually use one. *twitch*]

Literally two days ago, though, I found another service that solves my longstanding problem of having a huge music library but no desire to take the time to divide it into playlists or swap music in and out of my iPod/iPhone. The service is Audiogalaxy, and what it does is simple: it streams your music from your computer to your iOS (or Android) device. As of yesterday, I now have my full music library (100GB+) at my fingertips anywhere in Bangkok. This blows my mind. It's a game changer for me, and pretty much the definition of a killer app.

Social life: Sometimes I feel like I'm simultaneously a misanthrope and a social butterfly. Despite being married with two kids, I enjoy my alone time. But I do miss hanging out with friends from back home. Bangkok is my wife's hometown, so her high school, college, work friends -- they're mostly all still in Bangkok. It's an enviable situation that even people in the US don't really enjoy. One of the perks of a one-big-city country. Bangkok is the center of the Thai universe, and a sort of black hole that sucks everyone to it, to boot.

How does this relate to technology? One word: Twitter. Before Twitter I was much more of a solitary expat. I had friends but I didn't see or really even communicate with them that much. I've never lived in downtown Bangkok, and I've never frequented the Bangkok social scene. Never really been my style. This has started to change ever-so-slightly, though. Thanks to Twitter I've met quite a lot of great people who I don't think I would've met otherwise, and some have become my good friends. (There are still a few positions available--just fill out this form and submit an 8x10 glossy headshot if you want to apply.)

Keeping in touch: Obviously email, blogging, Facebook and all of that help me keep in touch with my family and friends from back home. But those are old hat. It's Skype that has really changed the way I communicate. Nowadays I use Skype-In to rent a local US number in my hometown, and use a little USB connector to hook my Skype up to an actual phone in my house in Bangkok. Now I have a number that anyone in the US can use to call me at home, and when I'm not there Skype forwards the call to my cell phone. It's pretty incredible, and it costs me single-digit dollars per month.

Not only Skype, though, but just having a smart phone makes it easier to keep in regular touch. A few weeks ago my daughter was singing to herself and made up a cute song about her younger brother. I busted out my iPhone, opened the voice recorder app, and recorded it on the spot. From there, I edited out a few seconds on either side, and with another click or two I had emailed the clip to Grandma back in the US. It's not that I couldn't do any of this stuff before, but I just didn't because I'm lazy and it wasn't simple enough.

I'm also surprised by how much I dig video chat on my shiny new iPhone. My bestest friend since I was a dork-tastic 8-year-old recently got an iPhone 4, too, and while he and I would occasionally chat or call each other before, it's a totally different beast to be able to have a face-to-face conversation with him anywhere I go (that has wifi). It's been pretty great, and helps to quell the occasional uprising of mild homesickness. I hope Apple opens up the protocol to other devices. (In the meantime Tango offers cross-platform video chat, though.)

To sum up, none of this stuff I can do now is bleeding edge tech. It's just the convergence of many cool technologies that makes life better, and makes Bangkok seem a lot less far away from home.

November 6, 2010

Project Gutenberg Thailand: Some nitty gritty (and a sample Thai ebook)

[Read the PGT 2014 Update!]

In my previous post on Project Gutenberg Thailand, I avoided getting too much into the technical details. Here are more of my thoughts on the steps of the book digitization process.

Scanning:

I have many public domain Thai books in my personal collection, including works by the authors I mentioned in my previous post, as well as numerous literary works by Rama VI, books by Prince Damrong, and so forth.

Many older books have had recent printings, and so they are neither rare nor expensive. In such cases my preferred method for scanning is to deconstruct the binding of the book and scan each page using a flatbed scanner. This produces the best quality image, with no shadow or distortion.

This is not always possible, of course, if the book is old or rare, or doesn't belong to me. For such cases I use an OpticBook scanner, which is a special book scanner that allows you to scan one page at a time without destroying the book or breaking its spine. It looks like this, by the way:


In my free time I've scanned several public domain Thai books, and have many more ready and waiting to be scanned.

Photocopying books, say from the library, also works, but scanning is preferable because OCR software works best with grayscale images of 300 dpi resolution or higher. A photocopier is black and white, and ultimately you have to scan the photocopies into a computer anyway. The main benefit of photocopying is that Thai libraries offer cheap photocopy services, so you can have a pro do the heavy lifting for you, so to speak.

In addition, there are excellent resources like the Thammasat Electronic Rare Books site, which has black and white PDF files of many old books. Not all of these are actually public domain, but sites like this provide another avenue for public domain source material that can be digitized as text.

OCR software:

To date the best OCR software for Thai is ABBYY FineReader Professional, either version 9 or 10. In fact, ABBYY is the only one that's any good. I have used it extensively.

A program called ArnThai was released a few years ago by Thailand's National Electronics and Computer Technology Center (NECTEC), but unfortunately ArnThai is rather terrible. Not only is the quality of the OCR poor, but it also has no batch processing of any kind, supports only limited input and output filetypes, and has no mechanism for training characters at all. ABBYY FineReader, while not perfect, has sophisticated tools for all of these things.

With ABBYY, OCR accuracy for Thai is well above 90%, but it still has problems. ABBYY has trouble with older typefaces, for instance. And even on newer books it still has some trouble accurately detecting all superscript and subscript characters, or differentiating very similar characters. This is par for the course. Since it has a training mechanism, though, a human can teach the software how to properly recognize difficult typefaces when needed, which greatly improves the quality.

Proofreading:

The basic mechanism for crowdsourced proofreading is to show the user the original page image side-by-side with the text output from the OCR software. (Or, if the text was manually typed, with that.) Here's an example taken from pgdp.net:


The user makes corrections and submits the page when it is completed. The process that pgdp.net uses is sometimes overly complicated. Every page is checked about a dozen times, each time focusing on different things (basic text accuracy, formatting, etc.). They have the luxury of plentiful volunteers.

To start with, at least, Project Gutenberg Thailand will not have this luxury. The method I propose is like this:

Each page is proofread two times, once each by two different human proofreaders. They make their corrections and submit the page. Ideally, the two versions would be identical, but there will of course be some errors. To identify the errors, the two versions are then compared programmatically, to find discrepancies between them. Any place where they differ can be assumed to be an error made by one proofreader or the other. The points of discrepancy are highlighted and shown to a third person, who can quickly correct only the highlighted points, thereby fixing mistakes made by the original two proofreaders. At that point the error rate on the page will be extremely low.
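For the programmatically inclined, here is a minimal sketch of that comparison step in Python, using the standard library's difflib module. This is only an illustration of the idea, not the code PGT actually runs; the function names and the [[A|B]] highlighting format are made up for this example. It compares character by character rather than word by word, since Thai has no spaces between words.

```python
import difflib


def find_discrepancies(version_a: str, version_b: str) -> list[tuple[int, int, str, str]]:
    """Return (start, end, text_a, text_b) for every span where the versions differ.

    start and end are character offsets into version_a.
    """
    # autojunk=False switches off a speed heuristic that can skew results
    # on long texts such as a full page.
    matcher = difflib.SequenceMatcher(None, version_a, version_b, autojunk=False)
    return [
        (a1, a2, version_a[a1:a2], version_b[b1:b2])
        for tag, a1, a2, b1, b2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def highlight(version_a: str, version_b: str) -> str:
    """Render the page with each discrepancy marked as [[A's text|B's text]]."""
    matcher = difflib.SequenceMatcher(None, version_a, version_b, autojunk=False)
    pieces = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(version_a[a1:a2])
        else:
            # Hypothetical markup for the third reader to resolve.
            pieces.append(f"[[{version_a[a1:a2]}|{version_b[b1:b2]}]]")
    return "".join(pieces)


print(highlight("The quick brown fox", "The quick brovvn fox"))
# The quick bro[[w|vv]]n fox
```

The third person would then see only the bracketed spots and pick the right reading against the page image. Note that this catches only the places where the two proofreaders disagree; if both make the same mistake, it slips through.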

Distribution:

Once books are turned into digital text, they can be formatted into the various ebook formats for reading on computer, cell phone, or ebook reader, as well as being available on the website as text and HTML files. I believe public domain works should be distributed far and wide, of course, so I would be pleased to see any output of Project Gutenberg Thailand also posted to Thai Wikisource, as well as any other website that wished to host it.

As a test, I took the electronic text of the short book Letters from Jangwangram จดหมายจางวางหร่ำ, a 1905 epistolary novel, and created an EPUB file. I then transferred it onto my iPhone and loaded it into Stanza, a free ebook reading app. I'm happy to say it is quite readable, even though the lack of word breaks in Thai creates some odd spacing. I also loaded it onto my Kindle 3, and though readable, the default font was not ideal, and there were similar spacing issues.

You can download this EPUB on your own devices if you'd like to test it out:

จดหมายจางวางหร่ำ
โดย น.ม.ส.
Letters from Jangwangram
by N.M.S. (pen name of Prince Bidyalankarana)
jotmai-jangwangram.epub

(Note: This ebook is not in Unicode. It uses HTML entities because some ebook formats are not Unicode compatible yet. I'm still learning the particulars of the several popular ebook formats.)
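If you're curious what that looks like in practice, here is a rough sketch in Python of turning Thai text into numeric HTML entities and back. I'm assuming numeric character references (like &#3592;) here, and the function names are just for illustration; this isn't necessarily how the file above was produced.

```python
import html


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric HTML entity.

    Plain ASCII text passes through unchanged.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_entities(text: str) -> str:
    """Turn HTML entities back into ordinary Unicode characters."""
    return html.unescape(text)


title = "จดหมายจางวางหร่ำ"
encoded = to_entities(title)
print(encoded[:14])  # &#3592;&#3604;
print(from_entities(encoded) == title)  # True
```

The encoded file is pure ASCII, which is why it survives on readers that choke on raw Unicode, at the cost of being several times larger and unreadable in a plain text editor.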

Project Gutenberg Thailand: Liberating public domain Thai literature

[Read the PGT 2014 Update!]

For a few years now it's been one of my goals--it's been so long I should probably say 'dreams'--to start Project Gutenberg Thailand, a repository for public domain literature in Thai and about Thailand. The founder of the original Project Gutenberg, Michael Hart, encourages such spin-off sites, and was enthusiastic when I contacted him about the idea back in 2007.

The closest thing that currently exists is Thai Wikisource, but it has little in the way of modern literature, instead having mostly selections of classical Thai verse and public domain government documents. As far as I know, there is no existing movement to identify and disseminate more recent public domain Thai works.

Owing to a number of reasons, however, not the least of which being my own lack of sufficient free time, nothing has ever gotten off the ground.

Last year, frustrated with my inability to make any headway on this project, I began compiling a list of Thai authors whose works are in the public domain. To put it simply, under Thai law a book is copyrighted until 50 years after the author's death.

With such a relatively short copyright term, the works of many well-known 20th century authors have entered the public domain. Unfortunately it too often seems to be those authors who died young. Yakob ยาขอบ, author of the immortal Conqueror of the Ten Directions ผู้ชนะสิบทิศ, died in 1956 at age 48. And two early novelists born in 1905 failed to reach middle age -- Prince Akartdamkoeng มจ.เจ้าอากาศดำเกิง penned such well-remembered tomes as The Circus of Life ละครแห่งชีวิต before killing himself at 25; while Mai Mueangdoem ไม้ เมืองเดิม, of The Old Wound แผลเก่า fame, met his fate at 37. Such authors are no less significant for modern Thai culture than the Fitzgeralds and Steinbecks of American culture, and while they can still be found in print, it is a shame that such works aren't yet available as free ebooks.
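The arithmetic behind that list is simple enough to sketch in a few lines of Python. This uses only the simplified rule stated above (death year plus 50); the function names are hypothetical, and exactly when the term runs out within the final year is an assumption on my part, so treat the boundary year with caution.

```python
def public_domain_year(death_year: int, term_years: int = 50) -> int:
    """Year in which the copyright term ends, under the simplified rule."""
    return death_year + term_years


def is_public_domain(death_year: int, current_year: int, term_years: int = 50) -> bool:
    """True once the term has fully elapsed.

    Assumption: the boundary year itself is treated as still protected,
    to stay on the conservative side.
    """
    return current_year > public_domain_year(death_year, term_years)


# Yakob died in 1956, so his term ran out around 2006.
print(public_domain_year(1956))      # 2006
print(is_public_domain(1956, 2010))  # True
```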

I hear you asking, "So what needs to be done to set these works free?" (I have excellent hearing.) "And how can I help?"

Well, we need to put in place the process for taking the paper books and turning them into electronic text. The basic steps are as follows:

1. Get the book and scan each page, to create images.
2a. Use OCR [optical character recognition] software to turn the images into digital text.
OR
2b. Have humans type out the text contained in the images.
3. Have humans proofread the digital text by comparing it side-by-side with the original image.

This process is well-established for English. Project Gutenberg has a sister website called Distributed Proofreaders (pgdp.net), which crowdsources this work for books in Latin-alphabet languages. OCR for English is extremely accurate. Until 2009, OCR for Thai was miserably poor, but nowadays it's rather good. In other words, the time is finally right to start digitizing books for Project Gutenberg Thailand.

With the original Project Gutenberg, anyone can sign up to help via the Distributed Proofreaders site. For Project Gutenberg Thailand, we must build a similar community of people willing to contribute a little bit of time here and there to help liberate public domain Thai books from their paper prisons.

So here I am, writing this blog post, hoping to drum up help to get the ball rolling.

The most immediate need is a sympathetic soul with web programming chops to help create the website for crowdsourced Thai proofreading and/or typing. I've put a fair amount of thought into the core features needed, but I lack the programming and design skills needed to make it a reality.

If you are interested in helping with this effort, come join the Google Group, let me know on Twitter at @thai101, or email me at rdockum [at] gmail [dot] com.

June 8, 2010

More podcast appearances

I've been remiss in posting to the blog again lately. The blame lies partially on the trip I took with my family in April to the US, but also on the recent political turmoil, which I got wholly caught up in following.

Recently I've been a guest on a couple of podcast episodes, though. Changkhui in English did a special episode about life during and after the riots for average, non-protesting residents like ourselves. I dragged Greg of Greg to Differ along with me in our (or at least my) video podcast debut.

And if you haven't heard of it yet, check out Greg's new show Bangkok Podcast, which he co-hosts with Tony Joh of thai-faq.com. I recorded multiple segments for future shows, in which we discuss the Thai language and how learning it can change your experience in Thailand. Those will air once per month; the first one went live today.

For iTunes users, check out the shows with these links:
Changkhui in English -- Bangkok Podcast

Good old-fashioned mp3 links:
Changkhui in English 10 -- Bangkok Podcast 4

June 4, 2010

Pibulsonggram's Cultural Mandates

I was somewhat surprised to find that little has been written in English on the web about the so-called Cultural Mandates, also known as the State Decrees (รัฐนิยม ratthaniyom), issued by Field Marshal Plaek Pibulsonggram between 1939 and 1942, during his first term as prime minister. So I figured one way to remedy this would be to start an English Wikipedia article.

These Cultural Mandates were a series of 12 edicts the government issued, utterly changing the face of the country. Though not everything stuck, I think it is hard to overestimate how much influence the mandates had on the development of modern Thailand. They were a remarkable--if questionable--feat of social engineering, though only part of the larger social engineering schemes of Chom Phon Po (จอมพล ป., or "Field Marshal P.").

Among the cultural reforms enacted by the Field Marshal:
  • Changing the name of the country to Thailand, and temporarily eradicating the word 'Siam', including from the royal anthem, traditional titles, and even the names of businesses and organizations.
  • Declaring a new Thai national anthem, still in use today.
  • Playing the national anthem at 8:00 am and 6:00 pm every day.
  • Playing the royal anthem (which used to be the national anthem) before all theatrical shows and requiring patrons to stand during it.
  • Establishing Thai as the national language and forcing non-Thai ethnic groups to learn it.

When you read the mandates themselves the language is often weak -- Thais "should" do X or Y. But in reality these state mandates were backed up with negative incentives or threat of force. It was during these years of nation-building that the Thai state also outlawed Chinese instruction in schools, assessed extra taxes on Chinese-owned businesses, and founded state enterprises designed to compete with and run foreign businesses into the ground.

For example, the Thailand Tobacco Monopoly was founded in 1939 when the state began seizing private tobacco factories. Just months later Mandate 5 was issued, exhorting Thais to use only Thai products and government-run services.

So while it is easy today to see only the positive benefits of many of these mandates, or forget that the mandates ever happened, the actual history is much more complex. Many of Pibulsonggram's mandates, including his simplified Thai alphabet, were scrapped as soon as he was forced to resign (the first time) in 1944.

You can find the original mandates in Thai on the Royal Gazette website by searching the term รัฐนิยม. (The website works best in IE, sadly.) They are also linked directly in the references section of the Wikipedia article.

March 4, 2010

Thai 101 Learner's Series Rides Again

It's been a while since Women Learning Thai finished re-serializing my Thai 101 Learner's Series, which first ran as a biweekly column in the Phuket Gazette during 2008.

The ever-patient Catherine of WLT managed to coax another installment from me, which went live last week. Have a look.