[Read the PGT 2014 Update!]
In my previous post on Project Gutenberg Thailand, I avoided getting too much into the technical details. Here are more of my thoughts on the steps of the book digitization process.
I have many public domain Thai books in my personal collection, including works by the authors I mentioned in my previous post, as well as numerous literary works by Rama VI, books by Prince Damrong, and so forth.
Many older books have had recent printings, and so they are neither rare nor expensive. In such cases my preferred method for scanning is to deconstruct the binding of the book and scan each page using a flatbed scanner. This produces the best-quality images, with no shadow or distortion.
This is not always possible, of course, if the book is old or rare, or doesn't belong to me. For such cases I use an OpticBook scanner, which is a special book scanner that allows you to scan one page at a time without destroying the book or breaking its spine. It looks like this, by the way:
In my free time I've scanned several public domain Thai books, and have many more ready and waiting to be scanned.
Photocopying books, say from the library, also works, but scanning directly is preferable because OCR software works best with grayscale images at a resolution of 300 dpi or higher. A photocopier produces only black and white output, and ultimately you have to scan the photocopies into a computer anyway. The main benefit of photocopying is that Thai libraries offer cheap photocopy services, so you can have a pro do the heavy lifting for you, so to speak.
In addition, there are excellent resources like the Thammasat Electronic Rare Books site, which has black and white PDF files of many old books. Not all of these are actually public domain, but sites like this provide another avenue for public domain source material that can be digitized as text.
To date the best OCR software for Thai is ABBYY FineReader Professional, either version 9 or 10. In fact, ABBYY is the only one that's any good. I have used it extensively.
A program called ArnThai was released a few years ago by Thailand's National Electronics and Computer Technology Center (NECTEC), but unfortunately ArnThai is rather terrible. Not only is the quality of the OCR poor, but it also has no batch processing of any kind, supports only a limited set of input and output file types, and has no mechanism for training characters at all. ABBYY FineReader, while not perfect, has sophisticated tools for all of these things.
OCR accuracy for Thai is well above 90%, but it still has problems. ABBYY has trouble with older typefaces, for instance. And even on newer books it still has some trouble accurately detecting all superscript and subscript characters, or differentiating very similar characters. This is par for the course. Since it has a training mechanism, though, a human can teach the software how to properly recognize difficult typefaces when needed, which greatly improves the quality.
The basic mechanism for crowdsourced proofreading is to show the user the original page image side-by-side with the text output from the OCR software. (Or, if the text was manually typed, with that.) Here's an example taken from pgdp.net:
The user makes corrections and submits the page when it is complete. The process that pgdp.net uses is sometimes overly complicated. Every page is checked about a dozen times, each time focusing on different things (basic text accuracy, formatting, etc.). They have the luxury of plentiful volunteers.
To start with, at least, Project Gutenberg Thailand will not have this luxury. The method I propose is like this:
Each page is proofread two times, once each by two different human proofreaders. They make their corrections and submit the page. Ideally, the two versions would be identical, but there will of course be some errors. To identify the errors, the two versions are then compared programmatically, to find discrepancies between them. Any place where they differ can be assumed to be an error made by one proofreader or the other. The points of discrepancy are highlighted and shown to a third person, who can quickly correct only the highlighted points, thereby fixing mistakes made by the original two proofreaders. At that point the error rate on the page will be extremely low.
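As a rough illustration of the comparison step, here is a minimal sketch in Python. It assumes the two proofread versions of a page are available as plain Unicode strings. The function names and the [[A|B]] marker style are my own inventions for illustration, not part of any existing Project Gutenberg tooling; difflib is the diffing module in Python's standard library.

```python
import difflib


def find_discrepancies(version_a, version_b):
    """Return (start, end, text_a, text_b) for each span where the versions differ.

    start and end are character offsets into version_a.
    """
    # autojunk=False turns off a heuristic that can ignore frequent
    # characters in long texts, which we do not want when proofreading.
    matcher = difflib.SequenceMatcher(None, version_a, version_b, autojunk=False)
    discrepancies = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            discrepancies.append((a1, a2, version_a[a1:a2], version_b[b1:b2]))
    return discrepancies


def highlight_discrepancies(version_a, version_b):
    """Merge both versions into one string, marking each difference as [[A|B]]."""
    matcher = difflib.SequenceMatcher(None, version_a, version_b, autojunk=False)
    pieces = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(version_a[a1:a2])
        else:
            # Hypothetical marker format; a real interface would more
            # likely colour the span in the browser.
            pieces.append("[[" + version_a[a1:a2] + "|" + version_b[b1:b2] + "]]")
    return "".join(pieces)


if __name__ == "__main__":
    first = "the cat sat"
    second = "the cot sat"
    print(find_discrepancies(first, second))
    print(highlight_discrepancies(first, second))
```

The third proofreader would then only need to look at the marked spans. One caveat: this compares code point by code point, and Thai vowels and tone marks are separate code points from their base consonants, so a highlighted span may begin or end in the middle of a visual character cluster. Also, a place where both proofreaders made the same mistake will not show up as a discrepancy at all.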
Once books are turned into digital text, they can be formatted into the various ebook formats for reading on computer, cell phone, or ebook reader, as well as being available on the website as text and HTML files. I believe public domain works should be distributed far and wide, of course, so I would be pleased to see any output of Project Gutenberg Thailand also posted to Thai Wikisource, as well as any other website that wished to host it.
As a test, I took the electronic text of the short book Letters from Jangwangram (จดหมายจางวางหร่ำ), a 1905 epistolary novel, and created an EPUB file. I then transferred it onto my iPhone and loaded it into Stanza, a free ebook reading app. I'm happy to say it is quite readable, even though the lack of word breaks in Thai creates some odd spacing. I also loaded it onto my Kindle 3, and though readable, the default font was not ideal, and there were similar spacing issues.
You can download this EPUB to your own devices if you'd like to test it out:
Letters from Jangwangram
by N.M.S. (pen name of Prince Bidyalankarana)
(Note: This ebook is not in Unicode. It uses HTML entities because some ebook formats are not Unicode compatible yet. I'm still learning the particulars of the several popular ebook formats.)
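For anyone curious, converting Unicode Thai text into numeric HTML entities takes very little code. The following is a sketch of one way to do it in Python, not necessarily how the EPUB above was produced; the function name is hypothetical. It relies on the standard "xmlcharrefreplace" error handler, which replaces every character that cannot be encoded as ASCII with a decimal character reference.

```python
import html


def to_html_entities(text):
    """Replace every non-ASCII character with a numeric HTML entity.

    ASCII characters, including &, < and >, pass through unchanged,
    so any existing markup in the text is left as it is.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_html_entities("ก"))  # prints &#3585; (U+0E01, THAI CHARACTER KO KAI)

    title = "จดหมายจางวางหร่ำ"
    encoded = to_html_entities(title)
    # html.unescape reverses the conversion, so nothing is lost.
    print(html.unescape(encoded) == title)  # prints True
```

The output is plain ASCII, so it should survive any reader or conversion tool that mishandles Unicode, at the cost of a much larger and less readable source file.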