November 6, 2010

Project Gutenberg Thailand: Some nitty gritty (and a sample Thai ebook)

[Read the PGT 2014 Update!]

In my previous post on Project Gutenberg Thailand, I avoided getting too much into the technical details. Here are more of my thoughts on the steps of the book digitization process.

Scanning:

I have many public domain Thai books in my personal collection, including works by the authors I mentioned in my previous post, as well as numerous literary works by Rama VI, books by Prince Damrong, and so forth.

Many older books have had recent printings, and so they are neither rare nor expensive. In such cases my preferred method for scanning is to take the binding apart and scan each page on a flatbed scanner. This produces the best-quality image, with no shadow or distortion.

This is not always possible, of course, if the book is old or rare, or doesn't belong to me. For such cases I use an OpticBook scanner, a special book scanner that lets you scan one page at a time without destroying the book or breaking its spine. It looks like this, by the way:

[Image: OpticBook scanner]

In my free time I've scanned several public domain Thai books, and have many more ready and waiting to be scanned.

Photocopying books, say from the library, also works, but scanning is preferable because OCR software works best with grayscale images of 300 dpi resolution or higher. A photocopier is black and white, and ultimately you have to scan the photocopies into a computer anyway. The main benefit of photocopying is that Thai libraries offer cheap photocopy services, so you can have a pro do the heavy lifting for you, so to speak.

In addition, there are excellent resources like the Thammasat Electronic Rare Books site, which has black and white PDF files of many old books. Not all of these are actually public domain, but sites like this provide another avenue for public domain source material that can be digitized as text.

OCR software:

To date the best OCR software for Thai is ABBYY FineReader Professional, either version 9 or 10. In fact, ABBYY is the only one that's any good. I have used it extensively.

A program called ArnThai was released a few years ago by Thailand's National Electronics and Computer Technology Center (NECTEC), but unfortunately ArnThai is rather terrible. Not only is the quality of its OCR poor, it also has no batch processing of any kind, supports only a limited set of input and output file types, and has no mechanism for training characters at all. ABBYY FineReader, while not perfect, has sophisticated tools for all of these things.

OCR accuracy for Thai is well above 90%, but it still has problems. ABBYY has trouble with older typefaces, for instance. And even on newer books it still has some trouble accurately detecting all superscript and subscript characters, or differentiating very similar characters. This is par for the course. Since it has a training mechanism, though, a human can teach the software how to properly recognize difficult typefaces when needed, which greatly improves the quality.

Proofreading:

The basic mechanism for crowdsourced proofreading is to show the user the original page image side by side with the text output from the OCR software (or with the manually typed text, if the page was typed by hand). Here's an example taken from pgdp.net:

[Image: pgdp.net proofreading interface]

The user makes corrections and submits the page when it is complete. The process that pgdp.net uses is sometimes overly complicated. Every page is checked about a dozen times, each time focusing on different things (basic text accuracy, formatting, etc.). They have the luxury of plentiful volunteers.

To start with, at least, Project Gutenberg Thailand will not have this luxury. The method I propose is like this:

Each page is proofread two times, once each by two different human proofreaders. They make their corrections and submit the page. Ideally, the two versions would be identical, but there will of course be some errors. To identify the errors, the two versions are then compared programmatically, to find discrepancies between them. Any place where they differ can be assumed to be an error made by one proofreader or the other. The points of discrepancy are highlighted and shown to a third person, who can quickly correct only the highlighted points, thereby fixing mistakes made by the original two proofreaders. At that point the error rate on the page will be extremely low.
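The comparison step could be sketched roughly as follows. This is only an illustration using Python's standard difflib module, not a finished tool; the function names and the [[A|B]] highlight format are my own assumptions. It compares character by character rather than word by word, since Thai text has no spaces between words.

```python
from difflib import SequenceMatcher


def find_discrepancies(version_a: str, version_b: str):
    """Return (a_start, a_end, b_start, b_end) spans where the two texts differ."""
    # autojunk=False keeps difflib from treating frequent characters as noise.
    matcher = SequenceMatcher(None, version_a, version_b, autojunk=False)
    return [
        (i1, i2, j1, j2)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def highlight(version_a: str, version_b: str) -> str:
    """Render the page with each disputed span shown as [[A's text|B's text]]."""
    matcher = SequenceMatcher(None, version_a, version_b, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Both proofreaders agree here, so the text passes through untouched.
            pieces.append(version_a[i1:i2])
        else:
            # The proofreaders disagree: show both readings to the third person.
            pieces.append(f"[[{version_a[i1:i2]}|{version_b[j1:j2]}]]")
    return "".join(pieces)


first = "The quick brown fox"
second = "The quick brovvn fox"
print(highlight(first, second))
```

The third person then only has to look at the bracketed spans. Note that a mistake both proofreaders make in exactly the same way produces no discrepancy, so this method lowers the error rate but cannot bring it to zero.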

Distribution:

Once books are turned into digital text, they can be formatted into the various ebook formats for reading on a computer, cell phone, or ebook reader, as well as being made available on the website as text and HTML files. I believe public domain works should be distributed far and wide, of course, so I would be pleased to see any output of Project Gutenberg Thailand also posted to Thai Wikisource, as well as any other website that wishes to host it.

As a test, I took the electronic text of the short book Letters from Jangwangram จดหมายจางวางหร่ำ, a 1905 epistolary novel, and created an EPUB file. I then transferred it onto my iPhone and loaded it into Stanza, a free ebook reading app. I'm happy to say it is quite readable, even though the lack of word breaks in Thai creates some odd spacing. I also loaded it onto my Kindle 3. It was readable there too, but the default font was not ideal, and there were similar spacing issues.

You can download this EPUB on your own devices if you'd like to test it out:

จดหมายจางวางหร่ำ
โดย น.ม.ส.
Letters from Jangwangram
by N.M.S. (pen name of Prince Bidyalankarana)
jotmai-jangwangram.epub

(Note: The Thai text in this ebook is not stored as raw Unicode characters. It is written as HTML entities, that is, numeric character references, because some ebook formats are not Unicode compatible yet. I'm still learning the particulars of the several popular ebook formats.)
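For anyone curious what that conversion looks like, here is a minimal sketch in Python. It is not necessarily how the sample file was produced, and the function names are hypothetical. It relies on the standard "xmlcharrefreplace" error handler, which replaces every non-ASCII character with a decimal reference such as &#3585; for ก.

```python
import html


def to_html_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    # ASCII characters, including & and <, pass through unchanged.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def from_html_entities(text: str) -> str:
    """Turn character references back into ordinary Unicode characters."""
    return html.unescape(text)


title = "จดหมายจางวางหร่ำ"
encoded = to_html_entities(title)
print(encoded)
print(from_html_entities(encoded) == title)
```

The encoded file is pure ASCII, so it survives tools that mishandle Thai, at the cost of several bytes per character. Because the markup characters & and < are left alone, this should only be run on the text content, not used as a substitute for proper HTML escaping.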

7 comments:

  1. Wow!!! This is great!!!

    Your work would not only help to disseminate public domain books to the wider Thai public but would also be greatly beneficial to visually impaired Thai people who can't access print copies.

    Thank you very much for initiating this Thai version of Project Gutenberg.

  2. I've joined the group and would like to get on board helping.

    I really appreciate public domain books in English and think it would be great if they were available in Thai too.

    I'll have a go with จดหมายจางวางหร่ำ

    As an aside, I find Thai appears pretty badly on my kindle (3), so normally save Thai stuff for it as a six inch pdf. (I don't want to jailbreak it.)

  3. Have you changed the Thai font in your Kindle?

  4. @blind bookworm: That's an excellent point that I hadn't really considered. Yet another reason to make sure we get the ball rolling.

    @kit: Thanks for the tip about making 6 inch pdfs, I'll have to give that a try.

    @Krungthep Kaki: I was under the impression that changing the Thai font requires hacking/jailbreaking the device. Am I mistaken?

  5. I guess so. When you complained about the way the Thai font was, I thought maybe you meant that you had changed it. I'm seriously considering jailbreaking mine because the Thai font is not good, and I wanted to hear your experience. The only thing holding me back is the total lack of e-books in Thai in formats other than pdf.

  6. Great initiative. I've been working to get Philippine books online, and set up http://www.gutenberg.ph, although I am mostly riding on the PGDP.net and Project Gutenberg infrastructure. The very sample you show is a Cebuano dictionary I am preparing for Project Gutenberg. (BTW: For Thai dictionaries, you already have a great resource: http://sealang.net) The one handicap you have is that the main site does not support the Thai script (the Spanish effectively eradicated almost all native scripts from the Philippines, so we don't have that problem), but you might be able to get a site running that does support Unicode.

  7. Derrick Streng, 3/16/2011 11:57 AM

    Have you tried Google's tesseract-ocr? It is fully trainable, but it is a command-line-only utility. Still, it's free and (for English anyway) has been ranked in the top three OCR programs available.

    http://code.google.com/p/tesseract-ocr/

    Windows executables are available, but for Linux it must be compiled from source.
