
Linux OCR Software Comparison [splitbrain.org]

http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

I wanted to see how recognition rates differ between the tools, so I created some very simple images. I took the last stanza of Edgar Allan Poe's “The Raven” and rendered it as an image in several different fonts. To make it a tiny bit more complicated, I also created a lower-contrast grayscale version of each image.

This is the original text:

And the raven, never flitting, still is sitting, still is sitting
On the pallid bust of Pallas just above my chamber door;
And his eyes have all the seeming of a demon's that is dreaming,
And the lamp-light o'er him streaming throws his shadow on the floor;
And my soul from out that shadow that lies floating on the floor
Shall be lifted - nevermore!

And this is what the resulting images looked like:

They are all 300 dpi, the text isn't distorted or arranged in multiple columns, the language is English in pure ASCII-7, and there is no image noise at all. Okay, the “Justy” font isn't your everyday printed font, but it resembles really clean handwriting. Overall this is a really basic task for OCR. Or so I thought.

Let's have a look at the results first:

Recognition scores were calculated from dwdiff's statistics output, comparing the original text with the OCR output.
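To give a rough idea of what such a word-level score measures, here is a minimal Python sketch. It does not call dwdiff and is not guaranteed to reproduce its numbers; it only assumes that the score is the share of words from the original text that show up unchanged, and in order, in the OCR output. The function name word_recognition_rate and the sample OCR string are made up for illustration.

```python
import difflib


def word_recognition_rate(original: str, ocr_output: str) -> float:
    """Return the percentage of words in `original` that are matched,
    in order, by identical words in `ocr_output`.

    Assumption: this only approximates a word-based diff statistic;
    it is not dwdiff's own formula.
    """
    original_words = original.split()
    ocr_words = ocr_output.split()
    if not original_words:
        # Nothing to recognise: perfect only if the OCR output is empty too.
        return 100.0 if not ocr_words else 0.0

    # autojunk=False keeps frequent words from being discarded as "junk".
    matcher = difflib.SequenceMatcher(a=original_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(original_words)


# Hypothetical example: one of eight words was misread by the OCR tool.
original = "And the raven, never flitting, still is sitting"
ocr_output = "And the raven, nevcr flitting, still is sitting"
print(f"{word_recognition_rate(original, ocr_output):.1f}% of words recognised")
```

Because the comparison is word-based, a single misread character costs the whole word, which is why scores on such a short text drop quickly.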

As you can see, the commercial Abbyy software has absolutely no problems with the printed fonts, but fails at the handwriting. It is the slowest of all tested tools, but keep in mind that it also reads nearly any image format, while you probably need to convert your images for the other tools first.

If you prefer free OCR software, then tesseract is indeed as good as its reputation. Note that I used the most recent version, built from SVN here. Tesseract started out as a proprietary engine developed at Hewlett-Packard up to the early nineties; it was later open sourced, and its development has since been sponsored by Google. It is pretty picky about the input image's format, but once you get that right the results are decent enough.
