Google recently released the Tesseract OCR engine as open source. Tesseract was in fact not originally developed at Google! It was developed at Hewlett Packard Laboratories between 1985 and 1995. In 1995 it was one of the top three performers at the OCR accuracy contest organised by the University of Nevada, Las Vegas (UNLV). Shortly thereafter, however, HP decided to get out of the OCR business, and Tesseract has been collecting dust in an HP warehouse ever since. Fortunately, some of our esteemed HP colleagues realised a year or two ago that, rather than sit on this engine, it would be better for the world if they brought it back to life by open sourcing it with the help of the Information Science Research Institute at UNLV. UNLV was happy to oblige, but in turn asked Google for help in fixing a few bugs that had crept in since 1995. Google tracked down the most obvious ones and decided a couple of months ago that Tesseract OCR was stable enough to be re-released as open source.
A few things to know about Tesseract OCR: for now it supports only English, and it does not yet include a page layout analysis module, so it will perform poorly on multi-column material. It also doesn't do well on grayscale and color documents, and it's not nearly as accurate as some of the best commercial OCR packages out there.