Tesseract OCR: Scan documents with text recognition

David Wolski

Good text recognition (OCR) is sometimes essential. Tesseract OCR for Linux is one of the best freeware tools in its class.

The concept of optical character recognition (OCR) is far older than modern computers.
© Wilhei, CC BY 3.0

Saving text as an image file is no problem. Going the other way and turning a scanned or photographed document back into text is much harder. If the scan also contains column layouts, picture elements and many inaccuracies, the text recognition software needs highly developed pattern recognition. Only well-engineered OCR programs produce an acceptable result.

The concept of optical character recognition (OCR) is far older than modern computers. A patent for converting printed letters into Morse code was granted as early as 1912. The technology gained importance as a reading aid for the blind and finally made great strides in the 1950s, when the first computers spread under the aegis of IBM. Thanks to improvements in scanner hardware and software, OCR has since become practicable for digitizing ordinary paper documents without special OCR fonts.

Today, text recognition is a productivity tool, and private users can choose from a reasonable selection of powerful OCR programs starting at 130 euros. Hardly any of the better-known commercial programs run on Linux, however. There is, though, an open source project that ranks among the best OCR programs available: Tesseract OCR.

High recognition rates with OCR are a must

OCR software cannot afford high error rates. Digitizing entire pages quickly produces tens of thousands of characters, so an error rate of just a few percent leaves a considerable number of incorrect letters and makes manual reworking necessary. OCR programs therefore aim for a recognition rate of at least 98 percent for Latin characters. Tesseract OCR is currently the only open source program that plays in this league.
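To make the scale of the problem concrete, the following minimal Python sketch estimates how many wrong characters a given recognition rate leaves behind. The function name and the page size of 3,000 characters are illustrative assumptions, not figures from the article.

```python
def expected_errors(char_count: int, recognition_rate_percent: float) -> int:
    """Estimate the number of misrecognized characters, rounded to an integer."""
    if not 0 <= recognition_rate_percent <= 100:
        raise ValueError("recognition rate must be between 0 and 100")
    error_fraction = (100 - recognition_rate_percent) / 100
    return round(char_count * error_fraction)


# Hypothetical example: a page of 3,000 characters at three recognition rates.
for rate in (95.0, 98.0, 98.7):
    print(f"{rate} % -> about {expected_errors(3000, rate)} wrong characters")
```

Even at 98 percent, such a page would still contain about 60 wrong characters, which is why every additional tenth of a percent matters.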

More than twenty years of development went into Tesseract OCR. HP began work on the text recognition in 1985, and by 1994 it achieved a precision of 98.7 percent. Although it outperformed the competition, Tesseract never became a finished product that HP could have shipped with its flatbed scanners. From 1995 to 2005 the project lay completely dormant, and HP finally released Tesseract as open source (Apache license) at a point when it was almost irrelevant, because it still lacked the automatic layout analysis that a complete OCR program needs to process multi-column text.

Even so, the algorithms, which process patterns step by step in a pipeline until they arrive at the finished word, performed so well that Google took on the project. Google needed OCR software for its online service Google Books and developed Tesseract OCR further. Since 2006, layout analysis and character recognition for non-European languages have been added. Tesseract OCR has matured to the current version 3.0, which is available in the package sources of all major Linux distributions.

Prepare images

Because of the way Tesseract OCR works, the image size must be chosen so that letters are at least 20 pixels high. This corresponds to a resolution of 300 dpi at a font size of ten points. In general, the more pixels, the better. If the scans or photographs have a lower pixel density, they must first be scaled up to a higher resolution with an image editor such as Gimp. Tesseract OCR reacts badly to heavily distorted text baselines and skewed pages, such as those found on photographed book pages. These defects should be corrected as far as possible.
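The required enlargement can be estimated with a little arithmetic. The sketch below assumes that the resolution of the existing scan is known and that 300 dpi is the target; the function names are illustrative. It only calculates the target size, while the actual resampling still happens in an image editor such as Gimp.

```python
import math


def upscale_factor(source_dpi: float, target_dpi: float = 300.0) -> float:
    """Return the enlargement factor needed to reach the target resolution.

    Images that already meet the target are left alone (factor 1.0).
    """
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    return max(1.0, target_dpi / source_dpi)


def upscaled_size(width_px: int, height_px: int, source_dpi: float,
                  target_dpi: float = 300.0) -> tuple[int, int]:
    """Return the pixel dimensions after scaling up, rounded up to whole pixels."""
    factor = upscale_factor(source_dpi, target_dpi)
    return math.ceil(width_px * factor), math.ceil(height_px * factor)


# Hypothetical example: an A4 page scanned at 150 dpi (1240 x 1754 pixels).
print(upscale_factor(150))
print(upscaled_size(1240, 1754, 150))
```

Scaling up does not add real detail, so rescanning at a higher resolution is preferable whenever the paper original is still at hand.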

Install and use Tesseract on Linux

Tesseract itself only supplies the OCR engine, which works as a command line tool. The aim of the developers is to keep Tesseract OCR so flexible that it can also serve as a central component for other OCR projects. There are also graphical front ends to Tesseract OCR. The first step, however, is to install the text recognition together with the separate language files that provide the pattern recognition with the information it needs.

Only graphical front-ends turn Tesseract OCR into a desktop application.

In Debian and Ubuntu, the OCR program is installed from the package sources together with the recognition patterns for German and English text. In openSUSE, the packages have different names and can be added in the same way. In Fedora, a single install command covers all the required packages, as English is already included in the basic package. In addition, there are over a hundred other language files for Tesseract OCR 3.0, as well as data for special fonts such as Gothic script.
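As a rough sketch of what these installations look like, the following Python snippet assembles one install command per distribution family. All package names in it are assumptions based on common repository naming and may differ between releases, so it is worth confirming them with the package manager's search function before running anything.

```python
import shlex

# Assumed package names; verify with `apt search tesseract`,
# `zypper search tesseract` or `dnf search tesseract`.
INSTALL_COMMANDS = {
    "debian": ["sudo", "apt-get", "install",
               "tesseract-ocr", "tesseract-ocr-deu", "tesseract-ocr-eng"],
    "opensuse": ["sudo", "zypper", "install",
                 "tesseract-ocr", "tesseract-ocr-traineddata-german"],
    # English is assumed to ship with the base package on Fedora.
    "fedora": ["sudo", "dnf", "install",
               "tesseract", "tesseract-langpack-deu"],
}


def install_command(distro: str) -> str:
    """Return the install command for a distribution family as a shell string."""
    try:
        return shlex.join(INSTALL_COMMANDS[distro.lower()])
    except KeyError:
        raise ValueError(f"no install command known for {distro!r}") from None


for name in INSTALL_COMMANDS:
    print(f"{name}: {install_command(name)}")
```

The snippet only prints the commands; running them is left to the user, since installing packages requires administrator rights.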

As a pure command line tool, Tesseract OCR expects a high-resolution image file (300 dpi) in JPG, PNG, TIFF or BMP format. A manual call names the input image, the base name of the output file and the language of the text.

For example, to read the image file "scan.jpg" with the text recognition for German-language documents and write the result to a file named "scan.txt", the call passes "scan.jpg" as input, "scan" as output name and "deu" as language.

While specifying the desired language is always necessary, Tesseract OCR recognizes the image format of the input file automatically and appends the extension ".txt" to the output file on its own. Support for English-language texts is switched on with the parameter "eng" instead of "deu". Operating the tool on the command line is not complicated, but it is not intuitive enough for everyday work on the Linux desktop in an office environment. It is better to leave this step to one of the graphical front ends.
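For scripted use, the call described above can be wrapped in a few lines of Python. This is a minimal sketch that assumes the usual invocation pattern "tesseract <image> <output base> -l <language>" and a tesseract binary on the search path; the function names are illustrative.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image: str, language: str = "deu") -> list[str]:
    """Build the argument list: tesseract <image> <output base> -l <language>."""
    # Tesseract appends ".txt" itself, so only the base name is passed.
    output_base = str(Path(image).with_suffix(""))
    return ["tesseract", image, output_base, "-l", language]


def recognize(image: str, language: str = "deu") -> str:
    """Run Tesseract and return the recognized text.

    Requires an installed tesseract binary and the matching language data.
    """
    subprocess.run(build_tesseract_command(image, language), check=True)
    return Path(image).with_suffix(".txt").read_text(encoding="utf-8")


# Show the command for the example from the text without running it.
print(" ".join(build_tesseract_command("scan.jpg")))
```

Passing "eng" as the second argument switches the same call to English-language recognition.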

Gimagereader as a front-end

The best-known program that turns Tesseract OCR into a graphical application is Gimagereader. Its interface presents the most important functions as menu items. There is a preview window for the image file and an output window that shows the result after a recognition run. If several scanned pages from an existing PDF are to be converted into text, Gimagereader can automatically split the entire document into individual pages and pass them on to Tesseract OCR one by one, so no manual conversion is needed. There is also a scanner interface via Sane for reading in documents directly from supported flatbed scanners.
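The page-by-page approach can also be imitated in a script. The following sketch does not show how Gimagereader works internally; it assumes the PDF pages have already been exported as image files with hypothetical names such as "page-1.png" and prepares one Tesseract call per page in the correct reading order.

```python
import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Extract the last number in a file name such as 'page-12.png'."""
    numbers = re.findall(r"\d+", path.stem)
    if not numbers:
        raise ValueError(f"no page number in {path.name}")
    return int(numbers[-1])


def ordered_pages(paths):
    """Sort page images numerically so that page-10 follows page-9, not page-1."""
    return sorted(paths, key=page_number)


def page_commands(paths, language="deu"):
    """Build one call per page: tesseract <image> <output base> -l <language>."""
    return [["tesseract", str(p), str(p.with_suffix("")), "-l", language]
            for p in ordered_pages(paths)]


# Hypothetical page files, deliberately listed out of order.
pages = [Path("page-10.png"), Path("page-2.png"), Path("page-1.png")]
for command in page_commands(pages):
    print(" ".join(command))
```

Sorting by the numeric page number rather than by file name keeps the resulting text files in document order, which matters when they are later joined into one text.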

With its many useful functions, Gimagereader is the ideal complement to Tesseract.

The preview function is a great help for pages with high proportions of images and complicated layouts, because a selection box allows only a certain area to be sent to the OCR program.

Another well thought-out extra: Gimagereader can run the hunspell spell checker, which LibreOffice also uses, over the resulting text, because errors cannot be completely avoided even with very good scans.

As a GTK program, Gimagereader is ideal for Gnome, Unity, Cinnamon and XFCE. The program is in the standard package sources for Ubuntu from version 15.10, in Debian from version 8 and also under Fedora. It's included in the newer Debian or Ubuntu versions