Optical character recognition (OCR) is the conversion of images containing text to machine-encoded text. A popular tool for this is the open source project Tesseract. Tesseract can be used as standalone application from the command line. Alternatively it can be integrated into applications using its C++ API. For other programming languages various wrapper APIs are available. In this post we will use the Java Wrapper Tess4J.
We start with adding the Tess4J maven dependency to our project:
Next we need to make sure the native libraries required by Tess4j are accessible from our application. Tess4J jar files ship with native libraries included. However, they need to be extracted before they can be loaded. We can do this programmatically using a Tess4J utility method:
With LoadLibs.extractTessResources(..) we can extract resources from the jar file to a local temp directory. Note that the argument (here win32-x86-64) depends on the system you are using. You can see available options by looking into the Tess4J jar file. We can instruct Java to load native libraries from the temp directory by setting the Java system property java.library.path.
Other options to provide the libraries might be installing Tesseract on your system. If you do not want to change the java.library.path property you can also manually load the libraries using System.load(..).
Next we need to provide language dependent data files to Tesseract. These data files contain trained models for Tesseracts LSTM OCR engine and can be downloaded from GitHub. For example, for detecting german text we have to download deu.traineddata (deu is the ISO 3166-1-alpha-3 country code for Germany). We place one or more downloaded data files in the resources/data directory.
Now we are ready to use Tesseract within our Java application. The following snippet shows a minimal example:
First we create a new Tesseract instance. We set the language we want to recognize (here: german). With setOcrEngineMode(1) we tell Tesseract to use the LSTM OCR engine.
Next we set the data directory with setDatapath(..) to the directory containing our downloaded LSTM models (here: resources/data).
Finally we load an example image from the classpath and use the doOCR(..) method to perform character recognition. As a result we get a String containing detected characters.
For example, feeding Tesseract with this photo from the German wikipedia OCR article might produce the following text output.
Tesseract is a popular open source project for OCR. With Tess4J we can access the Tesseract API in Java. A little bit of set up is required for loading native libraries and downloading Tesseracts LSTM data. After that it is quite easy to perform OCR in Java. If you are not happy with the recognized text it is a good idea to have a look at the Improving the quality of the output section of the Tesseract documentation.
You can find the source code for the shown example on GitHub.