Documentation for integrating Tesseract (OCR) into Apertium

Introduction

This article collects information on integrating Tesseract-OCR¹ into Apertium.

Tesseract could be integrated into the website and also as part of the Apertium app for Android.

Tesseract into Apertium website

On the website, Tesseract could power an option to upload a picture, recognize the text in it, and translate that text. The tutorials below describe different ways of doing this:

Language	Page
HTML5 or JavaScript	Progur.com
PHP	Sitepoint.com
Python (getting started -> can be used with django)	FreeCodeCamp.org

Tesseract into Apertium app

The Apertium offline translator app is written primarily in Java, so we can reuse the ideas in this video⁴, which calls Tesseract from Java through the Tess4J wrapper.

We could also offer a list of the downloadable language packages for Tesseract³ (for example, a link to download the 'spa' package shown here³ locally, so that the app can recognize text in Spanish).

Code shown in the video:

import net.sourceforge.tess4j.Tesseract;

import java.io.File;

public class OcrReader {
	
	public static void main(String[] args) throws Exception {
		String inputFilePath = "F:/Tesseract/English.tif";

		Tesseract tesseract = new Tesseract();

		String fullText = tesseract.doOCR(new File(inputFilePath));

		System.out.println(fullText);
	}
}

This code hard-codes the path of the image. The fix is simple: replace the hard-coded path with an input (e.g. a drop-down list, a text field, a menu...) so that the user can choose the image without rewriting the code.
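The idea of taking the image path from the user instead of hard-coding it could be sketched as follows. This is only a minimal sketch: the class `OcrInput` and its methods are hypothetical names, not part of Tess4J or the Apertium app, and the path is assumed to arrive as the first command-line argument. The OCR engine is passed in as a function so that the sketch runs without Tesseract installed.

```java
import java.io.File;
import java.util.function.Function;

// Hypothetical helper, not part of Tess4J or the Apertium app.
class OcrInput {

    // Turns the first command-line argument into an image file,
    // failing early if it is missing or does not point to a file.
    static File resolveImage(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException("Usage: OcrInput <image-path>");
        }
        File image = new File(args[0]);
        if (!image.isFile()) {
            throw new IllegalArgumentException("Not a file: " + args[0]);
        }
        return image;
    }

    // Runs the given OCR function on the image chosen by the user.
    static String recognize(String[] args, Function<File, String> ocr) {
        return ocr.apply(resolveImage(args));
    }

    public static void main(String[] args) {
        // Placeholder engine that only reports the file name. In the app this
        // lambda would wrap tesseract.doOCR(file) and handle its exception.
        System.out.println(recognize(args, file -> "OCR input: " + file.getName()));
    }
}
```

In the Android app the same check could sit behind a file picker or a camera capture instead of a command-line argument.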

Set the language and the default data path (setLanguage(), setDatapath()):

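import net.sourceforge.tess4j.Tesseract;

import java.io.File;

public class OcrReader {
	public static void main(String[] args) throws Exception {
		// Image path taken from the first command-line argument
		String inputFilePath = args[0];

		Tesseract tesseract = new Tesseract();

		// Directory where Tesseract looks for its language data
		tesseract.setDatapath("F:/Tesseract/");
		// Language package to use, here Simplified Chinese
		tesseract.setLanguage("chi_sim");

		String fullText = tesseract.doOCR(new File(inputFilePath));

		System.out.println(fullText);
	}
}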
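Before calling setLanguage(), the app could check whether the matching language package has already been downloaded. The sketch below relies on Tesseract's convention of storing each language as a file named `<code>.traineddata`, and it assumes that this file sits directly in the directory passed to setDatapath(). The class `LanguagePacks`, its methods, and the default `tessdata` directory are hypothetical names used only for illustration.

```java
import java.io.File;

// Hypothetical helper, not part of Tess4J or the Apertium app.
class LanguagePacks {

    // Tesseract stores each language as <code>.traineddata in its data directory.
    static File packFile(File dataDir, String language) {
        return new File(dataDir, language + ".traineddata");
    }

    // True if the package for the given language code is already on disk.
    static boolean isInstalled(File dataDir, String language) {
        return packFile(dataDir, language).isFile();
    }

    public static void main(String[] args) {
        // Assumed default directory; pass another one as the first argument.
        File dataDir = new File(args.length > 0 ? args[0] : "tessdata");
        for (String language : new String[] {"spa", "chi_sim"}) {
            String status = isInstalled(dataDir, language) ? "is installed" : "must be downloaded";
            System.out.println(language + " " + status);
        }
    }
}
```

If a package is missing, the app could show the download link for that language instead of starting the OCR.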