Documentation for integrating Tesseract (OCR) into Apertium

Introduction

This article collects information on integrating Tesseract-OCR [1] into Apertium.

Tesseract could be integrated into the Apertium website and into the Apertium app for Android.


Tesseract into Apertium website

On the website, Tesseract could back an option that lets the user supply a picture, recognises the text in it and translates that text. The pages below describe how to use Tesseract from different languages:

Language                                             Page
HTML5 or JavaScript                                  Progur.com
PHP                                                  Sitepoint.com
Python (getting started; can be used with Django)    FreeCodeCamp.org


Tesseract into Apertium app

The Apertium offline translator is written primarily in Java, so Tesseract can be called from Java code, following the ideas in this video [4].

The app could also offer a list of the downloadable language packages for Tesseract [3]. For example, a link could download the 'spa' package [3] locally so that the app can recognise text in Spanish.

Code shown in the video (it uses the Tess4J library):

import net.sourceforge.tess4j.Tesseract;

import java.io.File;

public class OcrReader {
	
	public static void main(String[] args) throws Exception {
		String inputFilePath = "F:/Tesseract/English.tif";

		Tesseract tesseract = new Tesseract();

		String fullText = tesseract.doOCR(new File(inputFilePath));

		System.out.println(fullText);
	}
}

This example hard-codes the path of the image. In the app, the path should instead come from user input (e.g. a drop-down list, a text field, a menu), so that the user can choose the image without rewriting any code.
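As a rough illustration of taking the image path from the user, the sketch below reads it from the command line and validates it before any OCR is attempted. The class name ImagePathInput and the helpers resolveImage and recognise are hypothetical names introduced here, not part of Tess4J or Apertium. The OCR step is passed in as a function so that the sketch compiles without Tess4J; in the app that function would presumably wrap tesseract.doOCR(File).

```java
import java.io.File;
import java.util.function.Function;

public class ImagePathInput {

    // Turns a user-supplied path into a File, rejecting empty or unreadable input.
    public static File resolveImage(String userPath) {
        if (userPath == null || userPath.trim().isEmpty()) {
            throw new IllegalArgumentException("No image path given");
        }
        File image = new File(userPath.trim());
        if (!image.isFile() || !image.canRead()) {
            throw new IllegalArgumentException("Not a readable file: " + userPath);
        }
        return image;
    }

    // Runs the supplied OCR function on the user's image.
    // Assumption: in the app this function would wrap Tesseract.doOCR(File).
    public static String recognise(String userPath, Function<File, String> ocr) {
        return ocr.apply(resolveImage(userPath));
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java ImagePathInput <image-file>");
            return;
        }
        // Placeholder engine so the sketch runs without Tess4J on the classpath.
        Function<File, String> placeholder = f -> "(OCR output for " + f.getName() + ")";
        System.out.println(recognise(args[0], placeholder));
    }
}
```

In a graphical app the string passed to recognise would come from a file chooser or text field rather than from args, but the validation step stays the same.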

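Setting the language and the data path (setLanguage(), setDatapath()):

import net.sourceforge.tess4j.Tesseract;

import java.io.File;

public class OcrReader {
	public static void main(String[] args) throws Exception {
		String inputFilePath = "F:/Tesseract/English.tif";

		Tesseract tesseract = new Tesseract();

		// Location of the Tesseract language data files
		tesseract.setDatapath("F:/Tesseract/");
		// Language package used for recognition (here: Simplified Chinese)
		tesseract.setLanguage("chi_sim");

		String fullText = tesseract.doOCR(new File(inputFilePath));

		System.out.println(fullText);
	}
}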
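Before a language is selected, the app could check whether its package has already been downloaded. The sketch below relies on the fact that Tesseract stores each language as a file named after its code with the extension .traineddata (for example spa.traineddata). The class TessdataCheck, its methods and the default folder name "tessdata" are hypothetical and only illustrate the idea.

```java
import java.io.File;

public class TessdataCheck {

    // Tesseract language packages are files named <code>.traineddata.
    public static File trainedDataFile(File tessdataDir, String languageCode) {
        return new File(tessdataDir, languageCode + ".traineddata");
    }

    // True if the package for the given language code is present in the folder.
    public static boolean isInstalled(File tessdataDir, String languageCode) {
        return trainedDataFile(tessdataDir, languageCode).isFile();
    }

    public static void main(String[] args) {
        // Hypothetical default folder; pass the real data folder as the first argument.
        File tessdataDir = new File(args.length > 0 ? args[0] : "tessdata");
        for (String code : new String[] {"eng", "spa", "chi_sim"}) {
            String state = isInstalled(tessdataDir, code) ? "installed" : "missing";
            System.out.println(code + ": " + state);
        }
    }
}
```

Languages reported as missing are the ones for which the app would show a download link from the data files page [3].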

References

1. https://opensource.google.com/projects/tesseract

2. https://github.com/tesseract-ocr/tesseract

3. https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

4. https://www.youtube.com/watch?v=58oG5Z8_0r4

5. https://priyankvex.wordpress.com/2015/09/02/making-an-ocr-app-for-android-using-tesseract/

6. https://www.codepool.biz/making-an-android-ocr-application-with-tesseract.html