Documentation for integrating Tesseract (OCR) into Apertium
Introduction
This article collects information on integrating Tesseract OCR [1] into Apertium.
Tesseract could be integrated into the Apertium website and into the Apertium app for Android.
Tesseract into Apertium website
On the website, Tesseract could power an option that takes a picture, recognizes the text in it, and translates that text. The pages below describe ways to do this in different languages:
Language | Page |
---|---|
HTML5 or JavaScript | Progur.com |
PHP | Sitepoint.com |
Python (getting-started guide; can be used with Django) | FreeCodeCamp.org |
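Once the text has been recognized, it still has to be handed to the translator. The following is a minimal sketch of that hand-off, assuming the website talks to an Apertium APy server through a `translate` endpoint that takes `langpair` and `q` parameters. The class `TranslateRequest`, the method `buildUrl`, and the base URL are illustrative names and values, not part of Apertium or Tesseract, and should be checked against the deployment actually used.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: turns OCR output into a translation request URL. */
final class TranslateRequest {

    // Assumed base URL of an Apertium APy instance; replace with your own deployment.
    static final String ASSUMED_APY_BASE = "https://apertium.org/apy";

    /** Builds a GET URL of the form <base>/translate?langpair=src|tgt&q=text. */
    static String buildUrl(String base, String src, String tgt, String ocrText) {
        // '|' and spaces are not safe in a query string, so both values are URL-encoded.
        String pair = URLEncoder.encode(src + "|" + tgt, StandardCharsets.UTF_8);
        // OCR output often ends with a newline; strip surrounding whitespace first.
        String query = URLEncoder.encode(ocrText.strip(), StandardCharsets.UTF_8);
        return base + "/translate?langpair=" + pair + "&q=" + query;
    }

    public static void main(String[] args) {
        // In a real page the text would come from the OCR step rather than a literal.
        System.out.println(buildUrl(ASSUMED_APY_BASE, "spa", "eng", "Hola mundo\n"));
    }
}
```

Keeping the URL construction in one small function makes it easy to swap the OCR front end (JavaScript, PHP, or Python from the table above) without touching the translation request.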
Tesseract into Apertium app
The Apertium offline translator is written primarily in Java, so Tesseract can be called from it through a Java wrapper, following the ideas in this video [4].
The app could also offer a list of downloadable Tesseract language packages [3]. For example, a link to download the 'spa' package locally would let the app recognize text in Spanish.
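A list of downloadable packages could be generated from the language codes alone, since each Tesseract package is a file named after its code (for example `spa.traineddata`). The sketch below assumes the files are fetched from the `tessdata` repository on GitHub. The base URL is an assumption to verify against the Data Files page listed in the references, and `TessdataPackages` is a hypothetical helper name.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper listing downloadable Tesseract language packages. */
final class TessdataPackages {

    // Assumed location of the trained data files; verify before relying on it.
    static final String ASSUMED_BASE_URL =
            "https://github.com/tesseract-ocr/tessdata/raw/main/";

    /** Returns the package file name for a language code, e.g. "spa.traineddata". */
    static String fileName(String tessLang) {
        return tessLang + ".traineddata";
    }

    /** Builds one download link per language code, in the order given. */
    static List<String> downloadLinks(List<String> tessLangs) {
        return tessLangs.stream()
                .map(lang -> ASSUMED_BASE_URL + fileName(lang))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        downloadLinks(List.of("spa", "eng", "chi_sim")).forEach(System.out::println);
    }
}
```

Tesseract codes often coincide with the three-letter codes Apertium uses (`spa`, `eng`), but not always (`chi_sim`), so an explicit mapping between the two may be needed.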
Code shown in the video:

    import net.sourceforge.tess4j.Tesseract;
    import java.io.File;

    public class OcrReader {
        public static void main(String[] args) throws Exception {
            String inputFilePath = "F:/Tesseract/English.tif";
            Tesseract tesseract = new Tesseract();
            String fullText = tesseract.doOCR(new File(inputFilePath));
            System.out.println(fullText);
        }
    }

This example hard-codes the path of the image. In the app, the path should instead come from user input (e.g. a drop-down list, a text field, or a menu), so that users can choose the image they want without rewriting the code.
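Taking the image path from the user could look like the sketch below, which reads it from the command line and validates it before any OCR is attempted. `ImageInput` and `resolve` are hypothetical names, and the tess4j call is left as a comment so that the example runs without the library. In the app, the same check would sit behind a file picker or text field.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that validates a user-supplied image path before OCR. */
final class ImageInput {

    /** Returns the path if it names a readable regular file; otherwise throws. */
    static Path resolve(String userInput) {
        if (userInput == null || userInput.isBlank()) {
            throw new IllegalArgumentException("No image path given");
        }
        Path path = Paths.get(userInput.trim());
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            throw new IllegalArgumentException("Not a readable file: " + path);
        }
        return path;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("Usage: java ImageInput <image-file>");
            return;
        }
        Path image = resolve(args[0]);
        // With tess4j on the classpath: new Tesseract().doOCR(image.toFile())
        System.out.println("Image to recognize: " + image);
    }
}
```

Rejecting a bad path early gives the user a clear message instead of an error from deep inside the OCR engine.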
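Set the language and the default data path (setLanguage(), setDatapath()):

    import net.sourceforge.tess4j.Tesseract;
    import java.io.File;

    public class OcrReader {
        public static void main(String[] args) throws Exception {
            String inputFilePath = "F:/Tesseract/English.tif";
            Tesseract tesseract = new Tesseract();
            // Directory that holds the trained data files
            tesseract.setDatapath("F:/Tesseract/");
            // Language of the text in the image (here Simplified Chinese)
            tesseract.setLanguage("chi_sim");
            String fullText = tesseract.doOCR(new File(inputFilePath));
            System.out.println(fullText);
        }
    }
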
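Before the data path and language are set, it can help to confirm that the requested language packages are actually present. The sketch below is one way to do that. It assumes the directory passed in directly contains the `.traineddata` files (whether the data path should name that directory or its parent depends on the Tesseract version), and `TessdataCheck` and `missingLanguages` are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for the data path and language settings. */
final class TessdataCheck {

    /**
     * Returns the language codes whose "<code>.traineddata" file is missing.
     * Assumes dataDir is the directory that directly contains the files.
     */
    static List<String> missingLanguages(Path dataDir, String languageSpec) {
        List<String> missing = new ArrayList<>();
        // Tesseract accepts several languages joined with '+', e.g. "chi_sim+eng".
        for (String code : languageSpec.split("\\+")) {
            if (!Files.isRegularFile(dataDir.resolve(code + ".traineddata"))) {
                missing.add(code);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Falls back to the directory used in the example above when no argument is given.
        Path dataDir = Path.of(args.length > 0 ? args[0] : "F:/Tesseract/");
        System.out.println("Missing: " + missingLanguages(dataDir, "chi_sim+eng"));
    }
}
```

If the returned list is not empty, the app could offer to download the missing packages instead of failing during recognition.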
References
1. https://opensource.google.com/projects/tesseract
2. https://github.com/tesseract-ocr/tesseract
3. https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
4. https://www.youtube.com/watch?v=58oG5Z8_0r4
5. https://priyankvex.wordpress.com/2015/09/02/making-an-ocr-app-for-android-using-tesseract/
6. https://www.codepool.biz/making-an-android-ocr-application-with-tesseract.html