Integrating Tesseract OCR into Apertium

Revision as of 16:44, 27 October 2018 by Amuritna (talk | contribs) (stub, outlining, still lacking citations + links)

Introduction

This article gives an overview of Tesseract and of what it would take, in theory, to integrate it into Apertium in two different ways. Both approaches rely on an additional tool, OpenCV, to increase accuracy.

Tesseract is an open-source engine for converting images into text, and is said to be one of the most accurate programs of its type. It was originally developed by Hewlett-Packard and was later sponsored by Google. It is licensed under the Apache License 2.0, so it can be modified and redistributed as long as the license accompanies all derivatives and all due credit is given. Integrating it with Apertium opens up many possibilities, such as a direct camera-to-translation feature.

Pre-processing input images to increase accuracy

Tesseract is very likely to fail without proper image pre-processing, so it is a good idea to pair it with image manipulation software such as OpenCV (a recommended option). It also assumes certain things about the input image, such as it being binarized, i.e. black and white (not to be confused with greyscale). Common pre-processing techniques include binarization, noise removal, cropping, rescaling, rotation, and de-skewing.

Cropping out irrelevant parts

You do not want to distract Tesseract from what is important. Cropping out the parts that do not include the text we want, best done manually, is one of the best things you can do to increase the OCR's reliability. Letting the user crop to the relevant text (or requiring it!) before the input image is sent on to the remaining steps would be a vital feature to have.
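A minimal sketch of the cropping step, assuming the image is already loaded as a NumPy array (which is how OpenCV's Python bindings represent images) and that some user interface supplies the selection rectangle. The function name crop_to_selection and its parameters are hypothetical and not part of Apertium, Tesseract, or OpenCV.

```python
import numpy as np


def crop_to_selection(image: np.ndarray, x: int, y: int,
                      width: int, height: int) -> np.ndarray:
    """Return the part of `image` inside the user's selection rectangle.

    (x, y) is the top-left corner in pixels. The rectangle is clamped to the
    image bounds, so a selection dragged past the edge is simply cut short.
    """
    img_h, img_w = image.shape[:2]
    left = max(0, min(x, img_w))
    top = max(0, min(y, img_h))
    right = max(left, min(x + width, img_w))
    bottom = max(top, min(y + height, img_h))
    if right == left or bottom == top:
        raise ValueError("selection does not overlap the image")
    # NumPy indexing is rows (y) first, then columns (x).
    return image[top:bottom, left:right].copy()
```

The copy at the end keeps the cropped image independent of the original array, so later pre-processing steps cannot modify the user's source image by accident.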

Binarization

Binarization is the process of reducing the many colors of an image to exactly two, commonly black and white, and can be thought of as extreme contrast. This can be done in OpenCV with one of its several image thresholding functions. The goal is a clear, high-contrast result that preserves all the wanted information.

Noise removal

Noise removal is also called denoising, a term that OpenCV also uses in the names of its noise-removal functions.

Rescaling, rotation, and de-skewing

Tesseract has been said to work best with approximately horizontal black-on-white text at least 20px high, in an image of at least 300dpi.

Implementing Tesseract within Apertium

This (theoretical) implementation integrates Tesseract directly into Apertium, with the two shipped together.

Using the Python wrapper library pytesseract
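A rough sketch of what calling pytesseract could look like, assuming Tesseract itself, pytesseract, and Pillow are installed and the pre-processed image has been saved to disk. The helper names image_to_text and clean_ocr_text are hypothetical, and the clean-up step is an assumption about what is useful before passing text on to Apertium, not an established part of either project.

```python
def clean_ocr_text(raw: str) -> str:
    """Strip surrounding whitespace from each line and drop empty lines.

    str.splitlines() also splits on form feeds, so a trailing page
    separator in the OCR output disappears here as well.
    """
    lines = [line.strip() for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)


def image_to_text(image_path: str, language: str = "eng") -> str:
    """Run Tesseract on the image at `image_path` and return cleaned text."""
    # Imported inside the function so that this module can still be loaded
    # on systems where pytesseract or Pillow are not installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        raw = pytesseract.image_to_string(image, lang=language)
    return clean_ocr_text(raw)
```

The language argument is a Tesseract language code such as "eng", and the matching trained data file has to be installed for it to work.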

Shipping Apertium with Tesseract, OpenCV, and pytesseract

Implementing Tesseract in the cloud

Tesseract OCR (and thus OpenCV) does not necessarily have to be integrated directly into Apertium and shipped with it. This section describes an implementation that sets up Tesseract in the cloud, where it communicates with Apertium over an internet connection.

This implementation has the advantage of reducing strain on the device Apertium runs on, as the combination of Tesseract and OpenCV can be quite heavy on memory usage. However, making Tesseract available only when there is an internet connection decreases its accessibility.
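The client and server split described above could be sketched with nothing but the Python standard library, as below. Everything here is an assumption made for illustration: the names make_ocr_handler and request_ocr, the use of a plain HTTP POST carrying raw image bytes, and the plain-text response. The actual OCR function is passed in, so the server side can wrap Tesseract and OpenCV however it likes.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler
from typing import Callable


def make_ocr_handler(ocr: Callable[[bytes], str]):
    """Server side: build a request handler that runs `ocr` on POSTed images."""

    class OcrHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            image_bytes = self.rfile.read(length)
            body = ocr(image_bytes).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # suppress per-request logging in this sketch

    return OcrHandler


def request_ocr(url: str, image_bytes: bytes, timeout: float = 10.0) -> str:
    """Client side: send image bytes to the OCR server and return its text."""
    request = urllib.request.Request(
        url,
        data=image_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

On the server, the handler class would be passed to http.server.HTTPServer together with a function that decodes the bytes and runs the OCR. A real deployment would also need HTTPS, authentication, and limits on upload size, none of which are shown here.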

Setting up Tesseract on a remote server

Letting Apertium communicate with Tesseract

Alternatives and ideas

  • Replacing OpenCV with other image manipulation programs, such as ImageMagick.
  • Implementing Tesseract integration as an optional plug-in.

Resources