Integrating Tesseract OCR into Apertium

Introduction

This article gives an overview of Tesseract and of what it would take, in theory, to integrate it into Apertium.

Tesseract is an open-source OCR engine that converts images of text into machine-readable text, and it is often described as one of the most accurate engines of its kind. It was originally developed by Hewlett-Packard and later sponsored by Google. It is licensed under the Apache License 2.0, so it can be modified and redistributed as long as a copy of the license accompanies all derivatives and the required attribution notices are retained.

Integrating it with Apertium presents many possibilities, such as a direct camera-to-translation feature.
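In its simplest form, such a pipeline could pipe Tesseract's text output straight into Apertium. The sketch below is only an illustration: it assumes the `tesseract` and `apertium` command-line tools are installed, that the `eng` model and the `en-es` language pair are available, and the function names and the file name `photo.png` are hypothetical.

```python
import subprocess


def ocr_command(image_path, lang="eng"):
    """Build a tesseract command that writes recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def translate_command(pair="en-es"):
    """Build an apertium command that translates text read from stdin."""
    return ["apertium", pair]


def image_to_translation(image_path, ocr_lang="eng", pair="en-es"):
    """Run OCR on an image, then feed the recognized text to Apertium."""
    ocr = subprocess.run(ocr_command(image_path, ocr_lang),
                         capture_output=True, text=True, check=True)
    translated = subprocess.run(translate_command(pair), input=ocr.stdout,
                                capture_output=True, text=True, check=True)
    return translated.stdout


# Example call (requires both tools to be installed):
# print(image_to_translation("photo.png"))
```

Note that Tesseract uses three-letter language codes (such as `eng`) while Apertium pairs are named separately, so a real integration would need a mapping between the two.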

Word recognition

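Tesseract "reads" text by first analyzing the input image for character outlines and detecting spacing and proportions, which lets it split the text into lines and words. It then attempts to recognize each word in turn, in two passes, with the aid of existing trained models. Words recognized satisfactorily in the first pass are passed to an adaptive classifier as training data, so that the second pass can recognize words that were missed the first time.[1]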

Several sets of trained models are available for Tesseract 4.0.0[2]; they are listed below. User contributions can also be found here.

  • tessdata_fast, integerized and shipped by default with Tesseract.
    • Only the new LSTM-based OCR engine is supported.
    • Not necessarily the most accurate option: for some languages it is still the best choice, but for most it is not.
    • Uses 8-bit integer weights, so incremental training or fine-tuning is not possible.
    • Includes both single-language and multilingual-script models.
  • tessdata_best, significantly slower
    • Only the new LSTM-based OCR engine is supported.
    • For most languages, slightly better accuracy.
    • Better for certain retraining scenarios for advanced users.
  • tessdata
    • Supports the legacy Tesseract engine (--oem 0) as well as the new LSTM neural-network-based engine (--oem 1).
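The choice of model set and engine mode can be passed to Tesseract on the command line via `--tessdata-dir` and `--oem`. The following is a minimal sketch of building such a command; the function name, the example paths, and the assumption that each model directory is named after its model set are all hypothetical.

```python
from pathlib import Path

# Model sets that ship only LSTM models, per the list above.
LSTM_ONLY_SETS = {"tessdata_fast", "tessdata_best"}


def tesseract_args(image, tessdata_dir, lang="eng", oem=1):
    """Build tesseract arguments for a given model directory and engine mode.

    Assumes the last path component of tessdata_dir is the model set name.
    """
    model_set = Path(tessdata_dir).name
    # The legacy engine (--oem 0) needs models that only the tessdata set has.
    if oem == 0 and model_set in LSTM_ONLY_SETS:
        raise ValueError(f"{model_set} has no legacy models; use --oem 1")
    return ["tesseract", str(image), "stdout",
            "--tessdata-dir", str(tessdata_dir),
            "-l", lang, "--oem", str(oem)]


# Example: LSTM engine with the slower, more accurate models.
# tesseract_args("scan.png", "/opt/models/tessdata_best")
```

Rejecting `--oem 0` early for the LSTM-only sets gives a clearer error than letting the engine fail while loading the model.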

References