Integrating Tesseract OCR into Apertium

== Introduction ==

This article gives an overview of Tesseract and what it would take to ''theoretically'' integrate it into Apertium.

Tesseract is an open-source engine for converting images of text into machine-readable text, and is said to be one of the most accurate programs of its type. It was originally developed by Hewlett-Packard and later sponsored by Google. It is licensed under the Apache License 2.0, so it can be modified and redistributed as long as the license accompanies all derivatives and due credit is given.

Integrating it with Apertium presents many possibilities, such as a direct camera-to-translation feature.

== Pre-processing input images to increase accuracy ==

Tesseract is very likely to fail without proper image pre-processing, which makes pairing it with image manipulation software such as OpenCV (a recommended option) a good idea. Tesseract also assumes certain things about the input image, for example that it has been binarized, i.e. converted to pure black and white (not to be confused with greyscale).

Common pre-processing techniques include binarization (as mentioned above), noise removal, cropping, rescaling, rotation, and de-skewing.

=== Cropping out irrelevant parts ===

You do not want to distract Tesseract from what is important. Cropping away the parts of the image that do not contain the text you want, which is best done manually, is one of the best things you can do to increase the OCR's reliability. Letting (or making sure!) the user crop to the relevant text before the image is passed on to the rest of the pipeline would be a vital feature to have.

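As a rough sketch, such a crop could be applied with OpenCV once the user has selected a rectangle; the file name and coordinates below are placeholders only.

<syntaxhighlight lang="python">
import cv2

# Keep only the user-selected region of the photo.  The file name and
# the crop box are placeholders for whatever the interface provides.
image = cv2.imread("photo.jpg")
x, y, w, h = 40, 120, 600, 200            # user-drawn rectangle, in pixels
cropped = image[y:y + h, x:x + w]         # NumPy slicing: rows, then columns
cv2.imwrite("cropped.png", cropped)
</syntaxhighlight>
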
=== Binarization ===

Binarization is the process of converting the many colors of an image to exactly two, commonly black and white, and can be thought of as extreme contrast. It can be done with OpenCV using one of its several image thresholding functions. The goal is a clear, high-contrast result that preserves all of the wanted information.

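For instance, Otsu's method in OpenCV picks the black/white threshold automatically; this is only a sketch, and the file names are placeholders.

<syntaxhighlight lang="python">
import cv2

# Convert to greyscale, then binarize with Otsu's method, which chooses
# the threshold automatically.  cv2.adaptiveThreshold() is an alternative
# when the lighting across the image is uneven.
grey = cv2.imread("cropped.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("binary.png", binary)
</syntaxhighlight>
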
=== Noise removal ===

Noise removal, also called denoising, removes speckles and other artifacts that could otherwise be mistaken for character outlines; ''denoising'' is also the term OpenCV uses for its dedicated noise removal functions.

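A sketch of what this could look like with OpenCV; the kernel size, the ''h'' parameter, and the file names are placeholders.

<syntaxhighlight lang="python">
import cv2

# A small median filter removes salt-and-pepper speckles;
# fastNlMeansDenoising() is OpenCV's dedicated (slower) denoising function.
binary = cv2.imread("binary.png", cv2.IMREAD_GRAYSCALE)
denoised = cv2.medianBlur(binary, 3)                  # 3x3 median filter
# denoised = cv2.fastNlMeansDenoising(binary, h=30)   # alternative
cv2.imwrite("denoised.png", denoised)
</syntaxhighlight>
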
=== Rescaling, rotation, and de-skewing ===

Tesseract is said to work best with approximately horizontal black-on-white text at least 20 px high, in an image of at least 300 DPI.

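A sketch of rescaling and rotating with OpenCV; the scale factor and skew angle are placeholders, and estimating the skew angle automatically is left out here.

<syntaxhighlight lang="python">
import cv2

# Upscale a low-resolution capture so characters are comfortably above
# ~20 px, then rotate the page back to horizontal by a known skew angle.
image = cv2.imread("denoised.png", cv2.IMREAD_GRAYSCALE)
image = cv2.resize(image, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)

angle = 3.5                                  # degrees, counter-clockwise
h, w = image.shape
matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(image, matrix, (w, h),
                          flags=cv2.INTER_CUBIC, borderValue=255)
cv2.imwrite("ready_for_ocr.png", deskewed)
</syntaxhighlight>
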
== Word recognition ==

Tesseract "reads" text by first analyzing the input image for outlines and detecting spacing and proportions. It then attempts to recognize each word in turn, scanning twice, with the aid of existing trained models; successful recognitions are fed back in as new training data.<ref>https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf</ref>

Below are several trained data model choices for Tesseract 4.0.0<ref>https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017</ref>. User contributions can also be found [https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-Contributions here].

* [https://github.com/tesseract-ocr/tessdata_fast tessdata_fast], integerized and shipped by default with Tesseract.
** Only the new LSTM-based OCR engine is supported.
** Whether it is the best option differs by language; for most languages, it is not.
** 8-bit (integer) models, so incremental training or fine-tuning is not possible.
** Includes both single-language and multilingual script models.
* [https://github.com/tesseract-ocr/tessdata_best tessdata_best], significantly slower.
** Only the new LSTM-based OCR engine is supported.
** For most languages, slightly better accuracy.
** Better for certain retraining scenarios for advanced users.
* [https://github.com/tesseract-ocr/tessdata tessdata].
** For the legacy Tesseract engine (--oem 0) as well as the new LSTM neural net based engine (--oem 1).

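As a sketch of how one of these models and engines might be selected from Python (via pytesseract, described further down): the <code>--oem</code> value, language code, and tessdata path below are placeholders that depend on which repository was installed.

<syntaxhighlight lang="python">
import pytesseract
from PIL import Image

# Pick the LSTM engine (--oem 1) and point Tesseract at a specific set of
# traineddata files.
config = "--oem 1 --tessdata-dir /usr/share/tesseract-ocr/4.00/tessdata"
text = pytesseract.image_to_string(Image.open("ready_for_ocr.png"),
                                   lang="eng", config=config)
print(text)
</syntaxhighlight>
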
== Implementing Tesseract within Apertium ==

This (theoretical) implementation directly integrates Tesseract into Apertium. Both are to be shipped together.

=== Using the Python wrapper library pytesseract ===

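A minimal sketch of how pytesseract could tie the pieces together: pre-process with OpenCV, hand the image to Tesseract, and pass the resulting text on to Apertium. The file name and language code are placeholders.

<syntaxhighlight lang="python">
import cv2
import pytesseract

# pytesseract accepts NumPy arrays as well as PIL images, so the OpenCV
# pre-processing result can be passed in directly.
image = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
_, image = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

text = pytesseract.image_to_string(image, lang="eng")
print(text)   # this string would then be fed into the Apertium pipeline
</syntaxhighlight>
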
=== Shipping Apertium with Tesseract, OpenCV, and pytesseract ===

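If all three components are shipped together, it would be worth checking for missing native dependencies at run time and failing with a clear message; the sketch below is only one way this could be done.

<syntaxhighlight lang="python">
import shutil

def ocr_available():
    """Report whether the OCR components assumed on this page are usable."""
    if shutil.which("tesseract") is None:
        return False, "the tesseract binary is not on PATH"
    try:
        import cv2          # opencv-python
        import pytesseract
    except ImportError as err:
        return False, "missing Python dependency: %s" % err.name
    return True, ""

ok, reason = ocr_available()
if not ok:
    print("OCR support disabled: " + reason)
</syntaxhighlight>
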
== Implementing Tesseract in the cloud ==

Tesseract OCR (and thus OpenCV) does not necessarily have to be integrated directly into and shipped with Apertium. This section describes an implementation that sets Tesseract up in the cloud, where it communicates with Apertium over an internet connection.

This implementation has the advantage of reducing the load on the device Apertium runs on, as the combination of Tesseract and OpenCV can be quite heavy on memory usage. However, making Tesseract available only where there is an internet connection decreases its accessibility.

=== Setting up Tesseract on a remote server ===

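A minimal sketch of what the cloud side could look like, here as a small Flask service that accepts an uploaded image and returns the recognized text; the route, port, and language handling are placeholders rather than an agreed Apertium API.

<syntaxhighlight lang="python">
from flask import Flask, request, jsonify
from PIL import Image
import pytesseract

app = Flask(__name__)

@app.route("/ocr", methods=["POST"])
def ocr():
    # The uploaded file and the optional language code come from the client.
    image = Image.open(request.files["image"].stream)
    lang = request.form.get("lang", "eng")
    return jsonify(text=pytesseract.image_to_string(image, lang=lang))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
</syntaxhighlight>
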
=== Letting Apertium communicate with Tesseract ===

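A sketch of the client side, assuming the service above: the photo is posted to the OCR endpoint and the recognized text is piped into an installed Apertium language pair. The URL and the <code>eng-spa</code> pair are placeholders.

<syntaxhighlight lang="python">
import subprocess
import requests

# Send the image to the remote OCR service and read back the text.
with open("photo.jpg", "rb") as f:
    reply = requests.post("http://ocr.example.org/ocr",
                          files={"image": f}, data={"lang": "eng"})
recognized = reply.json()["text"]

# Pipe the recognized text through an Apertium translation pair.
translated = subprocess.run(["apertium", "eng-spa"],
                            input=recognized, capture_output=True,
                            text=True).stdout
print(translated)
</syntaxhighlight>
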
== Alternatives and ideas ==

* Replacing OpenCV with other image manipulation programs, such as ImageMagick.
* Implementing Tesseract integration as an optional plug-in.

== Resources ==

== References ==

<references/>