Difference between revisions of "Training Tesseract"
Revision as of 21:10, 30 December 2015
Creating Training Text
To train Tesseract, first create some training text. Keep the text fairly short, because long texts make training very slow, but make sure it contains at least about 10 occurrences of each character you want the language trained on.
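One way to check the "at least 10 of each character" guideline is to count characters before training. The following is a minimal sketch, not part of Tesseract: count_rare_chars is a hypothetical helper name, and it assumes a grep that supports -o and a locale matching the encoding of the text (for example a UTF-8 locale for UTF-8 text).

```shell
# Hypothetical helper (not a Tesseract tool): list every non-whitespace
# character that occurs fewer than MIN times in FILE, with its count.
# Usage: count_rare_chars training_text.txt 10
count_rare_chars() {
  file="$1"
  min="${2:-10}"
  # grep -o prints each matching character on its own line;
  # uniq -c then prefixes each distinct character with its count.
  grep -o '[^[:space:]]' "$file" | sort | uniq -c |
    awk -v min="$min" '$1 < min { print $2, $1 }'
}
```

If the function prints nothing, every character in the file meets the threshold.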
Creating Training Images
Tesseract includes a tool, text2image, that generates training images from text. To use it, run:
text2image --text=training_text.txt --outputbase=[lang_code].[fontname].exp0 --font='Font Name' --fonts_dir=/path/to/your/fonts
On Ubuntu, fonts are usually in /usr/share/fonts, but this path is platform-specific. If you are training on multiple fonts, you will have to run this command once per font. For the purposes of text2image, italics count as a different font: you will have to run it once for Times and once for Times Italic, for example.
For example, if you are training for Tuvan with Times New Roman:
text2image --text=training_text.txt --outputbase=tyv.TimesNewRoman.exp0 --font='Times New Roman' --fonts_dir=/usr/share/fonts
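When training on several fonts, the command above can be wrapped in a loop. This is a sketch under assumptions: make_images, LANG_CODE, FONTS_DIR and the RUN switch are names invented here, the [fontname] part is derived by deleting spaces from the font name, and 'Times New Roman Italic' is only a guess at how the italic face is named on your system (text2image --list_available_fonts should show the exact names).

```shell
LANG_CODE=tyv
FONTS_DIR=/usr/share/fonts
# Assumed switch: RUN=echo only prints the commands (dry run);
# set RUN="" to actually execute them.
RUN=${RUN:-echo}

# Hypothetical helper: run text2image once per font name given as argument.
# Usage: make_images 'Times New Roman' 'Times New Roman Italic'
make_images() {
  for font in "$@"; do
    # "Times New Roman" -> "TimesNewRoman" for the output base name.
    fontname=$(printf '%s' "$font" | tr -d ' ')
    $RUN text2image --text=training_text.txt \
      --outputbase="$LANG_CODE.$fontname.exp0" \
      --font="$font" --fonts_dir="$FONTS_DIR"
  done
}
```

Keeping the dry run as the default makes it easy to check the generated file names before any images are written.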
Training
Generating .tr files
The first step in training is generating .tr files from the images you created. Do this by running:
tesseract [lang_code].[fontname].exp0.tif [lang_code].[fontname].exp0 box.train
You will have to run this command for each font.
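Running the command for each font can be automated by looping over the generated images. A minimal sketch, assuming all images follow the [lang_code].[fontname].exp0.tif naming used above; make_tr_files and the RUN switch are hypothetical names introduced for illustration.

```shell
# Assumed switch: RUN=echo only prints the commands (dry run);
# set RUN="" to actually execute them.
RUN=${RUN:-echo}

# Hypothetical helper: call tesseract in box.train mode for every
# training image of the given language code in the current directory.
# Usage: make_tr_files tyv
make_tr_files() {
  for tif in "$1".*.exp0.tif; do
    # Skip the unexpanded pattern when no image matches.
    if [ -e "$tif" ]; then
      # The output base is the image name without the .tif suffix.
      $RUN tesseract "$tif" "${tif%.tif}" box.train
    fi
  done
}
```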
Character set
To extract the character set from the box files, run the following, which writes a file named unicharset:
unicharset_extractor lang.*.exp0.box
font_properties file
You must provide a font_properties file in which each line describes one font, with each property given as 0 or 1, in the following format:
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
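As an illustration, the following sketch writes a font_properties file for the two Times New Roman faces. It assumes that <fontname> must match the [fontname] part of the training file names (TimesNewRoman in the Tuvan example) and that the italic face was generated under the name TimesNewRomanItalic; adjust both to the names you actually used.

```shell
# Columns: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
# Times New Roman is a proportional serif font, so only <serif> is 1;
# the italic face additionally sets <italic>.
cat > font_properties <<'EOF'
TimesNewRoman 0 0 0 1 0
TimesNewRomanItalic 1 0 0 1 0
EOF
```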
Clustering
Run the following three commands. The first one, shapeclustering, is only needed for Indic languages; skip it otherwise:
shapeclustering -F font_properties -U unicharset lang.*.exp0.tr
mftraining -F font_properties -U unicharset -O lang.unicharset lang.*.exp0.tr
cntraining lang.*.exp0.tr
Rename the files normproto, pffmtable, and inttemp (and shapetable, if one was written) so that they are prefixed with <lang_code> and a period, for example tyv.normproto.
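The renaming can be done in one loop. This is a small sketch; prefix_outputs is a hypothetical helper name, and shapetable is included only because some Tesseract versions also write that file. Files that do not exist are skipped.

```shell
# Hypothetical helper: add "<lang_code>." in front of the clustering outputs.
# Usage: prefix_outputs tyv
prefix_outputs() {
  for f in normproto pffmtable inttemp shapetable; do
    # Only rename files that were actually produced.
    if [ -e "$f" ]; then
      mv "$f" "$1.$f"
    fi
  done
}
```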
DAWG files
It is recommended that you have a wordlist (one word per line; it does not need to be complete) and a list of word bigrams (one bigram per line). Run:
wordlist2dawg wordlist lang.word-dawg lang.unicharset
wordlist2dawg bigram_list lang.bigram-dawg lang.unicharset
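If you do not have these lists yet, rough versions can be derived from any plain-text corpus in the language. The sketch below is only a starting point: make_wordlists is a hypothetical helper, it assumes a bigram is written as two space-separated words on one line, and it leaves punctuation attached to words, so the results usually need cleaning by hand.

```shell
# Hypothetical helper: build "wordlist" and "bigram_list" from a corpus.
# Usage: make_wordlists corpus.txt
make_wordlists() {
  corpus="$1"
  # One word per line, duplicates removed.
  tr -s '[:space:]' '\n' < "$corpus" | grep -v '^$' | sort -u > wordlist
  # Every pair of adjacent words within a line, duplicates removed.
  awk '{ for (i = 1; i < NF; i++) print $i, $(i + 1) }' "$corpus" |
    sort -u > bigram_list
}
```

The output file names match the ones passed to wordlist2dawg above.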
Final steps
Run combine_tessdata <lang_code>. (the trailing period is part of the argument, for example combine_tessdata tyv.) to get the final <lang_code>.traineddata file.
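Before combining, it can help to confirm that the files produced in the earlier steps are present under the right names. A minimal sketch, assuming the four files below are the ones the final step needs at minimum; missing_components and build_traineddata are hypothetical helper names, not Tesseract tools.

```shell
# Hypothetical helper: print each expected component that is missing.
missing_components() {
  for part in unicharset inttemp pffmtable normproto; do
    if [ ! -e "$1.$part" ]; then
      echo "$1.$part"
    fi
  done
}

# Hypothetical helper: run combine_tessdata only when nothing is missing.
# Usage: build_traineddata tyv
build_traineddata() {
  missing=$(missing_components "$1")
  if [ -n "$missing" ]; then
    echo "missing files:" $missing >&2
    return 1
  fi
  # Note the trailing period after the language code.
  combine_tessdata "$1."
}
```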