Training Tesseract

From Apertium

Latest revision as of 21:50, 30 December 2015

Creating Training Text

To train Tesseract, first create some training text. Keep the text short, because long texts make training very slow, but make sure it contains at least about 10 occurrences of each character you want the language trained on.
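
The "about 10 of each character" guideline can be checked mechanically. The following is a minimal sketch, not part of the Tesseract tooling: the function name rare_chars is made up for illustration, and it assumes a grep that supports -o (GNU and BSD grep do) and a UTF-8 locale so that multi-byte characters are counted as single characters.

```shell
# Hypothetical helper: list every non-whitespace character that occurs
# fewer than MIN times (default 10) in a training text, with its count.
rare_chars() {
    rc_file=$1
    rc_min=${2:-10}
    # One character per line, then count identical lines.
    grep -o '[^[:space:]]' "$rc_file" | sort | uniq -c |
        awk -v min="$rc_min" '$1 < min { print $2, $1 }'
}

# Example call (training_text.txt is the file name used on this page):
# rare_chars training_text.txt 10
```

An empty result means every character in the text meets the threshold; characters that never appear in the text at all are of course not reported.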

Creating Training Images

Tesseract ships a tool, text2image, that generates training images from text. To do this, run:

$ text2image --text=training_text.txt --outputbase=[lang_code].[fontname].exp0 --font='Font Name' --fonts_dir=/path/to/your/fonts

On Ubuntu, fonts are usually in /usr/share/fonts, but this path is platform-specific. If you are training on multiple fonts, you will have to run this command once per font. For the purposes of text2image, italics count as a separate font (you will have to run it once for Times and once for Times Italic, for example).

For example, if you are training for Tuvan with Times New Roman:

$ text2image --text=training_text.txt --outputbase=tyv.TimesNewRoman.exp0 --font='Times New Roman' --fonts_dir=/usr/share/fonts
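
Since the command has to be repeated per font, a small loop saves typing. This is a sketch under assumptions: the helper names outputbase and render_fonts are invented here, and the output name is derived by simply deleting spaces from the font name, which matches the TimesNewRoman example above but is only a convention.

```shell
# Hypothetical helper: build "[lang_code].[fontname].exp0" from a language
# code and a font name, dropping spaces from the font name.
outputbase() {
    printf '%s.%s.exp0\n' "$1" "$(printf '%s' "$2" | tr -d ' ')"
}

# Hypothetical helper: run text2image once for each font given.
# Usage: render_fonts LANG TEXT_FILE FONTS_DIR FONT...
render_fonts() {
    rf_lang=$1
    rf_text=$2
    rf_dir=$3
    shift 3
    for rf_font in "$@"; do
        text2image --text="$rf_text" \
            --outputbase="$(outputbase "$rf_lang" "$rf_font")" \
            --font="$rf_font" --fonts_dir="$rf_dir"
    done
}

# Example call; the italic font name depends on how your fonts are installed:
# render_fonts tyv training_text.txt /usr/share/fonts 'Times New Roman' 'Times New Roman Italic'
```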

Training

Generating .tr files

The first step in training is generating .tr files from the images you created. Do this by running:

$ tesseract [lang_code].[fontname].exp0.tif [lang_code].[fontname].exp0 box.train

You will have to run this command for each font. For Tuvan and Times New Roman:

$ tesseract tyv.TimesNewRoman.exp0.tif tyv.TimesNewRoman.exp0 box.train
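
With several fonts, the per-font invocation can be wrapped in a loop over the generated images. A minimal sketch, assuming all the .tif files for the language sit in the current directory and follow the [lang_code].[fontname].exp0.tif naming used above; the function name train_all_boxes is made up for illustration.

```shell
# Hypothetical helper: run the box.train step for every image of one language.
train_all_boxes() {
    tb_lang=$1
    for tb_tif in "$tb_lang".*.exp0.tif; do
        # Skip the unexpanded pattern when no image matches.
        [ -e "$tb_tif" ] || continue
        # Strip the .tif suffix to get the output base name.
        tesseract "$tb_tif" "${tb_tif%.tif}" box.train
    done
}

# Example call:
# train_all_boxes tyv
```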

Character set

To extract the character set, run:

$ unicharset_extractor [lang].*.exp0.box

The wildcard matches the box files for every font, so you can run the command verbatim no matter how many fonts you are training on.

For Tuvan:

$ unicharset_extractor tyv.*.exp0.box

Fontproperties file

You must create a file named font_properties, with one line per font in the following format:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

Fill in each property with 1 or 0 depending on whether the font has that property. For example, for Times New Roman Italic, a serif font:

timesitalic 1 0 0 1 0
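
A malformed line in this file is easy to miss, so a quick format check can be useful. The sketch below is not a Tesseract tool: check_font_properties is a hypothetical helper that only verifies the shape described above (six whitespace-separated fields, the last five being 0 or 1). It does not check that the font names match your training files.

```shell
# Hypothetical helper: validate the layout of a font_properties file.
# Prints a message per bad line and returns non-zero if any line is bad.
check_font_properties() {
    awk '
        BEGIN { bad = 0 }
        NF != 6 {
            bad = 1
            print "line " NR ": expected 6 fields, found " NF
            next
        }
        {
            for (i = 2; i <= 6; i++) {
                if ($i != "0" && $i != "1") {
                    bad = 1
                    print "line " NR ": field " i " must be 0 or 1"
                }
            }
        }
        END { exit bad }
    ' "$1"
}

# Example call:
# check_font_properties font_properties && echo "font_properties looks well-formed"
```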

Clustering

Run the following three commands (the first is needed only for Indic languages):

$ shapeclustering -F font_properties -U unicharset [lang].*.exp0.tr  # only for Indic languages
$ mftraining -F font_properties -U unicharset -O [lang].unicharset [lang].*.exp0.tr
$ cntraining [lang].*.exp0.tr

Rename the output files normproto, pffmtable, and inttemp so that each is prefixed with <lang_code> and a dot.

For Tuvan:

$ mftraining -F font_properties -U unicharset -O tyv.unicharset tyv.*.exp0.tr
$ cntraining tyv.*.exp0.tr
$ mv normproto tyv.normproto
$ mv pffmtable tyv.pffmtable
$ mv inttemp tyv.inttemp
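
The three renames can also be written as a loop, which is convenient when scripting the whole process for several languages. A minimal sketch; prefix_outputs is a hypothetical helper name, and it assumes the three files are in the current directory.

```shell
# Hypothetical helper: prefix the clustering outputs with "<lang_code>.".
prefix_outputs() {
    po_lang=$1
    for po_file in normproto pffmtable inttemp; do
        if [ -e "$po_file" ]; then
            mv "$po_file" "$po_lang.$po_file"
        else
            echo "warning: $po_file not found" >&2
        fi
    done
}

# Example call:
# prefix_outputs tyv
```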

DAWG files

It is recommended that you have a list of word bigrams (one per line) and a wordlist (one word per line; it does not need to be complete). Run:

$ wordlist2dawg wordlist [lang].word-dawg [lang].unicharset
$ wordlist2dawg bigram_list [lang].bigram-dawg [lang].unicharset

For Tuvan:

$ wordlist2dawg wordlist tyv.word-dawg tyv.unicharset
$ wordlist2dawg bigram_list tyv.bigram-dawg tyv.unicharset
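
If you do not already have these two lists, they can be derived from any plain-text corpus in the language. The sketch below rests on several assumptions: the helper names are invented, words are split on whitespace only (punctuation stays attached), bigrams are taken within a line, and each bigram is written as two space-separated words on one line.

```shell
# Hypothetical helper: print the unique whitespace-separated words of a corpus.
make_wordlist() {
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | sort -u
}

# Hypothetical helper: print the unique adjacent word pairs of a corpus,
# one pair per line, never pairing words across line breaks.
make_bigrams() {
    awk '{ for (i = 1; i < NF; i++) print $i, $(i + 1) }' "$1" | sort -u
}

# Example calls (corpus.txt is a placeholder file name):
# make_wordlist corpus.txt > wordlist
# make_bigrams corpus.txt > bigram_list
```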

Final steps

To get the final .traineddata file, run:

$ combine_tessdata [lang_code].

For Tuvan:

$ combine_tessdata tyv.

The output will be tyv.traineddata.
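
If the combined file comes out incomplete, a first thing to check is whether every file from the earlier steps is present under the right prefix. A minimal sketch: missing_components is a hypothetical helper, and the list of suffixes is simply the set of files produced on this page, not an authoritative list of what combine_tessdata accepts.

```shell
# Hypothetical helper: print the name of each expected component file
# that is missing from the current directory.
missing_components() {
    mc_lang=$1
    for mc_ext in unicharset inttemp pffmtable normproto word-dawg bigram-dawg; do
        if [ ! -e "$mc_lang.$mc_ext" ]; then
            printf '%s\n' "$mc_lang.$mc_ext"
        fi
    done
}

# Example call:
# missing_components tyv
```

No output means all six files were found.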