Target-language tagger training
Latest revision as of 08:16, 8 October 2014

En français

The package apertium-tagger-training-tools trains taggers based on both source and target language information. The resulting probability files are as good as supervised training for machine translation purposes, but much quicker to produce, and with less effort (you don't have to manually tag a corpus). In this description, a part-of-speech tagger for the source language (SL) will be trained using information from the target language (TL).

Language pair

This example presumes that you want to train a tagger for the Occitan ←→ Catalan (apertium-oc-ca) language pair in the Occitan → Catalan (oc-ca) direction. You will need to substitute values that refer to that pair with those for your chosen language pair.

You will need to download and install the language pair in question, either from SVN or from a package. The method implemented in this package is appropriate for those language pairs whose part-of-speech tagger was trained in an unsupervised manner.

To prepare and compile the required language-pair data, follow the instructions provided with the linguistic package itself. Usually you only need to type ./configure and make.

Building a target language model

If you're using apertium-trigrams-langmodel, then follow this section; if not, continue to the next.

Requirements: A raw corpus of the target language (ca.corpus.txt). If you are generating oc→ca, this would be ca; if you're generating a tagger for ca→es, it would be es. The corpus should be around 0.5 million words; a bigger corpus does not yield a significant improvement in accuracy.
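As a quick sanity check on corpus size, a word count is enough. The sketch below demonstrates on a tiny generated sample (the filenames are illustrative); run the same `wc` over your real ca.corpus.txt.

```shell
# Word-count sanity check, shown on a generated five-word sample file.
# For the real corpus: wc -w < ca.corpus.txt (aim for roughly 500,000 words).
printf 'lo gat manja lo peis\n' > sample.corpus.txt
wc -w < sample.corpus.txt
```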

$ apertium-trigrams-langmodel -t -i ca.corpus.txt > ca.corpus.lm

The following output should appear:

LOCALE: en_GB.UTF-8
Training........................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
......................
2868350 processed words.
1643982 distinct trigrams found.
833143 distinct bigrams found.
153714 distinct monograms found.
Writing word2id information ...
Writing Simple Good-Turing model for 1-grams ...
Writing 1-grams ...
Writing 2-grams...
Writing 3-grams...

The dots will continue to appear while the language model is being constructed, and the exact numbers will vary.

Preparing the source-language data

Requirements:

  • A raw corpus of the source language (corpus.txt — for example oc.corpus.txt)
  • A source dictionary (dic.dix — e.g. apertium-oc-ca.oc.dix)
  • A compiled dictionary of the source language (dic.bin — e.g. oc-ca.automorf.bin)
  • A tagger definition file (tagger.tsx — e.g. apertium-oc-ca.oc.tsx)
  • The file with the transfer rules to be used (trules.xml or .t1x file — e.g. apertium-oc-ca.oc-ca.t1x)

Note: dic.bin was generated when preparing the language pair data. dic.dix, tagger.tsx and trules.xml are provided with the language-pair package.

Generate the corpus file

$ apertium-tagger-gen-crp-file oc.corpus.txt dic.bin > corpus.crp

Should give the output:

Generating crp file
This may take some time. Please, take a cup of coffee and come back later.

Generate the dic file

$ apertium-tagger-gen-dic-file dic.dix dic.bin tagger.tsx > corpus.dic

Should give the output:

Generating dic file
This may take some time. Please, take a cup of coffee and come back later.

Extract regex rules

$ apertium-xtract-regex-trules trules.xml > regexp-trules.txt

This should print no output, but leaves the file regexp-trules.txt in the current directory.

Preparing the translation script

At this point you should have the following files:

$ ls -1 -sh
total 132M
 78M corpus.crp
 38M ca.corpus.lm
 17M corpus.txt
 16K corpus.dic
8.0K regexp-trules.txt

The TL-driven training algorithm needs to be provided with a translation script configured to translate a given disambiguation hypothesis into the TL.

Together with this package you can find the script 'translation-script-es-ca.sh'. It is configured to translate Spanish disambiguation hypotheses into Catalan.

Copy this file from the example/ directory, then edit it and change the DATA and DIRECTION variables. DATA must point to the folder holding the previously prepared language-pair data; DIRECTION must store the translation direction.

Assuming you are in the oc-tagger-data directory:

$ cp apertium-tagger-training-tools/example/translation-script-es-ca-batch-mode.sh .
$ mv translation-script-es-ca-batch-mode.sh translation-script-oc-ca-batch-mode.sh

Then edit translation-script-oc-ca-batch-mode.sh and change the variables described above.
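For the oc→ca example, the edited variables might end up looking like this (a sketch; the DATA path is an assumption, point it at wherever you prepared the pair data):

```shell
# Illustrative values inside translation-script-oc-ca-batch-mode.sh:
DATA=/home/user/apertium-oc-ca   # folder holding the compiled language-pair data
DIRECTION=oc-ca                  # translation direction
```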

Three-stage transfer

Note: The apertium-tagger-training-tools package does not currently work properly with three-stage transfer. It is possible to use it by making the following changes to the translation script, but the segmentation will be wrong and this is likely to affect the final quality of the tagger.

Change the line:

apertium-transfer $DATA/trules-$DIRECTION.xml $DATA/trules-$DIRECTION.bin $AUTOBIL |\

To:

apertium-transfer $DATA/$DIRECTION.t1x  $DATA/$DIRECTION.t1x.bin  $DATA/$DIRECTION.autobil.bin |\
apertium-interchunk $DATA/$DIRECTION.t2x  $DATA/$DIRECTION.t2x.bin |\
apertium-postchunk $DATA/$DIRECTION.t3x  $DATA/$DIRECTION.t3x.bin |\

Preparing the likelihood script

To estimate the likelihood of each translation, the TL-driven algorithm is provided with a script. In this package you can find an example of this script, called 'likelihood-script-catalan.sh'. It uses the apertium-trigrams-langmodel package to calculate the likelihood of each input string.

Again, assuming you're in the oc-tagger-data directory:

$ cp ../../apertium-tagger-training-tools/example/likelihood-script-catalan-batch-mode.sh .
$ mv likelihood-script-catalan-batch-mode.sh likelihood-script-occitan-batch-mode.sh

Change this script to use the desired data or to use another language model. You'll need to change the LMDATA variable to ca.corpus.lm. Keep in mind that the TL-driven algorithm will provide an input TL string to the script, and that it expects a likelihood in return, i.e. a double value, conveniently formatted using the appropriate locale.
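The contract between trainer and likelihood script is small: one TL string arrives on standard input, one double goes out on standard output. A minimal stub illustrating that interface (the function name and constant value are invented for illustration; a real script runs the string through apertium-trigrams-langmodel instead):

```shell
# Stub of the likelihood-script contract: read a TL sentence from stdin,
# write a single double to stdout. The constant is a stand-in for the
# log-likelihood a real language model would compute.
dummy_likelihood() {
  read -r sentence
  echo "-42.5"
}
echo "el gat menja peix" | dummy_likelihood
```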

Training through the TL-driven algorithm

Once you have collected all of these files, you can generate the .prob file using the following commands:

Warning: Some language pairs perform some orthographical operations after the transfer module. In those cases it is a good idea to provide the superficial forms (words) involved in those operations through the --supforms parameter.

In the following examples the --file argument specifies the prefix of the main files used; for example, if you have corpus.crp, corpus.dic and corpus.txt, the prefix will need to be 'corpus'.

Commands

Without disambiguation hypothesis pruning:

$ apertium-tagger-tl-trainer --train 500000 \
                             --tsxfile tagger.tsx \
                             --file <prefix> \
                             --tscript ./translation-script.sh \
                             --lscript ./likelihood-script.sh \
                             --trules regexp-trules.txt
With disambiguation hypothesis pruning:

To do this you will need an initial model, initialmodel.prob, estimated through another training method (Kupiec, Baum-Welch, ...); creating this initial model is described in the article about tagger training.

$ apertium-tagger-tl-trainer --tsxfile tagger.tsx \
                             --train 500000 \
                             --prune 1 1000 0.6 1 \
                             --initprob initialmodel.prob \
                             --file <prefix> \
                             --tscript ./translation-script.sh \
                             --lscript ./likelihood-script.sh \
                             --trules regexp-trules.txt

Output

This is some example output from the training process. If what you've got looks like this, then you're on the right track!

Command line: apertium-tagger-tl-trainer --train 300000 --tsxfile ../apertium-oc-ca.oc.tsx \
--file corpus --tscript ./translation-script-oc-ca-batch.sh --lscript ./likelihood-script-catalan-batch.sh \
--trules regex-trules.txt --norules --gen-paths oc.PATHS
Reading transfer rules from file 'regex-trules.txt' done.
Calculating ambiguity classes ...
92 states and 283 ambiguity classes
Target-language driven HMM-based part-of-speech training method.......
   Training corpus will be processed for 500000 words
   HMM parameters will be calculated each time 0 words are processed
   Calculated parameter will be saved in files with the name 'corpus.N.prob'
   Are fobidden and enforce rules going to be used? 1
   Translation script is: './translation-script-oc-ca-batch.sh'
   Likelihood estimation script is: './likelihood-script-catalan-batch.sh'
Ready for training...... go!

Initialising allowed bigrams ... done.
Error: conversion error
Error: conversion error
Error: There is a path with a null translation: ^se<prn><pro><ref><p3><mf><sp>$ ^èsser<vbser><pri><p3><sg>$ 
^pas<adv>$ ^que<cnjsub>$ ^lo<det><def><m><sg>$ ^miralh<n><m><sg>$ ^fisèl<adj><m><sg>$ ^de<pr>$
SEGMENT: s' es pas que lo miralh fisèl de 
Warning: This segment has no OK translations. Skipping
SEGMENT: s' es pas que lo miralh fisèl de 

Warning: This segment has no translations into TL 1. Skipping
SEGMENT: un parlaire adult 
Warning: This segment has no OK translations. Skipping
SEGMENT: a de  
Error: There is a path with a null translation: ^çò<detnt>$ ^de<pr>$
SEGMENT: çò d' 
Warning: This segment has no OK translations. Skipping
SEGMENT: çò d' 
Warning: This segment has no OK translations. Skipping
SEGMENT: que pòdon instituir la lenga . 
Warning: This segment has no OK translations. Skipping
SEGMENT: a l' un còp l' instituir e 

Troubleshooting

Error: pcre_compile missing )

This does not mean that pcre_compile is missing, but that pcre_compile didn't find a ) that it expected.

Errors like this can pop up if your trules.xml has unescaped # characters in it. Use \# instead.
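A crude way to spot candidates is to grep the rules file for literal # characters. Demonstrated on a generated sample below (the attribute value is invented); run the same grep on your real trules.xml:

```shell
# List lines containing a literal '#' so they can be checked for escaping.
printf '<lit v="pagar#los"/>\n' > sample-trules.xml
grep -n '#' sample-trules.xml
```

Some matches may already be escaped or harmless, so check each line by hand before editing.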