Difference between revisions of "Target-language tagger training"
Revision as of 11:41, 31 January 2008
The package apertium-tagger-training-tools trains taggers based on both source and target language information. For machine translation purposes, the resulting probability files are as good as those obtained through supervised training, but they are much quicker to produce and require less effort. In this description, a part-of-speech tagger for the source language (SL) will be trained using information from the target language (TL).
Language pair
This example presumes that you want to train a tagger for the Occitan ←→ Catalan (apertium-oc-ca) language pair in the Occitan → Catalan (oc-ca) direction. Substitute the values that refer to that pair with those for your chosen language pair.
You will need to download and install the language pair in question, either from SVN or from one of the stable packages. The method implemented in this package is appropriate for those languages whose part-of-speech tagger was trained in an unsupervised manner.
To prepare and compile the required language-pair data, follow the instructions provided in the linguistic package itself. Usually you only need to type ./configure and make.
Building a target language model
This section assumes that you use apertium-trigrams-langmodel, the language model software provided within this package. Another language model could be used instead; if you plan to do so, skip this section and continue with the next.
Requirements: A raw corpus of the target language (corpus.txt). If you are generating a tagger for oc→ca, this would be a Catalan (ca) corpus; if you are generating a tagger for ca→es, it would be a Spanish (es) corpus.
$ apertium-trigrams-langmodel -t -i ca.corpus.txt > ca.corpus.lm
The following output should appear:
LOCALE: en_GB.UTF-8
Training........................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
......................
2868350 processed words.
1643982 distinct trigrams found.
833143 distinct bigrams found.
153714 distinct monograms found.
Writing word2id information ...
Writing Simple Good-Turing model for 1-grams ...
Writing 1-grams ...
Writing 2-grams...
Writing 3-grams...
The dots keep appearing while the language model is being constructed, and the numbers will vary with your corpus.
Preparing the source-language data
Requirements:
- A raw corpus of the source language (corpus.txt, for example oc.corpus.txt)
- A source dictionary (dic.dix, e.g. apertium-oc-ca.oc.dix)
- A compiled dictionary of the source language (dic.bin, e.g. oc-ca.automorf.bin)
- A tagger definition file (tagger.tsx, e.g. apertium-oc-ca.oc.tsx)
- The file with the transfer rules to be used (trules.xml or .t1x file, e.g. apertium-oc-ca.oc-ca.t1x)
Note: dic.bin was generated when preparing the language pair data. dic.dix, tagger.tsx and trules.xml are provided with the language-pair package.
Generate the corpus file
$ apertium-tagger-gen-crp-file corpus.txt dic.bin > lang.crp
This should give the output:
Generating crp file
This may take some time. Please, take a cup of coffee and come back later.
Generate the dic file
$ apertium-tagger-gen-dic-file dic.dix dic.bin tagger.tsx > lang.dic
This should give the output:
Generating dic file
This may take some time. Please, take a cup of coffee and come back later.
Extract regex rules
$ apertium-xtract-regex-trules trules.xml > regexp-trules.txt
This should give no output, but it leaves behind the file regexp-trules.txt.
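The three preparation commands above can be collected in one place. The sketch below only prints them for review instead of running them. It assumes that other language pairs follow the same file-naming pattern as the oc-ca examples in the requirements list, which may not hold for every pair. The function name print_prep_commands and the output names (SL code plus .crp and .dic) are made up for this illustration.

```shell
#!/bin/sh
# Print (not run) the three preparation commands for a language pair.
# Hypothetical helper: the file-name patterns are inferred from the oc-ca
# examples above and are an assumption for any other pair.
print_prep_commands() {
    sl="$1"      # source language code, e.g. oc
    pair="$2"    # translation direction, e.g. oc-ca
    dix="apertium-$pair.$sl.dix"      # source dictionary
    bin="$pair.automorf.bin"          # compiled source dictionary
    tsx="apertium-$pair.$sl.tsx"      # tagger definition file
    t1x="apertium-$pair.$pair.t1x"    # transfer rules
    echo "apertium-tagger-gen-crp-file $sl.corpus.txt $bin > $sl.crp"
    echo "apertium-tagger-gen-dic-file $dix $bin $tsx > $sl.dic"
    echo "apertium-xtract-regex-trules $t1x > regexp-trules.txt"
}

print_prep_commands oc oc-ca
```

Check the printed file names against the files actually shipped with your language pair before running the commands.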
Preparing the translation script
At this point you should have the following files:
$ ls -1 -sh
total 132M
78M corpus.crp
38M ca.corpus.lm
17M corpus.txt
16K oc.dic
8.0K regex-trules.txt
The TL-driven training algorithm needs a translation script that is configured to translate a given disambiguation hypothesis into the TL.
This package includes the script 'translation-script-es-ca.sh', which is configured to translate Spanish disambiguation hypotheses into Catalan.
Copy this file from the example/ directory, then edit it and change the DATA and DIRECTION variables. DATA must point to the folder holding the language-pair data previously prepared; DIRECTION must store the translation direction.
Assuming you are in the oc-tagger-data directory:
$ cp apertium-tagger-training-tools/example/translation-script-es-ca.sh .
$ mv translation-script-es-ca.sh translation-script-oc-ca.sh
Then edit translation-script-oc-ca.sh and change the DATA and DIRECTION variables as described above.
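The edit can also be done non-interactively. The following is a minimal sketch, assuming that the example script assigns each variable on its own line beginning with DATA= or DIRECTION=; inspect the script first, since that layout is an assumption. The function name set_script_vars and the path in the example call are hypothetical.

```shell
#!/bin/sh
# Rewrite the DATA and DIRECTION assignments in a copy of the translation script.
# Assumption: each variable is assigned on a line that starts with NAME=.
# The values must not contain '|' or '&', which are special in the sed expression.
set_script_vars() {
    script="$1"; data="$2"; direction="$3"
    # Edit into a temporary file, then copy the contents back with cat so
    # that the permissions of the original script are preserved.
    sed -e "s|^DATA=.*|DATA=\"$data\"|" \
        -e "s|^DIRECTION=.*|DIRECTION=\"$direction\"|" \
        "$script" > "$script.tmp" &&
    cat "$script.tmp" > "$script" &&
    rm -f "$script.tmp"
}

# Example call (the data path is a placeholder):
# set_script_vars translation-script-oc-ca.sh /path/to/apertium-oc-ca oc-ca
```

Lines that do not start with DATA= or DIRECTION= are left untouched, so the rest of the script is unchanged.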
Note: The apertium-tagger-training-tools package does not currently work properly with three-stage transfer. It is possible to use it by making the following changes to the translation script, but the segmentation will be wrong, and this is likely to affect the final quality of the tagger.
Change the line:
apertium-transfer $DATA/trules-$DIRECTION.xml $DATA/trules-$DIRECTION.bin $AUTOBIL |\
To:
apertium-transfer $DATA/$DIRECTION.t1x $DATA/$DIRECTION.t1x.bin $DATA/$DIRECTION.autobil.bin |\
apertium-interchunk $DATA/$DIRECTION.t2x $DATA/$DIRECTION.t2x.bin |\
apertium-postchunk $DATA/$DIRECTION.t3x $DATA/$DIRECTION.t3x.bin |\
Preparing the likelihood script
To estimate the likelihood of each translation, the TL-driven algorithm is provided with a script. This package includes an example of such a script, called 'likelihood-script-catalan.sh'. It uses the apertium-trigrams-langmodel package to calculate the likelihood of each input string.
Again, assuming you're in the oc-tagger-data directory:
$ cp ../../apertium-tagger-training-tools/example/likelihood-script-catalan.sh .
$ mv likelihood-script-catalan.sh likelihood-script-occitan.sh
Change this script to use the desired data or to use another language model. You'll need to change the LMDATA variable to ca.corpus.lm. Keep in mind that the TL-driven algorithm will provide an input TL string to the script and that it expects a likelihood in return, i.e. a double value, formatted according to the appropriate locale.
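Before training, it can help to confirm that whatever the likelihood script prints really is a single number. The check below is a sketch only: it accepts a dot as the decimal separator, which is an assumption, since the separator the trainer expects depends on the locale in use. The function name looks_like_double is hypothetical, and the commented example assumes the script reads its input string from standard input, which is not confirmed here.

```shell
#!/bin/sh
# Succeed if the argument looks like one floating-point number written with
# a dot as decimal separator (e.g. 0.0012, 3, 1.5e-07).
# Assumption: the dot separator; adjust the pattern for other locales.
looks_like_double() {
    printf '%s\n' "$1" |
        grep -Eq '^[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?$'
}

# Example: check what the likelihood script printed for one test sentence.
# How the script receives its input is an assumption; adapt as needed.
# value=$(echo "una frase de prova" | ./likelihood-script-occitan.sh)
# looks_like_double "$value" || echo "unexpected output: $value" >&2
```

A value such as 0,5 is rejected by this pattern, which makes locale mismatches visible early.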
Training through the TL-driven algorithm
Once all of the above is in place, you can generate the .prob file using the following commands:
Warning: Some language pairs perform orthographical operations after the transfer module. In those cases it is a good idea to provide the superficial forms (words) involved in those operations through the --supforms parameter.
In the following examples the --file argument specifies the prefix of the main files used. For example, if you have corpus.crp, corpus.dic, corpus.lm, and corpus.txt, the prefix will need to be 'corpus'.
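Since all main files must share one prefix, a quick check for missing files can save a failed training run. The helper below is a sketch: missing_for_prefix is a hypothetical name, and the default extensions (crp and dic, the two files generated in the preparation steps) are an assumption about what the trainer reads. Pass other extensions explicitly if needed.

```shell
#!/bin/sh
# Print the files that are missing for a given --file prefix.
# Usage: missing_for_prefix PREFIX [EXT ...]
# Assumption: with no extensions given, only PREFIX.crp and PREFIX.dic
# are checked.
missing_for_prefix() {
    prefix="$1"; shift
    [ "$#" -gt 0 ] || set -- crp dic
    for ext in "$@"; do
        [ -f "$prefix.$ext" ] || echo "$prefix.$ext"
    done
}

# Example: report anything missing for the prefix 'corpus'.
# missing_for_prefix corpus crp dic lm txt
```

An empty result means every checked file exists under that prefix.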
Without disambiguation hypothesis pruning
$ apertium-tagger-tl-trainer --train 500000 \
    --tsxfile tagger.tsx \
    --file <prefix> \
    --tscript translation-script.sh \
    --lscript likelihood-script.sh \
    --trules regexp-trules.txt
With disambiguation hypothesis pruning
To do this you will need an initial model, initialmodel.prob, estimated through another training method (Kupiec, Baum-Welch, ...). Creating this initial model is described in the article about tagger training.
$ apertium-tagger-tl-trainer --tsxfile tagger.tsx \
    --train 500000 \
    --prune 1 1000 0.6 1 \
    --initprob initialmodel.prob \
    --file <prefix> \
    --tscript translation-script.sh \
    --lscript likelihood-script.sh \
    --trules regexp-trules.txt
Further reading
- Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz, Mikel L. Forcada. "Speeding up target-language driven part-of-speech tagger training for machine translation". In Lecture Notes in Computer Science 4293 (Advances in Artificial Intelligence, Proceedings of MICAI 2006, 5th Mexican International Conference on Artificial Intelligence), p. 844-854, November 13-17, 2006, Apizaco, Mexico.
- Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz, Mikel L. Forcada. "Exploring the use of target-language information to train the part-of-speech tagger of machine translation systems". In Lecture Notes in Computer Science 3230 (Advances in Natural Language Processing, Proceedings of EsTAL - España for Natural Language Processing), p. 137-148, October 20-22, 2004, Alicante, Spain.