Tagger training

Creating a corpus

Wikipedia

A basic corpus can be retrieved from Wikipedia as follows:

$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | \
    sed 's/$/\n/g' | sed 's/\[\[.*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' | \
    sed 's/&.*;/ /g' > mycorpus.txt
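
To see what the link-stripping stages of the pipeline above do, it can help to run them on a single line of wikitext. The following is a minimal sketch: the function name strip_wiki_markup and the sample sentence are made up for illustration, and only the sed expressions come from the command above.

```shell
# Hypothetical helper: the link- and entity-stripping stages of the pipeline above.
strip_wiki_markup() {
    sed 's/\[\[.*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' | sed 's/&.*;/ /g'
}

# Sample line (made up) with one piped link and one plain link.
printf '%s\n' 'Kaapstad is die [[hoofstad|wetgewende hoofstad]] van [[Suid-Afrika]].' \
    | strip_wiki_markup
# prints: Kaapstad is die wetgewende hoofstad van Suid-Afrika.
```

Note that the patterns are greedy: on a line with several piped links, 's/\[\[.*|//g' removes everything from the first [[ up to the last |, so some text between links is lost. For a rough training corpus this is probably acceptable.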

Other sources

Some pre-processed corpora can be found here and here.

Writing a TSX file

A .tsx file is a tag definition file: it maps the fine-grained tags from the morphological analyser onto coarse tags for the tagger. The DTD is in tagger.dtd, although it is probably easier to look at one of the existing files from another language pair.
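
As a rough illustration of the shape of such a file, the sketch below writes a very small tag definition file. The file name sketch.en.tsx, the three labels and the forbid rule are invented for illustration, and the element names are as recalled from existing .tsx files, so check them against tagger.dtd before relying on them.

```shell
# Write a tiny illustrative tag definition file (the file name is hypothetical).
cat > sketch.en.tsx <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="english">
  <tagset>
    <!-- every analysis whose tags start with <n> gets the coarse tag NOUN -->
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
    <def-label name="DET">
      <tags-item tags="det.*"/>
    </def-label>
    <def-label name="VERB">
      <tags-item tags="vblex.*"/>
    </def-label>
  </tagset>
  <forbid>
    <!-- illustrative rule only: never allow DET directly followed by VERB -->
    <label-sequence>
      <label-item label="DET"/>
      <label-item label="VERB"/>
    </label-sequence>
  </forbid>
</tagger>
EOF
```

A real file needs many more labels. Running apertium-validate-tagger on the result should tell you whether it conforms to the DTD.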

The file should be in the language pair directory. For English-Afrikaans, for example, it is called apertium-en-af.en.tsx for the English tagger and apertium-en-af.af.tsx for the Afrikaans tagger.
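
The naming pattern can be written down as a small shell function. This is only a sketch of the convention described above, and tsx_name is a made-up helper name.

```shell
# Hypothetical helper: print the expected .tsx file name for one side of a pair.
tsx_name() {
    pair=$1   # language pair, e.g. en-af
    lang=$2   # language of the tagger, e.g. en
    printf 'apertium-%s.%s.tsx\n' "$pair" "$lang"
}

tsx_name en-af en   # prints: apertium-en-af.en.tsx
tsx_name en-af af   # prints: apertium-en-af.af.tsx
```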

Training the tagger

A brief note on the various kinds of training that you can do:

  • Unsupervised: uses a large (hundreds of thousands of words) untagged corpus and the iterative Baum-Welch algorithm in a wholly unsupervised manner.
  • Supervised: uses a medium-sized tagged corpus.
  • Using apertium-tagger-trainer

Unsupervised

First, make a directory called <lang>-tagger-data. Put the corpus you downloaded there under a name like <lang>.crp.txt. Make sure the corpus is raw text with one sentence per line.
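
The directory setup can be sketched as a small shell function. The name prepare_tagger_data is made up, and the usage line assumes the corpus was saved as mycorpus.txt as in the Wikipedia example. The function also prints the line and word counts, so you can judge whether the corpus is large enough.

```shell
# Hypothetical helper: create <lang>-tagger-data and copy a corpus into it.
prepare_tagger_data() {
    lang=$1      # language code, e.g. en
    corpus=$2    # raw-text corpus, one sentence per line
    dir="${lang}-tagger-data"
    mkdir -p "$dir" || return 1
    cp "$corpus" "$dir/$lang.crp.txt" || return 1
    # Report the size of the copied corpus.
    printf '%s: %s lines, %s words\n' "$dir/$lang.crp.txt" \
        "$(wc -l < "$dir/$lang.crp.txt" | tr -d ' ')" \
        "$(wc -w < "$dir/$lang.crp.txt" | tr -d ' ')"
}

# Usage (assumes mycorpus.txt exists in the current directory):
#   prepare_tagger_data en mycorpus.txt
```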

Once the corpus is in place, you need a Makefile that specifies how to generate the probability file. You can grab one from another language package. For apertium-en-af I took the Makefile from apertium-en-ca. The file that you need is called en-ca-unsupervised.make.

Copy it into your main language pair directory under an appropriate name, then edit it and change the variables at the top of the file, BASENAME, LANG1, and LANG2. Everything else should be fine.
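
The copy-and-edit step can also be scripted. The sketch below assumes that each of the three variables is set at the start of a line in the form NAME=value (or NAME = value). Check the Makefile you copied, since that layout is an assumption, as are the helper name adapt_makefile and the values shown in the usage line.

```shell
# Hypothetical helper: copy a training Makefile and rewrite its three variables.
# Assumes each variable is assigned at the start of a line as NAME=value.
adapt_makefile() {
    src=$1; dst=$2; base=$3; lang1=$4; lang2=$5
    sed -e "s/^BASENAME[[:space:]]*=.*/BASENAME=$base/" \
        -e "s/^LANG1[[:space:]]*=.*/LANG1=$lang1/" \
        -e "s/^LANG2[[:space:]]*=.*/LANG2=$lang2/" \
        "$src" > "$dst"
}

# Usage (the source path depends on where apertium-en-ca is checked out):
#   adapt_makefile ../apertium-en-ca/en-ca-unsupervised.make \
#       en-af-unsupervised.make apertium-en-af en af
#   head -n 5 en-af-unsupervised.make
```

Writing to a new file, instead of editing in place, keeps the original Makefile untouched if the substitution does not do what you expect.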

Now run:

$ make -f en-af-unsupervised.make

and wait... you should get some output like:

Generating en-tagger-data/en.dic
This may take some time. Please, take a cup of coffee and come back later.
apertium-validate-dictionary apertium-en-af.en.dix
apertium-validate-tagger apertium-en-af.en.tsx
lt-expand apertium-en-af.en.dix | grep -v "__REGEXP__" | grep -v ":<:" |\
        awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' | apertium-destxt >en.dic.expanded
lt-proc -a en-af.automorf.bin <en.dic.expanded | \
        apertium-filter-ambiguity apertium-en-af.en.tsx > en-tagger-data/en.dic
rm en.dic.expanded;
apertium-destxt < en-tagger-data/en.crp.txt | lt-proc en-af.automorf.bin > en-tagger-data/en.crp
apertium-validate-tagger apertium-en-af.en.tsx
apertium-tagger -t 8 \
                           en-tagger-data/en.dic \
                           en-tagger-data/en.crp \
                           apertium-en-af.en.tsx \
                           en-af.prob;
Calculating ambiguity classes...
Kupiec's initialization of transition and emission probabilities...
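
In the log above, apertium-tagger -t 8 runs the unsupervised training for eight iterations and writes the result to en-af.prob. Once it has finished, a quick sanity check is to run the new probability file on a sentence. The function below is a sketch: try_tagger is a made-up name, the test sentence is arbitrary, and the file names in the usage line are the ones from the log.

```shell
# Hypothetical helper: check that a .prob file exists, then tag one sentence.
try_tagger() {
    prob=$1       # probability file produced by training, e.g. en-af.prob
    automorf=$2   # compiled analyser, e.g. en-af.automorf.bin
    if [ ! -s "$prob" ]; then
        echo "missing or empty probability file: $prob" >&2
        return 1
    fi
    # Analyse the sentence, then disambiguate it with the trained tagger.
    echo "This is a test." | apertium-destxt | lt-proc "$automorf" \
        | apertium-tagger -g "$prob"
}

# Usage (run from the language pair directory after training has finished):
#   try_tagger en-af.prob en-af.automorf.bin
```

Each word in the output should have a single analysis left. If ambiguous words still show several analyses, the tagger is probably not being applied.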

Supervised

Using apertium-tagger-trainer

There is a package called apertium-tagger-trainer that trains taggers using both source and target language information. The resulting probability files are as good as those from supervised training, but are much quicker to produce and take less effort.