Tagger training
Once your dictionaries are of a reasonable size, say around 3,000 lemmata in total, it is worth training the tagger. To do this you will need two things: a decent-sized corpus, either tagged or untagged, and a .tsx file. The basic instructions are given below.
Creating a corpus
Wikipedia
A basic corpus can be retrieved from Wikipedia as follows:
$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | sed 's/$/\n/g' | sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' | sed 's/&[#A-Za-z0-9]*;/ /g' > mycorpus.txt
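The clean-up steps in the pipeline above can be wrapped in a small shell function, so that they can be tried on a sample line before being run on the full dump. This is only a rough sketch, not a complete wiki-markup stripper. The function name clean_wiki_lines and the sample sentence are made up for illustration, the blank-line step is left out to keep the output short, and the patterns use character classes rather than `.*` so that a line with several links or entities is not over-deleted.

```shell
#!/bin/sh
# clean_wiki_lines: hypothetical helper (not part of Apertium).
# Reads wiki text on stdin and writes roughly cleaned text on stdout.
clean_wiki_lines() {
    grep '^[A-Z]' |                  # keep lines that start with a capital letter
    sed 's/\[\[[^]|]*|//g' |         # [[target|label]] becomes label]]
    sed 's/\]\]//g' |                # drop the closing ]]
    sed 's/\[\[//g' |                # drop any remaining opening [[
    sed 's/&[#A-Za-z0-9]*;/ /g'      # replace entities such as &quot; with a space
}

# Try it on one sample line before processing a whole dump.
printf '%s\n' 'Die [[Suid-Afrika|land]] is &quot;groot&quot;.' | clean_wiki_lines
```

To process a real dump, the function would take the place of the grep and sed stages, for example `bzcat dump.xml.bz2 | clean_wiki_lines > mycorpus.txt`.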
Other sources
Some pre-processed corpora can be found here and here.
Writing a TSX file
A .tsx file is a tag definition file: it turns the fine tags from the morphological analyser into coarse tags for the tagger. The DTD is in tagger.dtd, although it is probably easier to look at one of the pre-written files in other language pairs.
The file should be in the language pair directory. For English-Afrikaans, for example, it is called apertium-en-af.en.tsx for the English tagger and apertium-en-af.af.tsx for the Afrikaans tagger.
The TSX file defines a set of "coarse tags" for groups of "fine tags". This is done because the POS tagging module does not need as much information as the fine tags carry. It also allows the user to apply a set of restrictions (forbid rules) or enforcements (enforce rules). For example, you can forbid a relative adverb at the start of a sentence (SENT RELADV), or forbid a pronoun after a noun (NOM PRNPERS).
You can also write lexical rules. For example, in Afrikaans the word "deur" is ambiguous: one meaning is "by" (as a preposition) and the other is "door" (as a noun). So we can define two coarse tags, DEURNOM and DEURPR, and then a forbid rule that says "forbid 'door' before 'the'", that is, DEURNOM followed by a determiner.
It is worth considering this file carefully and probably also consulting with a linguist, as the tagger can make a big difference to the quality of the final translation.
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="afrikaans">
  <tagset>
    <def-label name="DEURNOM" closed="true">
      <tags-item lemma="deur" tags="n.*"/>
    </def-label>
    <def-label name="DEURPR" closed="true">
      <tags-item lemma="deur" tags="pr"/>
    </def-label>
    <def-label name="NOM" closed="true">
      <tags-item tags="n.*"/>
    </def-label>
    <def-label name="PRNPERS" closed="true">
      <tags-item tags="prpers.*"/>
    </def-label>
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label>
  </tagset>
  <forbid>
    <label-sequence>
      <label-item label="NOM"/>
      <label-item label="PRNPERS"/>
    </label-sequence>
    <label-sequence>
      <label-item label="DEURNOM"/>
      <label-item label="DET"/>
    </label-sequence>
  </forbid>
</tagger>
You will need enough coarse tags to cover all the fine tags in your dictionaries.
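One mistake that is easy to make in a TSX file is to use a label in a forbid rule that was never defined in the tagset. The sketch below lists such labels with plain grep and sed. It is a rough text-level check, not an XML parser: the function name undefined_labels is made up, only def-label definitions are recognised (any other definition elements that the DTD allows are ignored), and the sample file is invented for the demonstration. It complements apertium-validate-tagger rather than replacing it.

```shell
#!/bin/sh
# undefined_labels FILE: hypothetical helper (not part of Apertium).
# Prints every label used in a <label-item> that has no <def-label> in FILE.
undefined_labels() {
    grep -o '<label-item label="[^"]*"' "$1" |
    sed 's/.*label="\([^"]*\)"/\1/' |
    sort -u |
    while read -r label; do
        # Report the label only if no definition with that name exists.
        grep -q "<def-label name=\"$label\"" "$1" || printf '%s\n' "$label"
    done
}

# Invented sample: PRNPERS is used in a rule but only PRPERS is defined.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tagger name="sample">
  <tagset>
    <def-label name="NOM" closed="true"><tags-item tags="n.*"/></def-label>
    <def-label name="PRPERS" closed="true"><tags-item tags="prpers.*"/></def-label>
  </tagset>
  <forbid>
    <label-sequence>
      <label-item label="NOM"/>
      <label-item label="PRNPERS"/>
    </label-sequence>
  </forbid>
</tagger>
EOF
undefined_labels "$sample"
rm -f "$sample"
```

An empty result means that every label referenced in a rule has a matching def-label.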
Training the tagger
A brief note on the various kinds of training that you can do:
- Unsupervised: this uses a large (hundreds of thousands of words) untagged corpus and the iterative Baum-Welch algorithm in a wholly unsupervised manner. This is the least effective way of training the tagger, but it is also the cheapest in terms of time and resources.
- Supervised: this uses a medium-sized tagged corpus.
- Using apertium-tagger-trainer: this trains the tagger using both source and target language information.
Unsupervised
First, make a directory called <lang>-tagger-data. Put the corpus you downloaded in there with a name like <lang>.crp.txt. Make sure the corpus is in raw text format with one sentence per line.
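If the corpus is not yet one sentence per line, a naive splitter can get it most of the way there. The sketch below breaks a line after a full stop, exclamation mark or question mark that is followed by a space. The function name split_sentences is made up, and the rule is deliberately crude: abbreviations such as "e.g. this" will be split in the wrong place, so the result is worth checking by eye.

```shell
#!/bin/sh
# split_sentences: hypothetical helper (not part of Apertium).
# Reads text on stdin and starts a new line after ., ! or ? plus spaces.
split_sentences() {
    sed 's/\([.!?]\) \{1,\}/\1\
/g'
}

# Sample input with three sentences on one line.
printf '%s\n' 'This is one sentence. Here is another! And a third?' | split_sentences
```

Piping the result through `wc -w` gives a quick word count, which is useful for judging whether the corpus is large enough for unsupervised training.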
Once your corpus is in place, you need a Makefile that specifies how to generate the probability file. You can take one from another language package. For apertium-en-af I took the Makefile from apertium-en-ca. The file that you need is called en-ca-unsupervised.make.
Copy it into your main language pair directory under an appropriate name (for example en-af-unsupervised.make), then edit it and change the variables at the top of the file: BASENAME, LANG1, and LANG2. Everything else should be fine.
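The edit can also be scripted. The sketch below assumes that the three variables are plain NAME=value assignments at the start of a line, which has not been checked against the real en-ca-unsupervised.make. The function name set_make_vars is made up, and the values apertium-en-af, en and af are inferred from the file names used elsewhere on this page. It writes the result to standard output instead of editing the file in place.

```shell
#!/bin/sh
# set_make_vars FILE BASENAME LANG1 LANG2: hypothetical helper.
# Prints FILE with the three assignments replaced; other lines pass through.
set_make_vars() {
    sed -e "s/^BASENAME *=.*/BASENAME=$2/" \
        -e "s/^LANG1 *=.*/LANG1=$3/" \
        -e "s/^LANG2 *=.*/LANG2=$4/" "$1"
}

# Invented stand-in for the top of the copied Makefile.
tmp=$(mktemp)
printf '%s\n' 'BASENAME=apertium-en-ca' 'LANG1=en' 'LANG2=ca' > "$tmp"
set_make_vars "$tmp" apertium-en-af en af
rm -f "$tmp"
```

Redirecting the output to a new file, for example `set_make_vars en-ca-unsupervised.make apertium-en-af en af > en-af-unsupervised.make`, leaves the original untouched.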
Now run:
$ make -f en-af-unsupervised.make
and wait... you should get some output like:
Generating en-tagger-data/en.dic
This may take some time. Please, take a cup of coffee and come back later.
apertium-validate-dictionary apertium-en-af.en.dix
apertium-validate-tagger apertium-en-af.en.tsx
lt-expand apertium-en-af.en.dix | grep -v "__REGEXP__" | grep -v ":<:" |\
awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' | apertium-destxt >en.dic.expanded
lt-proc -a en-af.automorf.bin <en.dic.expanded | \
apertium-filter-ambiguity apertium-en-af.en.tsx > en-tagger-data/en.dic
rm en.dic.expanded;
apertium-destxt < en-tagger-data/en.crp.txt | lt-proc en-af.automorf.bin > en-tagger-data/en.crp
apertium-validate-tagger apertium-en-af.en.tsx
apertium-tagger -t 8 \
en-tagger-data/en.dic \
en-tagger-data/en.crp \
apertium-en-af.en.tsx \
en-af.prob;
Calculating ambiguity classes...
Kupiec's initialization of transition and emission probabilities...
Applying forbid and enforce rules...
Training (Baum-Welch)...
Applying forbid and enforce rules...
After this you should have an en-af.prob file, which can be used with the apertium-tagger module.
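As a quick check that the new probability file works, you can push a sentence through the analyser and the tagger. The sketch below assumes the tagging mode of apertium-tagger (the -g option) and reuses the file names from the training output on this page. The function name tag_text and the sample sentence are made up, and the demo is skipped when the tools or files are not present.

```shell
#!/bin/sh
# tag_text AUTOMORF PROB: hypothetical helper.
# Reads plain text on stdin, analyses it and prints the tagger's output.
tag_text() {
    apertium-destxt | lt-proc "$1" | apertium-tagger -g "$2"
}

# Only run the demo when the Apertium tools and both files are available.
if command -v apertium-tagger >/dev/null 2>&1 &&
   [ -f en-af.automorf.bin ] && [ -f en-af.prob ]; then
    echo 'The door is open.' | tag_text en-af.automorf.bin en-af.prob
else
    echo 'Apertium tools or data files not found; skipping the demo.'
fi
```

Each ambiguous word in the output should now carry a single analysis. If many words are still tagged wrongly, the forbid and enforce rules in the .tsx file are a good place to look first.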
Supervised
Using apertium-tagger-trainer
There is a package called apertium-tagger-trainer that trains taggers using both source and target language information. The resulting probability files are as good as those from supervised training, but they are much quicker to produce and take less effort.