Tagger training
Creating a corpus
Wikipedia
A basic corpus can be retrieved from Wikipedia as follows:
$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | sed 's/$/\n/g' | sed 's/\[\[.*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' | sed 's/&.*;/ /g' > mycorpus.txt
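The patterns `\[\[.*|` and `&.*;` in the command above are greedy, so on a line with several links or entities they can also remove the text in between. As a rough sketch only, the same clean-up can be written with narrower patterns. The helper name `clean_wiki`, the sample sentence, and the final `tr -s` step are assumptions added for illustration, and the blank-line step (`s/$/\n/g`) is left out:

```shell
#!/bin/sh
# Sketch: strip basic wiki markup from lines read on stdin.
# clean_wiki is a hypothetical helper name, not an Apertium tool.
clean_wiki() {
    grep '^[A-Z]' |                  # keep lines that start with a capital letter
    sed 's/\[\[[^]|]*|//g' |         # [[target|label]] -> label]]
    sed 's/\]\]//g; s/\[\[//g' |     # drop the remaining ]] and [[
    sed 's/&[#A-Za-z0-9]*;/ /g' |    # replace entities such as &amp; with a space
    tr -s ' '                        # squeeze runs of spaces (added for tidiness)
}

# Made-up sample line standing in for one line of the dump.
sample='Die [[Kaapstad|Kaap]] is in [[Suid-Afrika]] &amp; mooi.'
cleaned=$(printf '%s\n' "$sample" | clean_wiki)
printf '%s\n' "$cleaned"
```

To use it on a real dump, the function would take the place of the `grep` and `sed` stages after `bzcat`.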
Other sources
Some pre-processed corpora can be found here and here.
Writing a TSX file
A .tsx file is a tag definition file: it maps the fine tags from the morphological analyser onto coarse tags for the tagger. The DTD is in tagger.dtd, although it is probably easier to take a look at one of the pre-written files in other language pairs.
The file should be in the language pair directory. For English-Afrikaans, for example, it should be called apertium-en-af.en.tsx for the English tagger and apertium-en-af.af.tsx for the Afrikaans tagger.
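To give an idea of the shape of such a file, here is a minimal sketch that writes a skeleton into a scratch directory. The two coarse tags and the fine-tag patterns (`n.*`, `vblex.*`) are assumptions for illustration; a usable file needs the tags your own analyser emits and usually many more labels, so check it against tagger.dtd and an existing language pair:

```shell
#!/bin/sh
# Sketch: write a minimal tag definition file to a scratch directory.
# The tag patterns below are illustrative assumptions, not a real tagset.
workdir=$(mktemp -d)
tsx="$workdir/apertium-en-af.af.tsx"

cat > "$tsx" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="afrikaans">
  <tagset>
    <!-- every fine tag sequence starting with n becomes the coarse tag NOUN -->
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
    <!-- every fine tag sequence starting with vblex becomes VERB -->
    <def-label name="VERB">
      <tags-item tags="vblex.*"/>
    </def-label>
  </tagset>
</tagger>
EOF

# Count the coarse tags that were defined.
labels=$(grep -c '<def-label ' "$tsx")
echo "$labels coarse tags defined in $tsx"
```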
Training the tagger
Unsupervised
First, make a directory called <lang>-tagger-data. Put the corpus you downloaded in there with a name like <lang>.crp.txt. Make sure the corpus is in raw text format with one sentence per line.
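If the text is not yet split into sentences, something like the following can serve as a crude first pass. It is only a sketch: the language code `af`, the scratch directory, and the sample text are assumptions, and splitting on `.`, `!` and `?` will break wrongly after abbreviations:

```shell
#!/bin/sh
# Sketch: create <lang>-tagger-data and write a one-sentence-per-line corpus.
lang=af                               # assumed language code
workdir=$(mktemp -d)                  # scratch stand-in for the language pair directory
mkdir -p "$workdir/${lang}-tagger-data"
corpus="$workdir/${lang}-tagger-data/${lang}.crp.txt"

# Made-up sample standing in for the downloaded text.
raw='Dit is een. Dit is twee! En drie?'

printf '%s\n' "$raw" |
    awk '{ gsub(/[.!?] +/, "&\n"); print }' |   # break after sentence-final punctuation
    sed 's/[[:space:]]*$//' |                   # trim trailing whitespace
    grep -v '^$' > "$corpus"                    # drop empty lines

sentences=$(grep -c '' "$corpus")
echo "$sentences sentences written to $corpus"
```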
Once you have your corpus in there, you need a Makefile that specifies how to generate the probability file. You can grab one from another language package. For apertium-en-af, I took the Makefile from apertium-en-ca. The file that you need is called en-ca-unsupervised.make.
Copy it into your main language pair directory under an appropriate name (for example en-af-unsupervised.make), then edit it and change the variables at the top of the file: BASENAME, LANG1, and LANG2. Everything else should be fine.
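The copy-and-edit step can also be scripted. The sketch below works on a three-line stand-in for the Makefile header, because the exact layout of en-ca-unsupervised.make is assumed here rather than quoted. Check the real file before relying on the patterns:

```shell
#!/bin/sh
# Sketch: copy the Makefile under a new name and rewrite its three variables.
workdir=$(mktemp -d)
make_in="$workdir/en-ca-unsupervised.make"
make_out="$workdir/en-af-unsupervised.make"

# Stand-in for the top of the real file (assumed layout).
printf 'BASENAME=apertium-en-ca\nLANG1=en\nLANG2=ca\n' > "$make_in"

# Rewrite the variables while copying; the rest of the file passes through unchanged.
sed -e 's/^BASENAME *=.*/BASENAME=apertium-en-af/' \
    -e 's/^LANG1 *=.*/LANG1=en/' \
    -e 's/^LANG2 *=.*/LANG2=af/' \
    "$make_in" > "$make_out"

grep -E '^(BASENAME|LANG1|LANG2)' "$make_out"
```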
Now run:
$ make -f en-af-unsupervised.make
and wait... you should get some output like:
Generating en-tagger-data/en.dic
This may take some time. Please, take a cup of coffee and come back later.
apertium-validate-dictionary apertium-en-af.en.dix
apertium-validate-tagger apertium-en-af.en.tsx
lt-expand apertium-en-af.en.dix | grep -v "__REGEXP__" | grep -v ":<:" |\
awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' | apertium-destxt >en.dic.expanded
lt-proc -a en-af.automorf.bin <en.dic.expanded | \
apertium-filter-ambiguity apertium-en-af.en.tsx > en-tagger-data/en.dic
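The `grep` and `awk` stages in that output reduce the expanded dictionary to a list of surface forms. The sketch below runs them on a few made-up lines that imitate the general shape of `lt-expand` output. The sample entries are assumptions, not real dictionary content, and the Apertium tools themselves are not called:

```shell
#!/bin/sh
# Sketch: reproduce the filtering stages from the Makefile output on sample data.
# These lines are invented stand-ins for what lt-expand prints.
sample='cats:cat<n><pl>
colour:>:color<n><sg>
color:<:colour<n><sg>
x__REGEXP__:x<num>'

surface=$(printf '%s\n' "$sample" |
    grep -v "__REGEXP__" |                          # skip regular-expression entries
    grep -v ":<:" |                                 # skip entries marked :<:
    awk 'BEGIN{FS=":>:|:"}{print $1 ".";}')         # keep the surface form, add a full stop

printf '%s\n' "$surface"
```

Each surviving surface form ends up on its own line, followed by a full stop, before being passed on to `apertium-destxt`.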