Tagger training

From Apertium

Revision as of 15:01, 7 August 2007

Creating a corpus

Wikipedia

A basic corpus can be retrieved from Wikipedia as follows:

$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' |
    sed 's/$/\n/g' | sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
    sed 's/&[^;]*;/ /g' > mycorpus.txt
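
Before running the whole dump through the pipeline, it can help to try the cleanup steps on a single line. The sketch below wraps them in a shell function; the name clean_wikitext is hypothetical, and the sample sentence is made up. It assumes slightly narrower patterns ([^]|]* and [^;]*) so that one match cannot swallow several links or entities on the same line, and it leaves out the step that adds a blank line after each sentence.

```shell
# Hypothetical helper: the link and entity cleanup from the pipeline above,
# reading wiki text on stdin and writing plain text to stdout.
clean_wikitext() {
  grep '^[A-Z]' |                  # keep only lines that start with a capital letter
    sed 's/\[\[[^]|]*|//g' |       # [[target|label]] -> label]]
    sed 's/\]\]//g' |              # drop closing link brackets
    sed 's/\[\[//g' |              # drop opening link brackets
    sed 's/&[^;]*;/ /g'            # replace entities such as &amp; with a space
}

# Made-up sample line with one piped link and one plain link.
printf '%s\n' 'Die [[Suid-Afrika|land]] is [[groot]].' | clean_wikitext
```

If the sample comes out as plain text with no brackets left, the same function can be placed after bzcat in place of the individual grep and sed commands.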

Other sources

Some pre-processed corpora can be found here.

Writing a TSX file

A .tsx file is a tag definition file: it maps the fine-grained tags produced by the morphological analyser to the coarse tags used by the tagger. The DTD is in tagger.dtd, although it is probably easier to start from one of the existing files in another language pair.

The file should be in the language pair directory and be named after the pair and the language it tags. For English-Afrikaans, for example, that is apertium-en-af.en.tsx for the English tagger and apertium-en-af.af.tsx for the Afrikaans tagger.
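
As a rough illustration of the fine-to-coarse mapping, the sketch below prints a minimal skeleton that could serve as a starting point. The function name tsx_skeleton is hypothetical, and the two labels and their tag patterns (n.*, vblex.*) are illustrative assumptions rather than the contents of any real language pair; the element names should be checked against tagger.dtd.

```shell
# Hypothetical helper: print a minimal .tsx skeleton for the tagger named in $1.
# The labels and tag patterns are illustrative, not copied from a real pair.
tsx_skeleton() {
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="$1">
  <tagset>
    <!-- analyses whose fine tags start with "n" get the coarse tag NOUN -->
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
    <!-- analyses whose fine tags start with "vblex" get the coarse tag VERB -->
    <def-label name="VERB">
      <tags-item tags="vblex.*"/>
    </def-label>
  </tagset>
</tagger>
EOF
}

# Print the skeleton for an Afrikaans tagger.
tsx_skeleton afrikaans
```

Redirecting the output to apertium-en-af.af.tsx gives a file with the expected name. A real tag definition file needs a label for every tag sequence the analyser can produce, so the pre-written files remain the better reference.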

Training the tagger