Difference between revisions of "Tagger training"


Revision as of 15:01, 7 August 2007

Creating a corpus

Wikipedia

A basic corpus can be retrieved from Wikipedia as follows:

$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' |
  sed 's/$/\n/g' | sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
  sed 's/&[^;]*;/ /g' > mycorpus.txt
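
To see what the markup-stripping steps do, it can help to run them on a single line first. The following is only a sketch: the function name clean_wiki_markup and the sample sentence are made up for illustration, and the patterns are assumed to be restricted to a single link or entity so that a line containing several links is not over-trimmed.

```shell
# Hypothetical helper: strips wiki link and entity markup from lines on stdin.
clean_wiki_markup() {
  sed 's/\[\[[^]|]*|//g' |   # drop the target part of [[target|label]]
    sed 's/\]\]//g' |        # drop closing link brackets
    sed 's/\[\[//g' |        # drop remaining opening link brackets
    sed 's/&[^;]*;/ /g'      # replace entities such as &quot; with a space
}

# Made-up sample line with a piped link, a plain link and two entities.
printf '%s\n' 'Die [[Suid-Afrika|land]] is &quot;groot&quot; en [[mooi]].' | clean_wiki_markup
```

The output should contain only the link labels and plain words, with each entity replaced by a space.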

Other sources

Some pre-processed corpora can be found here.

Writing a TSX file

A .tsx file is a tag definition file: it maps the fine-grained tags from the morphological analyser to coarse tags for the tagger. The DTD is in tagger.dtd, although it is probably easier to look at one of the existing files in other language pairs.

The file should be in the language pair directory. For English-Afrikaans, for example, it should be called apertium-en-af.en.tsx for the English tagger and apertium-en-af.af.tsx for the Afrikaans tagger.
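
The naming scheme can be checked from the shell. This is a minimal sketch, assuming it is run from inside the language pair directory; tsx_name is a hypothetical helper, not part of Apertium.

```shell
# Hypothetical helper: prints the expected .tsx filename
# for a language pair (e.g. en-af) and one of its languages (e.g. af).
tsx_name() {
  printf 'apertium-%s.%s.tsx\n' "$1" "$2"
}

# Report which of the two tagger definition files are still missing.
for lang in en af; do
  f=$(tsx_name en-af "$lang")
  [ -f "$f" ] || echo "missing: $f"
done
```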

Training the tagger
