Difference between revisions of "Tagger training"


Revision as of 15:02, 7 August 2007

Creating a corpus

Wikipedia

A basic corpus can be retrieved from Wikipedia as follows:

$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' |
  sed 's/$/\n/g' | sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
  sed 's/&[^; ]*;/ /g' > mycorpus.txt
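
To see what the markup-stripping steps do, it can help to run them on a single line before processing a whole dump. The following is a minimal sketch, not part of the original recipe: the function name clean_wiki_line and the sample sentence are made up for illustration, and the patterns are a non-greedy variant of the sed chain, so that a line with several links or entities is not eaten from the first match to the last.

```shell
# Hypothetical helper: strips wiki link markup and HTML entities from stdin.
clean_wiki_line() {
  # 1. [[target|label]] -> label]]   (drop the link target and the pipe)
  # 2. remove the remaining ]]
  # 3. remove the remaining [[
  # 4. replace entities such as &quot; with a space
  sed 's/\[\[[^]|]*|//g; s/\]\]//g; s/\[\[//g; s/&[^; ]*;/ /g'
}

# Made-up sample line with a piped link, a plain link and two entities.
printf '%s\n' 'Die [[Kaapstad|Kaap]] is &quot;mooi&quot; en [[groot]].' | clean_wiki_line
```

The piped link is reduced to its label, the plain link keeps its text, and each entity becomes a single space.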

Other sources

Some pre-processed corpora can be found at http://corpora.informatik.uni-leipzig.de/download.html and http://wt.jrc.it/lt/Acquis/.

Writing a TSX file

A .tsx file is a tag definition file: it maps the fine-grained tags produced by the morphological analyser onto the coarse tags used by the tagger. The DTD is in tagger.dtd, although it is probably easier to look at one of the pre-written files in another language pair.

The file should be in the language pair directory, with one file per language. For English-Afrikaans, for example, the files are called apertium-en-af.en.tsx for the English tagger and apertium-en-af.af.tsx for the Afrikaans tagger.
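
As a rough illustration of the shape such a file might take, the sketch below writes a skeleton Afrikaans .tsx file. Everything in it is an assumption to be checked against tagger.dtd and the existing language pairs: the element names (tagger, tagset, def-label, tags-item) are given from memory of other pairs, the label names and tag patterns are invented, and a temporary directory stands in for the real language pair directory.

```shell
# Hypothetical stand-in for the apertium-en-af language pair directory.
pairdir=$(mktemp -d)

# Skeleton tag definition file; labels and patterns are illustrative only
# and should be validated against tagger.dtd.
cat > "$pairdir/apertium-en-af.af.tsx" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="afrikaans">
  <tagset>
    <!-- Coarse tag NOUN: any fine tag starting with "n" -->
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
    <!-- Coarse tag DET: any fine tag starting with "det" -->
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label>
  </tagset>
</tagger>
EOF

echo "wrote $pairdir/apertium-en-af.af.tsx"
```

Each def-label groups a family of fine tags under one coarse name, which is the mapping described above.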

Training the tagger