Difference between revisions of "Tagger training"

Revision as of 15:01, 7 August 2007

Creating a corpus

Wikipedia

A basic corpus can be retrieved from Wikipedia as follows:

$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' |
  sed 's/$/\n/g' | sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
  sed 's/&[^;]*;/ /g' > mycorpus.txt
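
To see what the markup-stripping steps do before running them over a whole dump, they can be tried on a single line. The following is a minimal sketch, not part of the original recipe: the function name clean_wiki_markup and the sample sentence are made up for illustration, and the link and entity patterns are written to stop at the first closing delimiter so that several links on one line are handled one at a time.

```shell
#!/bin/sh
# Strip wiki link markup and HTML entities from text on standard input.
# clean_wiki_markup is a hypothetical helper name, not an Apertium tool.
clean_wiki_markup() {
  # 1. drop the target part of piped links:  [[target|label]] -> label]]
  # 2. drop closing link brackets
  # 3. drop remaining opening link brackets
  # 4. replace each HTML entity (&amp; and so on) with a space
  sed -e 's/\[\[[^]|]*|//g' \
      -e 's/\]\]//g' \
      -e 's/\[\[//g' \
      -e 's/&[^;]*;/ /g'
}

# Hypothetical sample line standing in for one line of the dump.
printf '%s\n' 'Die [[Suid-Afrika|land]] het [[elf]] tale &amp; dialekte.' | clean_wiki_markup
```

The link targets and brackets disappear, the link labels stay, and the entity is replaced by a space. Templates, tables and other wiki markup are not handled, so the resulting corpus will still contain some noise.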

Other sources

Some pre-processed corpora can be found here.

Writing a TSX file

A .tsx file is a tag definition file: it maps the fine-grained tags from the morphological analyser onto coarse tags for the tagger. The DTD is in tagger.dtd, although it is probably easier to look at one of the pre-written files in other language pairs.

The file should live in the language pair directory. For the English-Afrikaans pair, for example, it is called apertium-en-af.en.tsx for the English tagger and apertium-en-af.af.tsx for the Afrikaans tagger.
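
As a rough starting point, a skeleton file can be written out from the shell. This is only a sketch under assumptions: the element names (tagger, tagset, def-label, tags-item) are recalled from tagset files in existing language pairs and should be checked against tagger.dtd, and the tagger name, the two coarse tags NOUN and VERB, and the fine-tag patterns n.* and vblex.* are hypothetical placeholders rather than a working Afrikaans tagset.

```shell
#!/bin/sh
# File name follows the naming convention for the Afrikaans tagger
# in the English-Afrikaans pair.
tsx=apertium-en-af.af.tsx

# Write a hypothetical two-label tagset; the quoted 'EOF' keeps the
# shell from expanding anything inside the here-document.
cat > "$tsx" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical minimal tagset; check element names against tagger.dtd. -->
<tagger name="afrikaans">
  <tagset>
    <!-- Fine tag sequences starting with "n" are grouped as the coarse tag NOUN. -->
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
    <!-- Fine tag sequences starting with "vblex" are grouped as the coarse tag VERB. -->
    <def-label name="VERB">
      <tags-item tags="vblex.*"/>
    </def-label>
  </tagset>
</tagger>
EOF

echo "wrote $tsx"
```

A real tagset needs a label for every category the morphological analyser can produce, so in practice it is usually quicker to copy a file from a related language pair and adapt it.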

Training the tagger

@@ Line 24: / Line 24: @@
 ==Training the tagger==
-[[en:Documentation]]
+[[Category:Documentation]]
