Difference between revisions of "Tagger training"

Revision as of 15:02, 7 August 2007

Creating a corpus

Wikipedia

A basic corpus can be retrieved from Wikipedia as follows:

$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | \
    sed 's/$/\n/g' | sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | \
    sed 's/\[\[//g' | sed 's/&[#A-Za-z0-9]*;/ /g' > mycorpus.txt
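To see what the link-stripping steps do, it can help to run them on a single line first. The following is a minimal sketch, not part of the original recipe: the function name strip_wiki_markup and the sample sentence are hypothetical, and the patterns are written to stop at the first `|` or `;` so that text between two links on the same line is not swallowed. The grep filter and the blank-line step are left out.

```shell
#!/bin/sh
# Hypothetical helper: applies only the link and entity clean-up steps.
strip_wiki_markup() {
    sed -e 's/\[\[[^]|]*|//g' \
        -e 's/\]\]//g' \
        -e 's/\[\[//g' \
        -e 's/&[#A-Za-z0-9]*;/ /g'
}

# Made-up sample line with a piped link, a plain link and an entity.
printf '%s\n' 'Die [[Suid-Afrika|land]] is [[groot]] en&nbsp;mooi.' | strip_wiki_markup
# prints: Die land is groot en mooi.
```

For a piped link only the display text after the `|` is kept, and for a plain link the brackets are simply removed.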

Other sources

Some pre-processed corpora can be found at http://corpora.informatik.uni-leipzig.de/download.html and http://wt.jrc.it/lt/Acquis/.

Writing a TSX file

A .tsx file is a tag definition file: it maps the fine-grained tags from the morphological analyser onto coarse tags for the tagger. The DTD is in tagger.dtd, although it is probably easier to start from one of the pre-written files in another language pair.

The file should be in the language pair directory. For English-Afrikaans, for example, it is called apertium-en-af.en.tsx for the English tagger and apertium-en-af.af.tsx for the Afrikaans tagger.
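As a starting point, a skeleton file with the expected name can be written from the shell. This is only a sketch under assumptions: the element and attribute names (tagger, tagset, def-label, tags-item) are recalled from existing language pairs and should be checked against tagger.dtd, while the label names NOUN and DET and the tag patterns n.* and det.* are hypothetical placeholders that must match whatever the morphological analyser actually emits.

```shell
#!/bin/sh
# Build the file name from the pair and language, following the naming convention.
pair="en-af"
lang="af"
tsx="apertium-${pair}.${lang}.tsx"

# Write a two-label skeleton; labels and tag patterns are placeholders.
cat > "$tsx" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="afrikaans">
  <tagset>
    <!-- coarse tag NOUN groups every fine tag beginning with n -->
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
    <!-- closed="true" marks a closed word class -->
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label>
  </tagset>
</tagger>
EOF

echo "wrote $tsx"
```

Each def-label names one coarse tag, and its tags-item entries list the fine tag sequences that collapse into it.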

Training the tagger

@@ Line 14: / Line 14: @@
 ===Other sources===
-Some pre-processed corpora can be found [http://corpora.informatik.uni-leipzig.de/download.html here].
+Some pre-processed corpora can be found [http://corpora.informatik.uni-leipzig.de/download.html here] and [http://wt.jrc.it/lt/Acquis/ here].
 ==Writing a TSX file==
