Difference between revisions of "Tagger training"
Revision as of 15:02, 7 August 2007
Creating a corpus
Wikipedia
A basic corpus can be retrieved from Wikipedia as follows:
$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | sed 's/$/\n/g' | sed 's/\[\[.*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' | sed 's/&.*;/ /g' > mycorpus.txt
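The patterns `\[\[.*|` and `&.*;` in the pipeline above are greedy, so on a line with several piped links or entities they can delete the text in between. A minimal sketch of a stricter variant follows, assuming a POSIX shell with grep and sed; the function name clean_wiki and the sample sentence are made up for illustration, and the blank-line step (`sed 's/$/\n/g'`) is left out here.

```shell
# clean_wiki is a hypothetical helper name, not part of any Apertium tool.
# It reads wiki text on stdin and writes cleaned text to stdout.
clean_wiki() {
  # 1. keep only lines that start with a capital letter
  # 2. drop the link-target half of piped links: [[target|label]] -> label]]
  # 3. drop the remaining ]] and [[ markers
  # 4. replace each entity such as &quot; with a space
  grep '^[A-Z]' |
    sed -e 's/\[\[[^]|]*|//g' \
        -e 's/\]\]//g' \
        -e 's/\[\[//g' \
        -e 's/&[^;]*;/ /g'
}

# Made-up sample line with one piped link and one plain link.
printf '%s\n' 'Kaapstad is die [[hoofstad|wetgewende hoofstad]] van [[Suid-Afrika]].' | clean_wiki
# prints: Kaapstad is die wetgewende hoofstad van Suid-Afrika.
```

On a real dump it could be used as `bzcat afwiki-20070508-pages-articles.xml.bz2 | clean_wiki > mycorpus.txt`.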
Other sources
Some pre-processed corpora can be found at http://corpora.informatik.uni-leipzig.de/download.html and http://wt.jrc.it/lt/Acquis/.
Writing a TSX file
A .tsx file is a tag definition file: it turns the fine tags from the morphological analyser into coarse tags for the tagger. The DTD is in tagger.dtd, although it is probably easier to take a look at one of the pre-written files in other language pairs.
The file should be in the language pair directory. For English-Afrikaans, for example, it is called apertium-en-af.en.tsx for the English tagger and apertium-en-af.af.tsx for the Afrikaans tagger.
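The naming scheme and the general shape of such a file can be sketched as follows. This is only an illustration under assumptions: the helper names tsx_name and write_skeleton_tsx are hypothetical, the element names (tagger, tagset, def-label, tags-item) are assumed to match those used in the pre-written files, and the fine tags n and vblex are placeholders that depend on the morphological analyser. tagger.dtd remains the authority on what is valid.

```shell
# Hypothetical helper: build the file name for a pair and a tagger language.
# $1 = language pair (e.g. en-af), $2 = language of the tagger (e.g. af)
tsx_name() {
  printf 'apertium-%s.%s.tsx\n' "$1" "$2"
}

# Hypothetical helper: write a two-category skeleton to file $1,
# using $2 as the tagger name. Element names are assumed; check tagger.dtd.
write_skeleton_tsx() {
  cat > "$1" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="$2">
  <tagset>
    <!-- coarse tag NOUN: any fine tag sequence starting with n (assumed tag) -->
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
    <!-- coarse tag VERB: any fine tag sequence starting with vblex (assumed tag) -->
    <def-label name="VERB">
      <tags-item tags="vblex.*"/>
    </def-label>
  </tagset>
</tagger>
EOF
}

tsx_name en-af af
# prints: apertium-en-af.af.tsx
```

Run from the language pair directory, `write_skeleton_tsx "$(tsx_name en-af af)" afrikaans` would create the Afrikaans file, which can then be filled in by comparing against an existing pair.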