Tagger training

From Apertium
Revision as of 18:54, 19 May 2007 by Francis Tyers

Creating a corpus

Wikipedia

A basic corpus can be extracted from a Wikipedia database dump as follows (the example uses the Afrikaans dump of 8 May 2007):

$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | \
    sed 's/$/\n/g' | sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | \
    sed 's/\[\[//g' | sed 's/&[#A-Za-z0-9]*;/ /g' > mycorpus.txt
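
To see what the cleaning steps do, it can help to run them on a single sample line before processing a whole dump. The sketch below wraps the filters in a shell function; the name clean_wiki_text and the sample sentence are made up for illustration. It assumes non-greedy link and entity patterns and leaves out the step that appends a blank line after each sentence.

```shell
# Hypothetical helper (not part of Apertium): reads wiki text on stdin,
# keeps lines that start with a capital letter, and strips link markup.
clean_wiki_text() {
    grep '^[A-Z]' |
    sed -e 's/\[\[[^]|]*|//g' \
        -e 's/\]\]//g' \
        -e 's/\[\[//g' \
        -e 's/&[#A-Za-z0-9]*;/ /g'
    # 1st expression: drop the target of a piped link, "[[target|"
    # 2nd and 3rd:    drop the remaining "]]" and "[[" brackets
    # 4th:            replace XML entities such as "&quot;" with a space
}

# Made-up sample input: one sentence with two links, one XML tag line.
printf '%s\n' 'Die [[Kaap|Kaapstad]] is in [[Suid-Afrika]].' '<page>' | clean_wiki_text
```

The sample prints "Die Kaapstad is in Suid-Afrika." and drops the XML tag line, because that line does not start with a capital letter.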

Other sources

Some pre-processed corpora can be found here.

Writing a TSX file

Training the tagger
