Tagger training
Revision as of 18:54, 19 May 2007 by Francis Tyers
Creating a corpus
Wikipedia
A basic plain-text corpus can be extracted from a Wikipedia database dump (here the Afrikaans dump of 8 May 2007) as follows:
$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | sed 's/$/\n/g' | sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' | sed 's/&[#A-Za-z0-9]*;/ /g' > mycorpus.txt
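The markup-stripping steps of the pipeline above can be tried on a single line before running them over a whole dump. The following is a minimal sketch, not part of the original recipe: the function name clean_wikitext and the sample line are made up for illustration, and the patterns are deliberately narrow so that each substitution stays inside one link or one entity.

```shell
# Hypothetical helper (name chosen for this example): reads wiki text on
# stdin and writes it to stdout with link markup and XML entities removed.
clean_wikitext() {
  # 1. drop the target of piped links, [[target|label]] -> label]]
  # 2. drop closing link brackets
  # 3. drop remaining opening link brackets, [[target -> target
  # 4. replace XML entities such as &quot; or &amp; with a space
  sed -e 's/\[\[[^]|]*|//g' \
      -e 's/\]\]//g' \
      -e 's/\[\[//g' \
      -e 's/&[#A-Za-z0-9]*;/ /g'
}

# Sample input, invented for this example.
printf '%s\n' 'See [[Main Page|the main page]] and [[Help]] &amp; more.' | clean_wikitext
```

Each entity is replaced by a space, so runs of spaces can appear in the output. If that matters for training, they can be squeezed afterwards, for example with tr -s ' '.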
Other sources
Some pre-processed corpora can be found here.