Tagger training
Revision as of 18:54, 19 May 2007
Creating a corpus
Wikipedia
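A basic corpus can be extracted from a Wikipedia database dump (here the Afrikaans dump of 8 May 2007) as follows. The pipeline keeps lines that begin with a capital letter, appends a blank line to each, and strips wiki link markup and HTML entities; the \n in the first sed replacement relies on GNU sed: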
$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | sed 's/$/\n/g' | sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' | sed 's/&[^; ]*;/ /g' > mycorpus.txt
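The clean-up steps above can be tried on a few sample lines before running them on a full dump. The following is a minimal sketch, assuming a POSIX shell with grep and sed. The function name clean_wikitext is a hypothetical helper, not something from the original page, and the sample sentences are made up. The link and entity patterns are written to stop at the first | or ; they meet rather than the last one on the line, and the blank-line step is left out to keep the sample output compact.

```shell
# Hypothetical helper (name is an assumption, not from the original page):
# applies the link and entity clean-up to whatever arrives on standard input.
clean_wikitext() {
    # Keep only lines that start with a capital letter, then:
    #   1. drop the target part of piped links:  [[Target|label]] -> label]]
    #   2. drop closing link brackets
    #   3. drop remaining opening link brackets
    #   4. replace entities such as &nbsp; with a space
    grep '^[A-Z]' |
    sed -e 's/\[\[[^]|]*|//g' \
        -e 's/\]\]//g' \
        -e 's/\[\[//g' \
        -e 's/&[^; ]*;/ /g'
}

# Try it on a small made-up sample before processing the full dump.
printf '%s\n' \
    'Die [[Kaap die Goeie Hoop|Kaap]] is in [[Suid-Afrika]].' \
    '| a table row that should be skipped' \
    'Koffie&nbsp;en tee' |
    clean_wikitext
```

Once the sample output looks right, the same function could be placed in the real pipeline, for example bzcat afwiki-20070508-pages-articles.xml.bz2 | clean_wikitext > mycorpus.txt.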
Other sources
Some pre-processed corpora can be found here.