Difference between revisions of "Tagger training"

From Apertium

Revision as of 18:54, 19 May 2007

Creating a corpus

Wikipedia

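A basic corpus can be extracted from a Wikipedia dump as follows. The example uses an Afrikaans Wikipedia dump; the pipeline keeps lines that start with a capital letter, adds a blank line after each one, and strips wiki link markup and HTML entities:

$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' |
  sed 's/$/\n/g' | sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' |
  sed 's/\[\[//g' | sed 's/&[#A-Za-z0-9]*;/ /g' > mycorpus.txt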
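
Before running the clean-up over a whole dump, it can help to try the link and entity filters on a single line. The following is a minimal sketch, not part of the original recipe: the function name clean_wiki_markup and the sample sentence are made up for illustration, and the patterns use bounded character classes so that one match cannot swallow the rest of the line.

```shell
# Hypothetical helper (name is an assumption): strips wiki link markup and
# HTML entities from text read on standard input.
clean_wiki_markup() {
  # 1. drop the target part of piped links:  [[target|label]] -> label]]
  # 2. drop closing link brackets
  # 3. drop remaining opening link brackets
  # 4. replace HTML entities such as &quot; with a space
  sed -e 's/\[\[[^]|]*|//g' \
      -e 's/\]\]//g' \
      -e 's/\[\[//g' \
      -e 's/&[#A-Za-z0-9]*;/ /g'
}

# Try it on one sample line (made-up text) before processing a full dump.
printf '%s\n' 'Die [[Kaap|Kaapstad]] is &quot;mooi&quot; in [[Suid-Afrika]].' | clean_wiki_markup
```

Because each entity is replaced by a space, the output can contain double spaces; whether that matters depends on how the corpus is tokenised later.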

Other sources

Some pre-processed corpora can be found here.

Writing a TSX file

Training the tagger
