Difference between revisions of "Tagger training"
Revision as of 15:22, 7 August 2007
Creating a corpus
Wikipedia
A basic corpus can be retrieved from Wikipedia as follows:
$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | sed 's/$/\n/g' | sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' | sed 's/&[#A-Za-z0-9]*;/ /g' > mycorpus.txt
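A quick way to see what the sed stages do is to feed them one sample line. The sketch below is only an illustration: the sample sentence is made up, and the patterns are written to stop at the first `|` or `;` so that only the markup itself is removed.

```shell
# Made-up sample line containing a piped link, a plain link and HTML entities.
sample='Die [[Suid-Afrika|land]] is &quot;mooi&quot; en [[groot]].'

cleaned=$(printf '%s\n' "$sample" |
  sed 's/\[\[[^]|]*|//g' |       # drop the link target, keep the link label
  sed 's/\]\]//g' |              # drop closing link brackets
  sed 's/\[\[//g' |              # drop opening link brackets
  sed 's/&[#A-Za-z0-9]*;/ /g')   # replace each HTML entity with a space

# Prints the sentence with the link markup and entities removed.
printf '%s\n' "$cleaned"
```

Checking a handful of real lines from the dump this way before processing the whole file can save a long re-run.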
Other sources
Some pre-processed corpora can be found here and here.
Writing a TSX file
A .tsx file is a tag definition file: it maps the fine-grained tags from the morphological analyser to coarse tags for the tagger. The DTD is in tagger.dtd, although it is probably easier to look at one of the pre-written files in other language pairs.
The file should be in the language pair directory. For English-Afrikaans, for example, it is called apertium-en-af.en.tsx for the English tagger and apertium-en-af.af.tsx for the Afrikaans tagger.
Training the tagger
Unsupervised
First, make a directory called <lang>-tagger-data. Put the corpus you downloaded in there with a name like <lang>.crp.txt. Make sure the corpus is in raw text format with one sentence per line.
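The directory setup described above can be sketched as follows. The language code `en`, the corpus name `mycorpus.txt` and the two-line stand-in corpus are assumptions for illustration; the scratch directory only keeps the sketch safe to run, and in practice you would work in your language pair directory with your real corpus.

```shell
# Scratch directory so the sketch is safe to run anywhere;
# in practice, cd into your language pair directory instead.
cd "$(mktemp -d)"

lang=en               # assumed language code
corpus=mycorpus.txt   # assumed name of the corpus you prepared

# Stand-in corpus so the sketch is self-contained; use your real file.
printf 'This is a sentence.\nThis is another one.\n' > "$corpus"

# Create <lang>-tagger-data and put the corpus there as <lang>.crp.txt.
mkdir -p "$lang-tagger-data"
cp "$corpus" "$lang-tagger-data/$lang.crp.txt"

# Rough sanity check: with one sentence per line, the line count
# should be close to the number of sentences you expect.
wc -l < "$lang-tagger-data/$lang.crp.txt"
```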
Once your corpus is in place, you need a Makefile that specifies how to generate the probability file. You can grab one from another language package. For apertium-en-af I took the Makefile from apertium-en-ca. The file that you need is called en-ca-unsupervised.make.
Copy it into your main language pair directory under an appropriate name, then edit it and change the variables at the top of the file: BASENAME, LANG1, and LANG2. Everything else should be fine.
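As a rough sketch, the copy-and-edit step could look like this for English-Afrikaans. It assumes the three variables are plain `NAME=value` lines, and the values shown are inferred from file names such as apertium-en-af.en.dix and en-af.automorf.bin rather than taken from the real Makefile, so check them against your copy. The three-line stand-in file exists only to make the sketch runnable.

```shell
# Scratch directory; in practice, work in your language pair directory.
cd "$(mktemp -d)"

# Stand-in for the file taken from apertium-en-ca (the real one is longer).
printf 'BASENAME=apertium-en-ca\nLANG1=en\nLANG2=ca\n' > en-ca-unsupervised.make

# Copy under a name that matches the new pair.
cp en-ca-unsupervised.make en-af-unsupervised.make

# Rewrite the three variables; -i.bak keeps a backup and works with
# both GNU and BSD sed. The values are assumptions for en-af.
sed -i.bak \
    -e 's/^BASENAME *=.*/BASENAME=apertium-en-af/' \
    -e 's/^LANG1 *=.*/LANG1=en/' \
    -e 's/^LANG2 *=.*/LANG2=af/' \
    en-af-unsupervised.make

# Show the result so the edit can be checked by eye.
grep -E '^(BASENAME|LANG1|LANG2)' en-af-unsupervised.make
```

Editing the file by hand works just as well; the sed form is mainly useful when setting up several pairs.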
Now run:
$ make -f en-af-unsupervised.make
and wait... you should get some output like:
Generating en-tagger-data/en.dic
This may take some time. Please, take a cup of coffee and come back later.
apertium-validate-dictionary apertium-en-af.en.dix
apertium-validate-tagger apertium-en-af.en.tsx
lt-expand apertium-en-af.en.dix | grep -v "__REGEXP__" | grep -v ":<:" |\
awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' | apertium-destxt >en.dic.expanded
lt-proc -a en-af.automorf.bin <en.dic.expanded | \
apertium-filter-ambiguity apertium-en-af.en.tsx > en-tagger-data/en.dic