Difference between revisions of "Generating lexical-selection rules"
Revision as of 11:36, 11 March 2010
Preparation
Wikipedia
Wikipedia dumps can be downloaded from download.wikimedia.org.
$ wget http://download.wikimedia.org/iswiki/20100306/iswiki-20100306-pages-articles.xml.bz2
You will need to make a clean corpus from Wikipedia, with roughly one sentence per line. There are scripts in apertium-lex-learner/wikipedia that do this for different Wikipedias; they will probably need some supervision. You will also need the analyser from your language pair, which is passed as the second argument to the script.
$ sh strip-wiki-markup.sh iswiki-20100306-pages-articles.xml.bz2 is-en.automorf.bin > is.crp.txt
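Since the scripts may need supervision, it can help to check how close the output is to one sentence per line. The following is a minimal sketch, not part of apertium-lex-learner: the function name corpus_stats and the default threshold of 80 tokens are assumptions made for illustration. It counts lines and whitespace-separated tokens, and reports how many lines are longer than the threshold, which often points to unsplit paragraphs or leftover markup.

```shell
# Hypothetical helper (not part of apertium-lex-learner): summarise a corpus
# that is supposed to have roughly one sentence per line.
#   $1: corpus file
#   $2: token count above which a line is considered suspiciously long
#       (default 80, an arbitrary assumption)
corpus_stats() {
    LC_ALL=C awk -v max="${2:-80}" '
        {
            lines++                       # one line, ideally one sentence
            tokens += NF                  # whitespace-separated tokens
            if (NF > max) toolong++       # probably more than one sentence
        }
        END {
            avg = (lines > 0) ? tokens / lines : 0
            printf "lines=%d tokens=%d avg=%.1f long=%d\n", lines, tokens, avg, toolong
        }' "$1"
}
```

For example, running corpus_stats is.crp.txt prints a single summary line. A large value for long, or an unusually high average, suggests the stripping script needs adjusting for that Wikipedia.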
Then tag the corpus:
$ cat is.crp.txt | apertium -d ~/source/apertium/trunk/apertium-is-en is-en-tagger > is.tagged.txt
Language model
It is assumed that you have a language model left over from previous NLP experiments, that this model is in IRSTLM binary format, and that it is called en.blm. Instructions on how to make one may be added here in future.