Generating lexical-selection rules

Revision as of 19:05, 11 March 2010

== Preparation ==

=== Wikipedia ===

Wikipedia dumps can be downloaded from download.wikimedia.org.

$ wget http://download.wikimedia.org/iswiki/20100306/iswiki-20100306-pages-articles.xml.bz2

You will need to turn the dump into a clean corpus, with roughly one sentence per line. There are scripts in <code>apertium-lex-learner/wikipedia</code> that do this for different Wikipedias; they will probably need some supervision. You will also need the analyser from your language pair; pass it as the second argument to the script.

$ sh strip-wiki-markup.sh iswiki-20100306-pages-articles.xml.bz2 is-en.automorf.bin > is.crp.txt
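The one-sentence-per-line step can be sketched as follows. This is not the real <code>strip-wiki-markup.sh</code> (which also removes wiki markup and uses the morphological analyser); it is only a minimal illustration of splitting plain text on sentence-final punctuation, assuming GNU sed (BSD sed does not accept <code>\n</code> in the replacement):

```shell
# Minimal sketch only: break plain text into one sentence per line
# by inserting a newline after sentence-final ., ! or ?.
# The real strip-wiki-markup.sh in apertium-lex-learner/wikipedia does far more.
echo "One sentence. Another sentence! A third sentence?" \
  | sed 's/\([.!?]\) /\1\n/g'
```

Heuristics like this break on abbreviations and ellipses, which is one reason the real scripts need some supervision.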

Then tag the corpus:

$ cat is.crp.txt | apertium -d ~/source/apertium/trunk/apertium-is-en is-en-tagger > is.tagged.txt

=== Language model ===

It is assumed you have a language model left over, probably from previous NLP experiments, that this model is in [[IRSTLM]] binary format and is called <code>en.blm</code>. Instructions on how to make one of these may be added here in future.
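In the meantime, something along these lines should produce such a model with the [[IRSTLM]] toolkit. This is a sketch only, assuming IRSTLM is installed and that <code>en.txt</code> is a tokenised English corpus; check the IRSTLM manual for the exact options before relying on it:

```shell
# Sketch, assuming IRSTLM is installed and en.txt is a tokenised corpus.
add-start-end.sh < en.txt > en.se.txt        # add sentence-boundary markers
build-lm.sh -i en.se.txt -n 3 -o en.ilm.gz   # estimate a 3-gram language model
compile-lm en.ilm.gz en.blm                  # compile it to binary format
```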

== Steps ==