Generating lexical-selection rules

== Preparation ==

=== Wikipedia ===

Wikipedia can be downloaded from [http://downloads.wikimedia.org downloads.wikimedia.org].

<pre>
$ wget http://download.wikimedia.org/iswiki/20100306/iswiki-20100306-pages-articles.xml.bz2
</pre>
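
These dumps can be quite large, so if the download is interrupted it can be resumed with wget's <code>-c</code> option:

<pre>
$ wget -c http://download.wikimedia.org/iswiki/20100306/iswiki-20100306-pages-articles.xml.bz2
</pre>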

You will need to turn the dump into a clean corpus, with around one sentence per line. There are scripts in <code>apertium-lex-learner/wikipedia</code> that do this for various Wikipedias; they will probably need some supervision. You will also need the morphological analyser from your language pair, which is passed to the script as its second argument.

<pre>
$ sh strip-wiki-markup.sh iswiki-20100306-pages-articles.xml.bz2 is-en.automorf.bin
</pre>
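
If there is no ready-made script for your Wikipedia, the sketch below shows the general shape of such a pipeline. It is only an illustration, not the contents of <code>strip-wiki-markup.sh</code>: the sed expressions are a rough approximation of markup stripping, the sentence splitting is naive, and the coverage check at the end simply counts forms that the analyser returns as unknown.

<pre>
#!/bin/sh
# Illustrative sketch only -- NOT the real strip-wiki-markup.sh.
# Decompress a Wikipedia dump, crudely strip XML and wiki markup,
# split into roughly one sentence per line, and check how well the
# analyser covers the result.
#
# Usage: sh sketch.sh iswiki-20100306-pages-articles.xml.bz2 is-en.automorf.bin

DUMP="$1"       # compressed Wikipedia dump
ANALYSER="$2"   # morphological analyser from the language pair

# Strip XML tags, reduce [[target|label]] links to their labels, and
# drop bold/italic markup. A real script handles much more than this:
# templates, tables, references, etc.
# The second sed does naive sentence splitting on ., ! and ?
# (relies on GNU sed's \n in the replacement).
bzcat "$DUMP" \
  | sed -e 's/<[^>]*>//g' \
        -e 's/\[\[[^]|]*|/[[/g' \
        -e 's/\[\[//g' -e 's/\]\]//g' \
        -e "s/'''//g" -e "s/''//g" \
  | sed -e 's/\([.!?]\)  */\1\n/g' \
  > corpus.txt

# Unknown forms come back from lt-proc as ^form/*form$, so counting
# the '/*' marks gives a rough measure of unanalysed tokens.
lt-proc "$ANALYSER" < corpus.txt | grep -o '/\*' | wc -l
</pre>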