Setup for working on morphological dictionaries
Revision as of 12:05, 19 November 2013 by Francis Tyers
When working on morphological dictionaries, the most important things are to add words in order of frequency and to be able to check your morphology against a corpus.
You will need:
- a Wikipedia dump (the pages-articles.xml.bz2 file) and Wikipedia Extractor (WikiExtractor.py).
- an Apertium monolingual language directory.
Example
Let's suppose we want to make an Urdu corpus.
Step 1
Download the Apertium Urdu module:
$ svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-urd
Compile it in the usual way.
Next, make a directory for your corpus.
$ mkdir -p urdu/wikipedia
$ cd urdu/wikipedia
Now, download the corpus, and extract the text:
$ wget http://dumps.wikimedia.org/urwiki/20131109/urwiki-20131109-pages-articles.xml.bz2
$ mkdir output
$ bzcat urwiki-20131109-pages-articles.xml.bz2 | python WikiExtractor.py -o output/
$ cat output/*/* | strip_html.py > urd.crp.txt
Note: strip_html.py is a script that removes text between < and > from a text file.
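The strip_html.py script itself is not shown on this page. As a rough stand-in, the same idea can be sketched in shell with sed. The function name strip_tags is hypothetical, and this version assumes that every tag opens and closes on the same line.

```shell
# Hypothetical stand-in for strip_html.py: delete everything between < and >.
# Assumption: each tag opens and closes on a single line.
strip_tags() {
  sed -e 's/<[^>]*>//g'
}

# Example on a made-up line of extractor output
printf '%s\n' '<doc id="1" title="Example">Some <b>text</b> here</doc>' | strip_tags
```

With this in place, the last extraction step could be written as cat output/*/* | strip_tags > urd.crp.txt.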
After you have your corpus you can generate your frequency list:
$ cat urd.crp.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' |\
  cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr | grep -v '[0-9] [0-9]' > urd.hitparade.txt
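To see what the counting part of this pipeline does without running the analyser, the stages after apertium-retxt can be wrapped in a function and fed a hand-written line. This is only a sketch: the function name count_surface_forms is hypothetical, the sample line is made up and simplified, and the sed expression relies on GNU sed extensions (\W and \n in the replacement).

```shell
# Hypothetical helper: count surface forms in analyser output read from stdin.
count_surface_forms() {
  sed 's/\$\W*\^/$\n^/g' |       # put each lexical unit on its own line (GNU sed)
    cut -f2 -d'^' |              # drop everything before the opening ^
    cut -f1 -d'/' |              # keep the surface form, drop the analyses
    sort -f | uniq -c |          # group identical forms and count them
    sort -gr |                   # most frequent first
    grep -v '[0-9] [0-9]'        # drop forms that start with a digit
}

# Made-up, simplified analyser output; *dog marks an unknown word
printf '%s\n' '^a/a<det>$ ^cat/cat<n><sg>$ ^a/a<det>$ ^2013/2013<num>$ ^dog/*dog$' |
  count_surface_forms
```

The form "a" comes out on top with a count of 2, and the token 2013 is removed by the final grep.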
You should also make a small script file called hitparade.sh in this directory. It should contain:
cat urd.hitparade.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt
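Since the aim is to add words by frequency, one possible use of this script is to pick out the frequent words that the analyser does not yet know. The sketch below assumes that unknown words appear with an asterisk after the slash (for example ^dog/*dog$) and that the counts themselves are also passed through the analyser; the exact output depends on the dictionary. The function name unknown_words and the sample lines are made up for illustration.

```shell
# Hypothetical filter: keep lines containing an unknown non-numeric word.
# Assumption: an unknown word is printed as ^form/*form$.
unknown_words() {
  grep '\^[^0-9/]*/\*'
}

# Made-up sample of analysed frequency-list lines
printf '%s\n' \
  '    ^120/120<num>$ ^cat/cat<n><sg>$' \
  '     ^45/45<num>$ ^dog/*dog$' \
  '     ^12/*12$ ^run/run<vblex><inf>$' | unknown_words

# With Apertium installed, a possible invocation would be:
#   sh hitparade.sh | unknown_words | head
```

Only the line for "dog" is kept; the line where just the count is unanalysed is ignored because the pattern skips forms that contain digits.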