Setup for working on morphological dictionaries

From Apertium

The most important things when working on morphological dictionaries are to add words in order of frequency and to be able to check your morphology against a corpus.

You will need:

  • a Wikipedia dump (the pages-articles.xml.bz2 file) and Wikipedia Extractor.
  • an Apertium monolingual language directory.


Let's suppose we want to make an Urdu corpus.

Download the Apertium Urdu module:

$ svn co

Compile it in the usual way:

$ ./
$ ./configure
$ make

Next, make a directory for your corpus.

$ mkdir -p urdu/wikipedia

$ cd urdu/wikipedia

Now, download the corpus, and extract the text:

$ wget

$ mkdir output

$ bzcat urwiki-20131109-pages-articles.xml.bz2| python -o output/

$ cat output/*/* | > urd.crp.txt

$ rm -r output/

Note: the script that the extracted text is piped through removes any text between < and > from a text file.
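As a rough illustration of what this kind of tag stripping does, here is a minimal sed sketch. It is not the script referred to above: the function name strip_tags is hypothetical, and it only handles tags that open and close on the same line.

```shell
# Hypothetical helper: delete every <...> span that starts and ends on one line.
strip_tags() {
  sed 's/<[^>]*>//g'
}

# Sample input made up for illustration; tag-only lines come out empty.
printf '%s\n' '<doc id="1" title="x">' 'Some <b>bold</b> text' '</doc>' | strip_tags
```

Lines that consisted only of a tag are left empty rather than removed, which does not matter for the frequency counts.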

After you have your corpus you can generate your frequency list:

$ cat urd.crp.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' |\
  cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr | grep -v '[0-9] [0-9]' > urd.hitparade.txt
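To see what the counting half of this pipeline does without needing a compiled analyser, the sketch below runs it over a few hand-written lines in the analyser's output format. The sample tokens and the function names sample_analysis and hitparade are made up for illustration; real input would come from lt-proc, already split to one token per line by the sed step.

```shell
# Hypothetical sample: one analysed token per line, '*' marks an unknown word.
sample_analysis() {
  printf '%s\n' \
    '^the/the<det><def>$' '^cat/*cat$' '^the/the<det><def>$' \
    '^sat/*sat$' '^2013/2013<num>$' '^the/the<det><def>$' '^cat/*cat$'
}

# Keep only the surface form, count occurrences, sort by descending count,
# and drop entries whose surface form starts with a digit.
hitparade() {
  cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr | grep -v '[0-9] [0-9]'
}

sample_analysis | hitparade
```

With this sample the list is headed by "the" (3 occurrences), followed by "cat" (2) and "sat" (1); the token "2013" is filtered out by the final grep.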

You should also make a small script file in this directory. It should contain:

cat urd.hitparade.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt

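You will also want a second script, which should look something like the following. As written, it expects the variables LANG, BIN, DIX and DATE to be set beforehand, and the calc program to be installed: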

cat $LANG.crp.txt | cut -f2 | grep -v '>(' | sed 's/&lt;/</g' | sed 's/&gt;/>/g' | apertium-destxt | lt-proc $BIN |\
 apertium-retxt | sed 's/\$\W*\^/$\n^/g' > /tmp/$LANG.coverage.txt

EDICT=`cat $DIX | grep -e '<e lm' | wc -l`;
EPAR=`cat $DIX | grep '<pardef ' | wc -l`;
TOTAL=`cat /tmp/$LANG.coverage.txt | wc -l`
KNOWN=`cat /tmp/$LANG.coverage.txt | grep -v '*' | wc -l`
COV=`calc $KNOWN / $TOTAL`;

echo -e $DATE"\t"$EPAR":"$EDICT"\t"$KNOWN"/"$TOTAL"\t"$COV >> history.log
tail -1 history.log
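
The coverage figure in the script above is simply the number of tokens without a '*' divided by the total number of tokens. The following is a minimal sketch of that calculation using awk for the division instead of calc; the sample tokens and the function names coverage_sample and coverage are hypothetical.

```shell
# Hypothetical sample of per-token analyser output; '*' marks an unknown word.
coverage_sample() {
  printf '%s\n' '^the/the<det><def>$' '^cat/*cat$' '^sat/*sat$' '^on/on<pr>$'
}

# Print KNOWN/TOTAL, a tab, and the ratio to four decimal places.
coverage() {
  awk '{ total++; if (index($0, "*") == 0) known++ }
       END { if (total > 0) printf "%d/%d\t%.4f\n", known, total, known / total }'
}

coverage_sample | coverage
```

Two of the four sample tokens are known, so this prints 2/4 followed by 0.5000.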

And a further script file, containing:

cat /tmp/urd.coverage.txt | cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr  | grep -v '[0-9] [0-9]' > urd.hitparade.txt