Setup for working on morphological dictionaries


The most important thing when working on morphological dictionaries is to add words by frequency and to be able to check your morphology against a corpus.

You will need:

  • a Wikipedia dump (the pages-articles.xml.bz2 file, available from dumps.wikimedia.org) and Wikipedia Extractor.
  • an Apertium monolingual language directory.

Example

Let's suppose we want to make an Urdu corpus.

Step 1

Download the Apertium Urdu module:

$ svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-urd

Compile it in the usual way.
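
For most Apertium language modules, "the usual way" is roughly the following (a sketch; the exact steps can vary, so check the module's README):

$ cd apertium-urd
$ ./autogen.sh
$ make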

Next, make a directory for your corpus.

$ mkdir -p urdu/wikipedia

$ cd urdu/wikipedia

Now download the dump and extract the text:

$ wget http://dumps.wikimedia.org/urwiki/20131109/urwiki-20131109-pages-articles.xml.bz2

$ mkdir output

$ bzcat urwiki-20131109-pages-articles.xml.bz2 | python WikiExtractor.py -o output/

$ cat output/*/* | strip_html.py > urd.crp.txt

Note: strip_html.py is a small script that removes everything between < and > (i.e. any remaining HTML/XML tags) from a text file.
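
If you do not have such a script to hand, a sed one-liner is a rough stand-in (a sketch; it assumes no tag spans a line break):

$ cat output/*/* | sed 's/<[^>]*>//g' > urd.crp.txt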

After you have your corpus, you can generate your frequency list ("hit parade") of surface forms:

$ cat urd.crp.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' |\
  cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr | grep -v '[0-9] [0-9]' > urd.hitparade.txt
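
Stage by stage, the same pipeline with one comment per step (functionally identical; bash accepts a comment after a trailing |):

cat urd.crp.txt |
  apertium-destxt |                      # escape the raw text for the pipeline
  lt-proc /path/to/urd.automorf.bin |    # morphological analysis: ^surface/analyses$
  apertium-retxt |                       # undo the escaping
  sed 's/\$\W*\^/$\n^/g' |               # put each ^...$ unit on its own line
  cut -f2 -d'^' | cut -f1 -d'/' |        # keep only the surface form
  sort -f | uniq -c | sort -gr |         # count the forms, most frequent first
  grep -v '[0-9] [0-9]' > urd.hitparade.txt  # drop tokens that begin with a digit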

You should also make a small script file called hitparade.sh in this directory. It should contain:

#!/bin/sh
cat urd.hitparade.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt
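
Running the script re-analyses the frequency list; unknown forms come out marked with a *, so one way to page through just the gaps (an illustrative invocation, not part of the setup itself) is:

$ sh hitparade.sh | grep '\*' | less

The forms nearest the top are the most frequent ones still missing from the dictionary, i.e. the ones to add first.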