Difference between revisions of "Setup for working on morphological dictionaries"

Revision as of 12:06, 19 November 2013

The most important things when working on morphological dictionaries are to add words in order of frequency and to be able to check your morphology against a corpus.

You will need:

a Wikipedia dump (the pages-articles.xml.bz2 file, available from dumps.wikimedia.org) and Wikipedia Extractor.
an Apertium monolingual language directory.

Example

Let's suppose we want to make an Urdu corpus.

Download the Apertium Urdu module:

$ svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-urd

Compile it in the usual way.

Next, make a directory for your corpus.

$ mkdir -p urdu/wikipedia

$ cd urdu/wikipedia

Now download the Wikipedia dump and extract the text:

$ wget http://dumps.wikimedia.org/urwiki/20131109/urwiki-20131109-pages-articles.xml.bz2

$ mkdir output

$ bzcat urwiki-20131109-pages-articles.xml.bz2 | python WikiExtractor.py -o output/

$ cat output/*/* | strip_html.py > urd.crp.txt

Note: strip_html.py is a script that removes text between < and > from a text file.
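
The page does not include strip_html.py itself, so the following is only a minimal sketch of what such a script might look like, assuming every tag opens and closes on the same line. The names strip_tags and strip_lines are illustrative, not taken from any Apertium tool.

```python
import re

# Matches everything from "<" up to the next ">", e.g. <doc id="1"> or </doc>.
TAG = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Return text with every <...> span removed."""
    return TAG.sub("", text)


def strip_lines(lines):
    """Yield each input line with its tags removed (assumes one-line tags)."""
    for line in lines:
        yield strip_tags(line)
```

To use it as the filter in the pipeline above, the file could end with import sys followed by sys.stdout.writelines(strip_lines(sys.stdin)).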

Once you have your corpus, you can generate your frequency list:

$ cat urd.crp.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' |\
  cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr | grep -v '[0-9] [0-9]' > urd.hitparade.txt
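
To make the pipeline above easier to follow, here is a rough Python sketch of the same idea: take the surface form of each lexical unit in the lt-proc output (the part between ^ and the first /), count the occurrences, and sort by frequency. It is an illustration rather than a replacement for the shell command. The function names are hypothetical, it ignores backslash-escaped characters in the stream format, and it counts case-sensitively.

```python
import re
from collections import Counter

# One lexical unit in lt-proc output looks like ^surface/analysis1/analysis2$.
# Group 1 captures the surface form: everything between "^" and the first "/".
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/[^$]*\$")


def surface_counts(analysed):
    """Count how often each surface form occurs in analysed text."""
    return Counter(m.group(1) for m in LEXICAL_UNIT.finditer(analysed))


def hitparade(analysed):
    """Return (surface, count) pairs, most frequent first.

    Forms that start with a digit are dropped, which roughly mirrors
    the grep -v '[0-9] [0-9]' step of the shell pipeline.
    """
    counts = surface_counts(analysed)
    kept = [(w, c) for w, c in counts.items() if w and not w[0].isdigit()]
    # Sort by descending count, then alphabetically to keep the order stable.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))
```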

You should also make a small script file called hitparade.sh in this directory. It should contain:

cat urd.hitparade.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt
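
lt-proc marks a word it cannot analyse with an asterisk, as in ^word/*word$, so the output of hitparade.sh can be filtered to list the most frequent words still missing from the dictionary. The sketch below is one possible way to do that. It assumes that each line starts with the count written by uniq -c (which may itself have been analysed by lt-proc), and the function name is hypothetical.

```python
import re

# An unknown lexical unit has an analysis starting with "*": ^word/*word$.
UNKNOWN_UNIT = re.compile(r"\^([^/$]*)/\*[^$]*\$")
# The frequency count is assumed to be the first run of ASCII digits on a line.
COUNT = re.compile(r"[0-9]+")


def unknown_entries(lines):
    """Return (count, surface) pairs for unanalysed words, in input order."""
    entries = []
    for line in lines:
        count_match = COUNT.search(line)
        if count_match is None:
            continue  # no frequency count on this line, skip it
        count = int(count_match.group())
        for unit in UNKNOWN_UNIT.finditer(line):
            surface = unit.group(1)
            # The count itself may come back as an unknown unit; ignore it.
            if COUNT.fullmatch(surface):
                continue
            entries.append((count, surface))
    return entries
```

Because the hitparade is already sorted by frequency, the first entries returned are the unknown words most worth adding next.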


@@ Line 9: / Line 9: @@
 Let's suppose we want to make an Urdu corpus.
-===Step 1===
 Download the Apertium Urdu module: