Setup for working on morphological dictionaries

The most important things when working on morphological dictionaries are to add words in order of frequency and to be able to check your morphology against a corpus.

You will need:

  • a Wikipedia dump (the pages-articles.xml.bz2 file, available from dumps.wikimedia.org) and the Wikipedia Extractor script.
  • an Apertium monolingual language directory.

Example

Let's suppose we want to make an Urdu corpus.

Download the Apertium Urdu module:

$ svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-urd

Compile it in the usual way:

$ cd apertium-urd
$ ./autogen.sh
$ ./configure
$ make

Next, make a directory for your corpus.

$ mkdir -p urdu/wikipedia

$ cd urdu/wikipedia

Now, download the corpus, and extract the text:

$ wget http://dumps.wikimedia.org/urwiki/20131109/urwiki-20131109-pages-articles.xml.bz2

$ mkdir output

$ bzcat urwiki-20131109-pages-articles.xml.bz2 | python WikiExtractor.py -o output/

$ cat output/*/* | strip_html.py > urd.crp.txt

$ rm -r output/

Note: strip_html.py is a script that removes text between < and > from a text file.
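
If you don't have strip_html.py to hand, a rough stand-in (assuming, as the note above says, that only text between < and > needs removing) is a sed one-liner:

$ cat output/*/* | sed 's/<[^>]*>//g' > urd.crp.txt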

After you have your corpus you can generate your frequency list:

$ cat urd.crp.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' |\
  cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr | grep -v '[0-9] [0-9]' > urd.hitparade.txt
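
Step by step: apertium-destxt escapes Apertium's reserved characters, lt-proc runs the morphological analyser, apertium-retxt de-escapes the output, and the rest of the pipeline reduces each analysis to its surface form and counts the forms. Here is the same pipeline again as a commented sketch, one stage per line:

cat urd.crp.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt |
  sed 's/\$\W*\^/$\n^/g' |         # one ^surface/analyses$ unit per line
  cut -f2 -d'^' | cut -f1 -d'/' |  # keep only the surface form
  sort -f | uniq -c | sort -gr |   # count forms, most frequent first
  grep -v '[0-9] [0-9]' > urd.hitparade.txt  # drop tokens starting with a digit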

You should also make a small script file called hitparade.sh in this directory. It should contain:

cat urd.hitparade.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt
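
This runs the analyser over the frequency list itself, so the most frequent forms your dictionary still misses stand out: lt-proc prefixes unknown forms with *. Page through the output with, for example:

$ sh hitparade.sh | less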

You will also want a script called coverage.sh; it should look something like this:

DIX=/path/to/apertium-urd/apertium-urd.urd.dix
BIN=/path/to/apertium-urd/urd.automorf.bin
LANG=urd

# Analyse the corpus, then put one ^surface/analyses$ unit per line.
# The two sed calls decode the &lt; and &gt; entities left in the dump.
cat $LANG.crp.txt | cut -f2 | grep -v '>(' | sed 's/&lt;/</g' | sed 's/&gt;/>/g' | apertium-destxt | lt-proc $BIN |\
 apertium-retxt | sed 's/\$\W*\^/$\n^/g' > /tmp/$LANG.coverage.txt

EDICT=`cat $DIX | grep -e '<e lm' | wc -l`;  # entries in the dictionary
EPAR=`cat $DIX | grep '<pardef ' | wc -l`;   # paradigms in the dictionary
TOTAL=`cat /tmp/$LANG.coverage.txt | wc -l`; # tokens in the corpus
KNOWN=`cat /tmp/$LANG.coverage.txt | grep -v '*' | wc -l`; # analysed tokens
COV=`calc $KNOWN / $TOTAL`;
DATE=`date`;

echo -e $DATE"\t"$EPAR":"$EDICT"\t"$KNOWN"/"$TOTAL"\t"$COV >> history.log
tail -1 history.log
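
Run coverage.sh after each batch of additions; every run appends one line to history.log, so you can watch coverage grow over time:

$ sh coverage.sh

Note that calc is not a standard command. If you don't have one, a stand-in such as this shell function (a hypothetical helper, not part of Apertium) at the top of the script will do:

calc () { echo "scale=4; $*" | bc; }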

And a further file called new-parade.sh, which regenerates the frequency list from the analysed corpus that coverage.sh leaves in /tmp, instead of re-running the analyser:

cat /tmp/urd.coverage.txt | cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr | grep -v '[0-9] [0-9]' > urd.hitparade.txt
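
The typical working loop is then: add entries for the most frequent unknown forms to the .dix, recompile, and regenerate the statistics, for example:

$ make -C /path/to/apertium-urd
$ sh coverage.sh && sh new-parade.sh
$ head urd.hitparade.txt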