Setup for working on morphological dictionaries
Latest revision as of 17:59, 19 November 2013
The most important things when working on morphological dictionaries are to add words by frequency and to be able to check your morphology against a corpus.
You will need:
- a Wikipedia dump (the pages-articles.xml.bz2 file, available from dumps.wikimedia.org) and Wikipedia Extractor (WikiExtractor.py).
- an Apertium monolingual language directory.
Example
Let's suppose we want to make an Urdu corpus.
Download the Apertium Urdu module:
$ svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-urd
Compile it in the usual way:
$ ./autogen.sh
$ ./configure
$ make
Next, make a directory for your corpus.
$ mkdir -p urdu/wikipedia
$ cd urdu/wikipedia
Now, download the corpus, and extract the text:
$ wget http://dumps.wikimedia.org/urwiki/20131109/urwiki-20131109-pages-articles.xml.bz2
$ mkdir output
$ bzcat urwiki-20131109-pages-articles.xml.bz2 | python WikiExtractor.py -o output/
$ cat output/*/* | strip_html.py > urd.crp.txt
$ rm -r output/
Note: strip_html.py is a script that removes text between < and > from a text file.
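The page does not include strip_html.py itself. As a rough stand-in, the behaviour described in the note can be sketched with sed. The function name strip_tags is hypothetical, and the sketch assumes each tag opens and closes on the same line, which may not hold for every extractor output.

```shell
# Hypothetical stand-in for strip_html.py (not the original script):
# delete everything between < and >, including the brackets themselves.
# sed works line by line, so a tag that spans several lines is left alone.
strip_tags() {
    sed 's/<[^>]*>//g'
}

# Possible use in place of strip_html.py:
#   cat output/*/* | strip_tags > urd.crp.txt
```

A quick check such as echo 'a <b>bold</b> c' | strip_tags shows whether the tags are being removed as expected before running it over the whole corpus.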
After you have your corpus you can generate your frequency list:
$ cat urd.crp.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' |\
  cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr | grep -v '[0-9] [0-9]' > urd.hitparade.txt
You should also make a small script file called hitparade.sh in this directory. It should contain:
cat urd.hitparade.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt
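Since the aim is to add words by frequency, it can help to see only the words the analyser does not yet know. The sketch below is not part of the original setup. It assumes the analyser output marks unknown words with an asterisk after the slash (for example ^word/*word$), which is the same convention the coverage script relies on when it counts known words. The function name unknown_by_frequency is hypothetical.

```shell
# Hypothetical helper (an assumption, not from the original page).
# Reads analyser output on stdin and prints "count word" for unknown
# words only, most frequent first.
unknown_by_frequency() {
    grep -o '\^[^$]*\$' |   # put each ^...$ lexical unit on its own line
    grep -F '/*' |          # keep units whose analysis starts with *
    cut -d'^' -f2 |         # drop the leading ^
    cut -d'/' -f1 |         # keep only the surface form
    sort | uniq -c |        # count identical surface forms
    sort -rn |              # highest count first
    sed 's/^ *//'           # remove the padding that uniq -c adds
}

# Possible use:
#   cat urd.crp.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin |
#     apertium-retxt | unknown_by_frequency | head -50
```

The words at the top of this list are the ones whose addition to the dictionary should raise coverage the most.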
You will also want a script called coverage.sh; it should look something like:
DIX=/path/to/apertium-urd/apertium-urd.urd.dix
BIN=/path/to/apertium-urd/urd.automorf.bin
LANG=urd
# Analyse the corpus and write one lexical unit per line
cat $LANG.crp.txt | cut -f2 | grep -v '>(' | sed 's/&lt;/</g' | sed 's/&gt;/>/g' | apertium-destxt | lt-proc $BIN |\
  apertium-retxt | sed 's/\$\W*\^/$\n^/g' > /tmp/$LANG.coverage.txt
# Count dictionary entries and paradigms
EDICT=`cat $DIX | grep -e '<e lm' | wc -l`;
EPAR=`cat $DIX | grep '<pardef ' | wc -l`;
# Units without an asterisk were recognised by the analyser
TOTAL=`cat /tmp/$LANG.coverage.txt | wc -l`
KNOWN=`cat /tmp/$LANG.coverage.txt | grep -v '*' | wc -l`
COV=`calc $KNOWN / $TOTAL`;
DATE=`date`;
echo -e $DATE"\t"$EPAR":"$EDICT"\t"$KNOWN"/"$TOTAL"\t"$COV >> history.log
tail -1 history.log
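The coverage figure is simply the number of known units divided by the total number of units. The script above uses the external calc program for the division. If calc is not installed, the same figure could be computed with awk, as in the sketch below. The function name coverage_ratio is hypothetical, and the sketch assumes a file with one analysed unit per line where unknown units contain an asterisk, as in /tmp/urd.coverage.txt.

```shell
# Hypothetical alternative to the calc step (an assumption, not from the
# original page). Prints "known/total<TAB>ratio" for the file given as $1.
coverage_ratio() {
    awk '
        !/\*/ { known++ }   # a line without an asterisk counts as known
        END {
            if (NR == 0) { print "0/0\t0"; exit }   # avoid division by zero
            printf "%d/%d\t%.4f\n", known, NR, known / NR
        }
    ' "$1"
}

# Possible use:
#   coverage_ratio /tmp/urd.coverage.txt
```

For a file of four units of which one is unknown, this prints 3/4 followed by a tab and 0.7500.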
And a further file called new-parade.sh:
cat /tmp/urd.coverage.txt | cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr | grep -v '[0-9] [0-9]' > urd.hitparade.txt