Setup for working on morphological dictionaries
Latest revision as of 17:59, 19 November 2013
The most important things when working on morphological dictionaries are to add words in order of frequency and to be able to check your morphology against a corpus.
You will need:
- a Wikipedia dump (the pages-articles.xml.bz2 file, available from dumps.wikimedia.org) and Wikipedia Extractor (WikiExtractor.py).
- an Apertium monolingual language directory.
Example
Let's suppose we want to make an Urdu corpus.
Download the Apertium Urdu module:
$ svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-urd
Compile it in the usual way:
$ ./autogen.sh
$ ./configure
$ make
Next, make a directory for your corpus.
$ mkdir -p urdu/wikipedia
$ cd urdu/wikipedia
Now, download the corpus, and extract the text:
$ wget http://dumps.wikimedia.org/urwiki/20131109/urwiki-20131109-pages-articles.xml.bz2
$ mkdir output
$ bzcat urwiki-20131109-pages-articles.xml.bz2 | python WikiExtractor.py -o output/
$ cat output/*/* | strip_html.py > urd.crp.txt
$ rm -r output/
Note: strip_html.py is a script that removes text between < and > from a text file.
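The script itself is not shown here, so the following is only a minimal sketch of what it could do, written with sed rather than Python. The function name strip_tags is hypothetical, and the sketch assumes that a tag never spans more than one line.

```shell
# Hypothetical stand-in for strip_html.py: delete every <...> span on each line.
# Assumption: an opening < and its closing > are always on the same line.
strip_tags() {
    sed 's/<[^>]*>//g'
}

# Small demonstration on made-up extractor output.
printf '<doc id="1" title="Example">\nSome <b>text</b> here.\n</doc>\n' | strip_tags
```

With a function like this, the extraction step could read cat output/*/* | strip_tags > urd.crp.txt instead of calling strip_html.py.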
After you have your corpus you can generate your frequency list:
$ cat urd.crp.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' |\
cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr | grep -v '[0-9] [0-9]' > urd.hitparade.txt
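To see what each stage after lt-proc does, the same filters can be run on a few hand-written tokens in the ^surface/analysis$ format. This is a sketch only: the function name hitparade and the sample analyses are made up for illustration, and the sed expression relies on GNU sed (for \W and for \n in the replacement).

```shell
# Turn a stream of ^surface/analysis$ tokens into a frequency list.
hitparade() {
    sed 's/\$\W*\^/$\n^/g' |         # put each ^...$ token on its own line (GNU sed)
        cut -f2 -d'^' |              # drop the leading ^
        cut -f1 -d'/' |              # keep only the surface form
        sort -f | uniq -c |          # count identical surface forms
        sort -gr |                   # most frequent first
        grep -v '[0-9] [0-9]'        # drop entries whose surface form starts with a digit
}

# Made-up analyser output standing in for the real corpus.
printf '^the/the<det>$ ^cat/*cat$ ^cat/*cat$ ^sat/sit<vblex>$\n' | hitparade
```

The most frequent surface form ("cat", seen twice) comes out on the first line, which is the order in which words should be added to the dictionary.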
You should also make a small script file called hitparade.sh in this directory. It should contain:
cat urd.hitparade.txt | apertium-destxt | lt-proc /path/to/urd.automorf.bin | apertium-retxt
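Since unanalysed words are marked with an asterisk in the analyser output, the result of hitparade.sh can be narrowed down to the entries that still need adding. The following is a small sketch under that assumption; the function name only_unknown and the sample lines are invented, and real output will depend on the dictionary.

```shell
# Keep only lines that contain at least one unknown-word mark (*).
only_unknown() {
    grep -F '*'
}

# Made-up lines in the shape hitparade.sh might print.
printf '^2/2<num>$ ^cat/*cat$\n^1/1<num>$ ^the/the<det>$\n' | only_unknown
```

In practice this would be used as sh hitparade.sh | only_unknown | head, to look at the most frequent unknown words first.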
You will also want a script called coverage.sh, which should look something like:
DIX=/path/to/apertium-urd/apertium-urd.urd.dix
BIN=/path/to/apertium-urd/urd.automorf.bin
LANG=urd
cat $LANG.crp.txt | cut -f2 | grep -v '>(' | sed 's/&lt;/</g' | sed 's/&gt;/>/g' | apertium-destxt | lt-proc $BIN |\
apertium-retxt | sed 's/\$\W*\^/$\n^/g' > /tmp/$LANG.coverage.txt
EDICT=`cat $DIX | grep -e '<e lm' | wc -l`;
EPAR=`cat $DIX | grep '<pardef ' | wc -l`;
TOTAL=`cat /tmp/$LANG.coverage.txt | wc -l`
KNOWN=`cat /tmp/$LANG.coverage.txt | grep -v '*' | wc -l`
COV=`calc $KNOWN / $TOTAL`;
DATE=`date`;
echo -e $DATE"\t"$EPAR":"$EDICT"\t"$KNOWN"/"$TOTAL"\t"$COV >> history.log
tail -1 history.log
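The script relies on the calc program for the division, which may not be installed everywhere. As a hedged alternative, the known/total counts and the ratio can be computed in one awk pass over the coverage file. The function name coverage_summary is hypothetical and the sample tokens are made up.

```shell
# Read one analysed token per line; a token without '*' counts as known.
# Prints known/total, a tab, and the coverage ratio to four decimals.
coverage_summary() {
    awk '!/\*/ { known++ }
         END { printf "%d/%d\t%.4f\n", known, NR, (NR ? known / NR : 0) }'
}

# Made-up coverage file contents: two known tokens, two unknown.
printf '^the/the<det>$\n^cat/*cat$\n^sat/sit<vblex>$\n^mat/*mat$\n' | coverage_summary
```

Applied to the real data this would be coverage_summary < /tmp/urd.coverage.txt, and it also avoids a division by zero when the file is empty.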
And a further file called new-parade.sh:
cat /tmp/urd.coverage.txt | cut -f2 -d'^' | cut -f1 -d'/' | sort -f | uniq -c | sort -gr | grep -v '[0-9] [0-9]' > urd.hitparade.txt