Extracting bilingual dictionaries with Giza++

This page documents how to use a parallel corpus and Giza++ to extract bilingual dictionaries for Apertium.

Get your corpus

As an example, let's take the forvaltningsordbok Norwegian–North Sámi corpus. It consists of two files:

  • forvaltningsordbok.nob: sentences in Norwegian, one per line
  • forvaltningsordbok.sme: the North Sámi translations of those sentences, line by line

Check that the two files have the same number of lines:

$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme 
  161837 forvaltningsordbok.nob
  161837 forvaltningsordbok.sme
  323674 total

If the files are not the same length, then you need to go back and check your sentence alignment.
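
A quick way to locate where the alignment has drifted is to walk through the two files in parallel. Here is a minimal Python sketch (not part of the Apertium toolchain; the script name is made up) that prints the line counts and the first point where one side has an empty line and the other does not, a common cause of misalignment:

# check_alignment.py: a minimal sanity check, assuming one sentence per
# line in each file.
import sys

def read_lines(path):
    with open(path, encoding="utf-8") as fh:
        return fh.read().splitlines()

src = read_lines(sys.argv[1])
trg = read_lines(sys.argv[2])
print(len(src), "lines in", sys.argv[1])
print(len(trg), "lines in", sys.argv[2])

# An empty line on only one side usually marks where the drift begins.
for i, (s, t) in enumerate(zip(src, trg), start=1):
    if bool(s.strip()) != bool(t.strip()):
        print("first one-sided empty line:", i)
        break

$ python check_alignment.py forvaltningsordbok.nob forvaltningsordbok.sme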

Process corpus

The raw corpus contains surface forms, but if your languages are morphologically complex you will get better alignments from lemmas and part-of-speech tags than from inflected forms. So the first step is to tag your corpus.

In Apertium the easiest way to do this is to run, e.g.

$ cat forvaltningsordbok.nob | apertium-destxt | lt-proc -w -e nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin |\
    apertium-tagger -g nb-nn.prob > forvaltningsordbok.tagged.nob
$ cat forvaltningsordbok.sme | apertium-destxt | hfst-proc -w sme-nob.automorf.hfst | cg-proc sme-nob.rlx.bin |\
    apertium-tagger -g sme-nob.prob > forvaltningsordbok.tagged.sme

Then remove the unnecessary tags (for example number and tense) using the process-tags.py script, e.g.

$ cat forvaltningsordbok.tagged.nob | python process-tags.py nob.process-relabel > forvaltningsordbok.tagged.clean.nob
$ cat forvaltningsordbok.tagged.sme | python process-tags.py sme.process-relabel > forvaltningsordbok.tagged.clean.sme
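
The relabel files presumably specify which tags to drop or rename. As a rough illustration of what this step does (the tag names below are hypothetical, and the real process-tags.py reads them from the relabel file rather than hardcoding them), a minimal Python version might look like:

# strip_tags.py: drop unwanted tags from tokens of the form
# lemma<tag1><tag2>..., reading from stdin and writing to stdout.
import re
import sys

DROP = {"sg", "pl", "pres", "past"}  # hypothetical: number and tense tags

for line in sys.stdin:
    out = []
    for token in line.split():
        lemma = token.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", token)
        out.append(lemma + "".join("<%s>" % t for t in tags if t not in DROP))
    print(" ".join(out))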

You want to end up with results looking something like this:

$ head forvaltningsordbok.tagged.clean.nob
1<det><qnt> 558,4<det><qnt> million<n><m> krone<n><m> i<pr> *spillemidler til<pr> idrett+formål<n><nt> for<pr> 2010<det><qnt>
søk<n><nt> hos<pr> kulturdepartement<n><nt>
søk<n><nt> på<pr> hel<adj> regjering<n><m> *no
tips<n><nt> en<det><qnt> venn<n><m>

$ head forvaltningsordbok.tagged.clean.sme
1_558,4<Num> miljon<Num> ruvdno<N> spealat+ruhta<N> falástallan+ulbmil<N> 2010<Num> s<N><ABBR>
ohca<N> kulturdepartemeanta<N> siidu<N>
ohcat<V><TV> ollis<A> ráđđehus<N> no<Pcle> *as
cavgilit<V><TV> ustit<N>

Make sure your files are the same length:

$ wc -l forvaltningsordbok.tagged.clean.nob forvaltningsordbok.tagged.clean.sme
  159841 forvaltningsordbok.tagged.clean.nob
  159841 forvaltningsordbok.tagged.clean.sme
  319682 total

Align corpus

Now use the Moses training script, which runs Giza++ under the hood, to align your corpus and produce the probabilistic dictionaries:

$ nohup perl ~/local/bin/scripts-20120109-1229/training/train-model.perl -scripts-root-dir \
 /home/fran/local/bin/scripts-20120109-1229/ -root-dir . -corpus forvaltningsordbok.tagged.clean \
 -f sme -e nob -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
 -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &

Note: Remember to change all the paths in the above command!

You'll need a language model (LM) file, but since the LM is not actually used for anything here, its contents don't matter: copy one from a previous Moses installation, or if you don't have one, just make a file containing a few words.

This takes a while, probably about a day. So leave it running and go and make a soufflé, or chop some wood or something.

Extract lexicon

The interesting files are in model/, particularly lex.e2f and lex.f2e. Each line holds a word pair and its alignment probability:

$ head model/lex.e2f
*planlovens plánalága<N> 0.1666667
*mantallsorganiseringa *SVLa 0.0666667
*mantallsorganiseringa *MagnhildMathisen 0.1428571
*mantallsorganiseringa sámediggejoavku<N> 0.0020040
*mantallsorganiseringa áirras<N> 0.0005531
*mantallsorganiseringa guolástus+ráddi+addit<V><TV> 0.2500000
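
Since these are plain whitespace-separated text files, they are easy to post-process. For instance, here is a minimal Python sketch (not one of the scripts used on this page) that keeps only the highest-probability translation for each word in the first column:

# best_translations.py: read a "word word probability" lexicon and keep
# the best-scoring translation per first-column word.
import sys

best = {}
with open(sys.argv[1], encoding="utf-8") as fh:
    for line in fh:
        parts = line.split()
        if len(parts) != 3:
            continue
        e, f, p = parts[0], parts[1], float(parts[2])
        if e not in best or p > best[e][1]:
            best[e] = (f, p)

for e, (f, p) in sorted(best.items()):
    print(e, f, p)

$ python best_translations.py model/lex.e2f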

Sort the e2f file:

$ cat model/lex.e2f | sort > model/lex.e2f.sorted

Prune with relative frequency

Use the relative-freq.py script to generate relative frequency lists for your in-domain and out-of-domain corpora.
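
A relative frequency is simply a token's count divided by the total number of tokens in the corpus. Purely as an illustration (the real relative-freq.py and its exact output format are not shown on this page), the computation amounts to:

# relfreq.py: print each token's relative frequency, most frequent first.
import sys
from collections import Counter

counts = Counter()
total = 0
with open(sys.argv[1], encoding="utf-8") as fh:
    for line in fh:
        tokens = line.split()
        counts.update(tokens)
        total += len(tokens)

for token, n in counts.most_common():
    print(token, n / total)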

Then use the extract-candidate-terms.py script to assign a confidence to each alignment based on in-domain and out-of-domain relative frequency.

$ python extract-candidate-terms.py lex.e2f.sorted forvaltningsordbok.relfreq.nob no.crp.txt.relfreq forvaltningsordbok.freq.nob no.crp.txt.freq | sort -gr

The trailing sort -gr sorts the output numerically in descending order, so the highest-confidence candidates come first.
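
The idea behind the scoring can be sketched as follows. Note that this is only a guess at the general shape of extract-candidate-terms.py: the input format assumed here (one "token relative-frequency" pair per line) and the scoring formula are assumptions, and the real script takes more arguments than this sketch does:

# score_candidates.py: hypothetical sketch. A pair scores highly when
# its alignment probability is high and the source word is much more
# frequent in-domain than out-of-domain. Formula and formats are
# assumptions, not a description of the real script.
import sys

def load_relfreq(path):
    # Assumed format: "token relative-frequency" per line.
    freqs = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            token, freq = line.split()
            freqs[token] = float(freq)
    return freqs

indomain = load_relfreq(sys.argv[2])
outdomain = load_relfreq(sys.argv[3])
floor = 1e-9  # avoid division by zero for words unseen out-of-domain

with open(sys.argv[1], encoding="utf-8") as fh:
    for line in fh:
        parts = line.split()
        if len(parts) != 3:
            continue
        e, f, prob = parts[0], parts[1], float(parts[2])
        ratio = indomain.get(e, 0.0) / max(outdomain.get(e, 0.0), floor)
        print(prob * ratio, e, f)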