Extracting bilingual dictionaries with Giza++

From Apertium


This page explains how to use a parallel corpus and Giza++ to extract bilingual dictionaries for Apertium.

You need

  • Giza++
  • Moses
  • Morphological analyser/disambiguator for each language
  • Some scripts

Get your corpus

As an example, take the forvaltningsordbok Norwegian–North Sámi corpus. It consists of two files:

  • forvaltningsordbok.nob: A list of sentences in Norwegian
  • forvaltningsordbok.sme: Translations of the previous sentences in North Sámi

Check that the two files have the same number of lines:

$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme 
  161837 forvaltningsordbok.nob
  161837 forvaltningsordbok.sme
  323674 total

If the files are not the same length, then you need to go back and check your sentence alignment.
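
Even when the counts match, individual sentence pairs can still be misaligned. The following is a minimal sketch of a heuristic check, not part of the Apertium tooling: the function name suspicious_pairs and the length-ratio threshold of 3.0 are assumptions chosen for illustration.

```python
def suspicious_pairs(src_lines, trg_lines, max_ratio=3.0):
    """Return 1-based line numbers of sentence pairs that look misaligned.

    A pair is flagged when exactly one side is empty, or when one side has
    more than max_ratio times as many whitespace-separated tokens as the other.
    """
    if len(src_lines) != len(trg_lines):
        raise ValueError("files differ in length: %d vs %d"
                         % (len(src_lines), len(trg_lines)))
    flagged = []
    for number, (src, trg) in enumerate(zip(src_lines, trg_lines), start=1):
        n_src, n_trg = len(src.split()), len(trg.split())
        if (n_src == 0) != (n_trg == 0):
            # One side is empty and the other is not.
            flagged.append(number)
        elif n_src and n_trg and max(n_src, n_trg) / min(n_src, n_trg) > max_ratio:
            # Token counts are too different to be a plausible translation.
            flagged.append(number)
    return flagged


# Toy data; in practice pass the lines read from the .nob and .sme files.
print(suspicious_pairs(["a b", "", "a b c d e f g"], ["x y", "z", "q"]))
```

Flagged line numbers are only hints: a short sentence can legitimately translate to a long one, so inspect the pairs by hand before re-aligning.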

Process corpus

The raw corpus contains surface forms, but if your languages are morphologically complex, it is better to align on lemma and part-of-speech. So the first step is to tag your corpus.

In Apertium the easiest way to do this is to run, e.g.

$ cat forvaltningsordbok.nob | apertium-destxt | lt-proc -w -e nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin |\
    apertium-tagger -g nb-nn.prob > forvaltningsordbok.tagged.nob
$ cat forvaltningsordbok.sme | apertium-destxt | hfst-proc -w sme-nob.automorf.hfst | cg-proc sme-nob.rlx.bin |\
    apertium-tagger -g sme-nob.prob > forvaltningsordbok.tagged.sme

Then use the process-tags.py script to remove the tags that are unnecessary for alignment (for example number and tense), e.g.

$ cat forvaltningsordbok.tagged.nob | python process-tags.py nob.process-relabel > forvaltningsordbok.tagged.clean.nob
$ cat forvaltningsordbok.tagged.sme | python process-tags.py sme.process-relabel > forvaltningsordbok.tagged.clean.sme

You want to end up with results looking something like this:

$ head forvaltningsordbok.tagged.clean.nob
1<det><qnt> 558,4<det><qnt> million<n><m> krone<n><m> i<pr> *spillemidler til<pr> idrett+formål<n><nt> for<pr> 2010<det><qnt>
søk<n><nt> hos<pr> kulturdepartement<n><nt>
søk<n><nt> på<pr> hel<adj> regjering<n><m> *no
tips<n><nt> en<det><qnt> venn<n><m>

$ head forvaltningsordbok.tagged.clean.sme
1_558,4<Num> miljon<Num> ruvdno<N> spealat+ruhta<N> falástallan+ulbmil<N> 2010<Num> s<N><ABBR>
ohca<N> kulturdepartemeanta<N> siidu<N>
ohcat<V><TV> ollis<A> ráđđehus<N> no<Pcle> *as
cavgilit<V><TV> ustit<N>
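
The internals of process-tags.py are not shown on this page. As a rough illustration of the kind of filtering involved, the sketch below deletes a chosen set of tags from lemma<tag> tokens. The function name strip_tags, the tag names sg, pl, ind and def, and the simplified input line are all assumptions made for the example; the real script is driven by the .process-relabel files and may behave differently.

```python
import re

# Matches one Apertium-style tag such as <n> or <pl>.
TAG = re.compile(r"<([^<>]+)>")


def strip_tags(line, unwanted):
    """Delete every <tag> whose name is in `unwanted`; keep lemmas and other tags."""
    return TAG.sub(lambda m: "" if m.group(1) in unwanted else m.group(0), line)


# Hypothetical tagged input; the tag inventory here is assumed, not taken from nb-nn.
print(strip_tags("krone<n><m><pl><ind> i<pr>", {"sg", "pl", "ind", "def"}))
# prints: krone<n><m> i<pr>
```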

Make sure the two cleaned files still have the same number of lines:

$ wc -l forvaltningsordbok.tagged.clean.nob forvaltningsordbok.tagged.clean.sme
  159841 forvaltningsordbok.tagged.clean.nob
  159841 forvaltningsordbok.tagged.clean.sme
  319682 total

Align corpus

Now use Moses to align your corpus in order to get the probabilistic dictionaries:

$ nohup perl ~/local/bin/scripts-20120109-1229/training/train-model.perl -scripts-root-dir \
 /home/fran/local/bin/scripts-20120109-1229/ -root-dir . -corpus forvaltningsordbok.tagged.clean \
 -f sme -e nob -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
 -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &

Note: Remember to change all the paths in the above command!

You'll need a language model (LM) file, but you can copy one from a previous Moses installation. If you don't have one, create a new file and put a few words in it. We won't be using the LM anyway.

This takes a while, probably about a day. So leave it running and go and make a soufflé, or chop some wood or something.

Extract lexicon

The interesting files are in model/, particularly lex.e2f and lex.f2e.

$ head model/lex.e2f
*planlovens plánalága<N> 0.1666667
*mantallsorganiseringa *SVLa 0.0666667
*mantallsorganiseringa *MagnhildMathisen 0.1428571
*mantallsorganiseringa sámediggejoavku<N> 0.0020040
*mantallsorganiseringa áirras<N> 0.0005531
*mantallsorganiseringa guolástus+ráddi+addit<V><TV> 0.2500000

Sort the e2f file:

$ cat model/lex.e2f | sort > model/lex.e2f.sorted
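
For a first look at the lexicon, it can help to keep only the strongest candidate for each word in the first column. The following is a minimal sketch of that idea, not one of the Apertium scripts: the function names and the 0.1 threshold are hypothetical, and the sample lines are copied from the lex.e2f output above.

```python
from collections import defaultdict


def read_lexicon(lines):
    """Parse 'word word probability' lines into {first: [(prob, second), ...]}."""
    table = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        first, second, prob = fields
        table[first].append((float(prob), second))
    return table


def best_candidates(table, min_prob=0.1):
    """For each first-column word keep its most probable partner, if it reaches min_prob."""
    best = {}
    for first, candidates in table.items():
        prob, second = max(candidates)
        if prob >= min_prob:
            best[first] = (second, prob)
    return best


sample = [
    "*planlovens plánalága<N> 0.1666667",
    "*mantallsorganiseringa sámediggejoavku<N> 0.0020040",
    "*mantallsorganiseringa guolástus+ráddi+addit<V><TV> 0.2500000",
]
for first, (second, prob) in sorted(best_candidates(read_lexicon(sample)).items()):
    print(first, second, prob)
```

A high probability on its own does not make a pair a good dictionary entry, as the unknown-word lines starting with * show, which is why the next step adds a frequency-based filter.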

Prune with relative frequency

Use the relative-freq.py script to generate relative frequency lists for an in-domain corpus and an out-of-domain corpus.

Then use the extract-candidate-terms.py script to assign a confidence score to each alignment, based on its in-domain and out-of-domain relative frequency:

$ python extract-candidate-terms.py model/lex.e2f.sorted forvaltningsordbok.relfreq.nob no.crp.txt.relfreq forvaltningsordbok.freq.nob no.crp.txt.freq | sort -gr
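
The scripts themselves are not reproduced here. As a sketch of the underlying idea only, the code below computes relative frequencies for two token lists and the ratio between them, so that words which are much more frequent in the domain corpus than in the general corpus stand out as candidate terms. The function names, the floor parameter and the toy token lists are assumptions for illustration; extract-candidate-terms.py may use a different formula.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each token to its count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def domain_ratio(word, in_domain, out_domain, floor=1e-9):
    """How many times more frequent `word` is in the first table than in the second.

    `floor` avoids division by zero for words unseen in the second table.
    """
    return in_domain.get(word, 0.0) / max(out_domain.get(word, 0.0), floor)


# Toy corpora, invented for the example.
in_rel = relative_frequencies("søknad om søknad tilskudd".split())
out_rel = relative_frequencies("og det er søknad om en og det".split())

print(domain_ratio("søknad", in_rel, out_rel))  # 0.5 / 0.125 = 4.0
print(domain_ratio("om", in_rel, out_rel))      # 0.25 / 0.125 = 2.0
```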