Extracting bilingual dictionaries with Giza++
Latest revision as of 11:53, 26 September 2016
This page explains how to use a parallel corpus and Giza++ to extract bilingual dictionaries for Apertium.
You need
- Giza++
- Moses
- Morphological analyser/disambiguator for each language
- Some scripts
Get your corpus
Let's take for example the forvaltningsordbok Norwegian--North Sámi corpus. It will have two files:
- forvaltningsordbok.nob: A list of sentences in Norwegian
- forvaltningsordbok.sme: Translations of the previous sentences in North Sámi
Check to see if the files are the same length:
$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme
  161837 forvaltningsordbok.nob
  161837 forvaltningsordbok.sme
  323674 total
If the files are not the same length, then you need to go back and check your sentence alignment.
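The same check can be scripted so that it fails loudly inside a larger pipeline. The following is a minimal sketch; the function names `count_lines` and `check_parallel` are illustrative and not part of any Apertium or Moses tool.

```python
def count_lines(lines):
    """Count the items in any iterable of lines (an open file, a list, ...)."""
    return sum(1 for _ in lines)


def check_parallel(src_lines, trg_lines):
    """Return (n_src, n_trg, ok); ok is True when both sides have equally many lines."""
    n_src = count_lines(src_lines)
    n_trg = count_lines(trg_lines)
    return n_src, n_trg, n_src == n_trg
```

To use it on the corpus, open forvaltningsordbok.nob and forvaltningsordbok.sme with `open(..., encoding="utf-8")` and pass the two file objects to `check_parallel`.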
Process corpus
The raw corpus contains surface forms, but if your languages are morphologically complex, it is preferable to align on lemma and part-of-speech. So the first step is to tag your corpus.
In Apertium the easiest way to do this is to run, e.g.
$ cat forvaltningsordbok.nob | apertium-destxt | lt-proc -w -e nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin |\
  apertium-tagger -g nb-nn.prob > forvaltningsordbok.tagged.nob
$ cat forvaltningsordbok.sme | apertium-destxt | hfst-proc -w sme-nob.automorf.hfst | cg-proc sme-nob.rlx.bin |\
  apertium-tagger -g sme-nob.prob > forvaltningsordbok.tagged.sme
Then remove the unnecessary tags (for example number and tense) using the process-tags.py script, e.g.
$ cat forvaltningsordbok.tagged.nob | python process-tags.py nob.process-relabel > forvaltningsordbok.tagged.clean.nob
$ cat forvaltningsordbok.tagged.sme | python process-tags.py sme.process-relabel > forvaltningsordbok.tagged.clean.sme
You want to end up with results looking something like this:
$ head forvaltningsordbok.tagged.clean.nob
1<det><qnt> 558,4<det><qnt> million<n><m> krone<n><m> i<pr> *spillemidler til<pr> idrett+formål<n><nt> for<pr> 2010<det><qnt> søk<n><nt> hos<pr> kulturdepartement<n><nt> søk<n><nt> på<pr> hel<adj> regjering<n><m> *no tips<n><nt> en<det><qnt> venn<n><m>
$ head forvaltningsordbok.tagged.clean.sme
1_558,4<Num> miljon<Num> ruvdno<N> spealat+ruhta<N> falástallan+ulbmil<N> 2010<Num> s<N><ABBR> ohca<N> kulturdepartemeanta<N> siidu<N> ohcat<V><TV> ollis<A> ráđđehus<N> no<Pcle> *as cavgilit<V><TV> ustit<N>
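To illustrate the kind of tag pruning involved, here is a rough sketch in Python. It is not the actual process-tags.py: it assumes tokens already have the form lemma<tag><tag>, and the tag names in `DROP_TAGS` are a hypothetical stand-in for whatever the .process-relabel files specify.

```python
import re

# Hypothetical tags to discard (number and tense); adjust per language.
DROP_TAGS = {"sg", "pl", "pres", "past"}

# Matches a single <tag> and captures its name.
TAG_RE = re.compile(r"<([^<>]+)>")


def strip_tags(token, drop=DROP_TAGS):
    """Remove every <tag> whose name is in `drop`; keep the lemma and other tags."""
    return TAG_RE.sub(lambda m: "" if m.group(1) in drop else m.group(0), token)


def clean_line(line, drop=DROP_TAGS):
    """Apply strip_tags to each whitespace-separated token of a line."""
    return " ".join(strip_tags(tok, drop) for tok in line.split())
```

For example, `clean_line("krone<n><m><pl> i<pr>")` gives `krone<n><m> i<pr>`, which matches the shape of the cleaned output above.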
Make sure your files are the same length:
$ wc -l forvaltningsordbok.tagged.clean.nob forvaltningsordbok.tagged.clean.sme
  159841 forvaltningsordbok.tagged.clean.nob
  159841 forvaltningsordbok.tagged.clean.sme
  319682 total
Align corpus
Now use Moses to align your corpus in order to get the probabilistic dictionaries:
$ nohup perl ~/local/bin/scripts-20120109-1229/training/train-model.perl -scripts-root-dir \
  /home/fran/local/bin/scripts-20120109-1229/ -root-dir . -corpus forvaltningsordbok.tagged.clean \
  -f sme -e nob -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &
Note: Remember to change all the paths in the above command!
You'll need a language model (LM) file, but you can copy one from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.
This takes a while, probably about a day. So leave it running and go and make a soufflé, or chop some wood or something.
Extract lexicon
The interesting files are in model/, particularly the lex.e2f and lex.f2e files.
$ head model/lex.e2f
*planlovens plánalága<N> 0.1666667
*mantallsorganiseringa *SVLa 0.0666667
*mantallsorganiseringa *MagnhildMathisen 0.1428571
*mantallsorganiseringa sámediggejoavku<N> 0.0020040
*mantallsorganiseringa áirras<N> 0.0005531
*mantallsorganiseringa guolástus+ráddi+addit<V><TV> 0.2500000
Sort the e2f file:
$ cat model/lex.e2f | sort > model/lex.e2f.sorted
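Once the file is sorted, all candidates for a given word sit next to each other, which makes it easy to inspect them. As a rough illustration of working with the three-column format shown above (word, word, probability), the sketch below groups the entries and keeps the highest-scoring candidates. The names `read_lex` and `best_candidates`, and the `min_prob` and `top_n` values, are assumptions made for this example and are not part of Moses or the Apertium scripts.

```python
from collections import defaultdict


def read_lex(lines):
    """Parse 'word word probability' lines into {first_word: [(prob, second_word), ...]}."""
    table = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        first, second, prob = parts
        table[first].append((float(prob), second))
    return table


def best_candidates(table, min_prob=0.1, top_n=1):
    """Keep up to top_n candidates per word, highest probability first, with prob >= min_prob."""
    result = {}
    for word, cands in table.items():
        kept = sorted((c for c in cands if c[0] >= min_prob), reverse=True)[:top_n]
        if kept:
            result[word] = [(second, prob) for prob, second in kept]
    return result
```

A probability threshold alone is a crude filter: rare words tend to get inflated scores, which is one reason for the frequency-based pruning step.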
Prune with relative frequency
Use the relative-freq.py script to generate relative frequency lists for the in-domain and out-of-domain corpora.
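For reference, the relative frequency of a token is its count divided by the total number of tokens in the corpus. The sketch below shows that computation for a whitespace-tokenised corpus. It is an illustration only, and the output format of the real relative-freq.py may differ.

```python
from collections import Counter


def relative_frequencies(lines):
    """Return {token: count / total_tokens} for a whitespace-tokenised corpus."""
    counts = Counter(tok for line in lines for tok in line.split())
    total = sum(counts.values())
    if total == 0:
        return {}  # empty corpus: nothing to report
    return {tok: n / total for tok, n in counts.items()}
```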
Use the extract-candidate-terms.py script to assign a confidence score to each alignment based on its in-domain and out-of-domain relative frequencies.
$ python extract-candidate-terms.py lex.e2f.sorted forvaltningsordbok.relfreq.nob no.crp.txt.relfreq forvaltningsordbok.freq.nob no.crp.txt.freq | sort -gr
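One plausible way to combine the two frequency lists is to weight each alignment probability by how much more frequent the word is in the in-domain corpus than in the out-of-domain corpus. The sketch below does this with a simple ratio. This scoring formula is an assumption chosen for illustration, not necessarily what extract-candidate-terms.py computes, and the names `domain_ratio`, `score_alignments` and `floor` are hypothetical.

```python
def domain_ratio(word, in_relfreq, out_relfreq, floor=1e-9):
    """In-domain relative frequency divided by out-of-domain relative frequency.

    `floor` replaces a zero or missing out-of-domain frequency to avoid
    division by zero (an assumption of this sketch).
    """
    return in_relfreq.get(word, 0.0) / max(out_relfreq.get(word, 0.0), floor)


def score_alignments(entries, in_relfreq, out_relfreq):
    """Score (word, translation, prob) entries as prob * domain_ratio(word).

    Returns a list of (score, word, translation), highest score first.
    """
    scored = [
        (prob * domain_ratio(word, in_relfreq, out_relfreq), word, translation)
        for word, translation, prob in entries
    ]
    return sorted(scored, reverse=True)
```

With this weighting, a function word that is equally common in both corpora keeps its original probability, while a domain-specific term is pushed towards the top of the list.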