Extracting bilingual dictionaries with Giza++
Latest revision as of 11:53, 26 September 2016
This page explains how to use a parallel corpus and Giza++ to extract bilingual dictionaries for Apertium.
You need
- Giza++
- Moses
- A morphological analyser and disambiguator for each language
- The scripts process-tags.py, relative-freq.py and extract-candidate-terms.py, used in the steps below
Get your corpus
Take, for example, the forvaltningsordbok Norwegian–North Sámi corpus. It consists of two files:
- forvaltningsordbok.nob: a list of sentences in Norwegian
- forvaltningsordbok.sme: translations of those sentences in North Sámi
Check that the two files have the same number of lines:
$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme
161837 forvaltningsordbok.nob
161837 forvaltningsordbok.sme
323674 total
If the files are not the same length, then you need to go back and check your sentence alignment.
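The same check can be scripted, which is handy when you re-run the pipeline often. The following is only a minimal sketch: the function names are made up for illustration, and it assumes both files are UTF-8 with one sentence per line. Besides comparing line counts, it lists the line numbers where only one side is blank, which is often a sign that the sentence alignment has slipped.

```python
def count_lines(path):
    """Count the lines in a text file (one sentence per line)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def one_sided_empty(path_a, path_b):
    """Return 1-based line numbers where exactly one of the two sides is blank."""
    with open(path_a, encoding="utf-8") as a, open(path_b, encoding="utf-8") as b:
        return [
            number
            for number, (left, right) in enumerate(zip(a, b), start=1)
            if bool(left.strip()) != bool(right.strip())
        ]
```

For example, comparing count_lines("forvaltningsordbok.nob") with count_lines("forvaltningsordbok.sme") should give equal numbers before you go on to the next step.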
Process corpus
The raw corpus contains surface forms, but if your languages are morphologically complex you will usually prefer to align on lemma and part-of-speech. So the first step is to tag your corpus.
In Apertium, the easiest way to do this is to run, for example:
$ cat forvaltningsordbok.nob | apertium-destxt | lt-proc -w -e nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin |\
  apertium-tagger -g nb-nn.prob > forvaltningsordbok.tagged.nob
$ cat forvaltningsordbok.sme | apertium-destxt | hfst-proc -w sme-nob.automorf.hfst | cg-proc sme-nob.rlx.bin |\
  apertium-tagger -g sme-nob.prob > forvaltningsordbok.tagged.sme
Then remove the tags that are not needed for alignment (for example number and tense) using the process-tags.py script:
$ cat forvaltningsordbok.tagged.nob | python process-tags.py nob.process-relabel > forvaltningsordbok.tagged.clean.nob
$ cat forvaltningsordbok.tagged.sme | python process-tags.py sme.process-relabel > forvaltningsordbok.tagged.clean.sme
You want to end up with results looking something like this:
$ head forvaltningsordbok.tagged.clean.nob
1<det><qnt> 558,4<det><qnt> million<n><m> krone<n><m> i<pr> *spillemidler til<pr> idrett+formål<n><nt> for<pr> 2010<det><qnt>
søk<n><nt> hos<pr> kulturdepartement<n><nt>
søk<n><nt> på<pr> hel<adj> regjering<n><m> *no
tips<n><nt> en<det><qnt> venn<n><m>

$ head forvaltningsordbok.tagged.clean.sme
1_558,4<Num> miljon<Num> ruvdno<N> spealat+ruhta<N> falástallan+ulbmil<N> 2010<Num> s<N><ABBR>
ohca<N> kulturdepartemeanta<N> siidu<N>
ohcat<V><TV> ollis<A> ráđđehus<N> no<Pcle> *as
cavgilit<V><TV> ustit<N>
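To make the tag-removal step concrete, here is a rough sketch of the kind of filtering involved. It is not the actual process-tags.py, whose code and relabel-file format are not shown on this page. It assumes tokens of the form lemma<tag><tag>, as in the output above, and the set of tags to drop is a hypothetical example.

```python
import re

# Matches one <tag> in a token such as krone<n><m><pl><ind>.
TAG = re.compile(r"<([^<>]+)>")


def strip_tags(token, drop):
    """Remove every tag whose name is in `drop`; keep all other tags."""
    return TAG.sub(lambda m: "" if m.group(1) in drop else m.group(0), token)


def clean_line(line, drop):
    """Apply strip_tags to each whitespace-separated token of a tagged sentence."""
    return " ".join(strip_tags(token, drop) for token in line.split())


# Hypothetical choice of tags to discard (number and definiteness).
DROP = {"sg", "pl", "ind", "def"}
print(clean_line("krone<n><m><pl><ind> i<pr> *spillemidler", DROP))
```

Unknown words (marked with *) carry no tags, so they pass through unchanged.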
Make sure your files are the same length:
$ wc -l forvaltningsordbok.tagged.clean.nob forvaltningsordbok.tagged.clean.sme
159841 forvaltningsordbok.tagged.clean.nob
159841 forvaltningsordbok.tagged.clean.sme
319682 total
Align corpus
Now use Moses to align your corpus in order to get the probabilistic dictionaries:
$ nohup perl ~/local/bin/scripts-20120109-1229/training/train-model.perl -scripts-root-dir \
  /home/fran/local/bin/scripts-20120109-1229/ -root-dir . -corpus forvaltningsordbok.tagged.clean \
  -f sme -e nob -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &
Note: Remember to change all the paths in the above command!
You'll need a language model (LM) file for this command. You can copy one from a previous Moses installation. If you don't have one, create a file and put a few words in it; we won't be using the LM anyway.
This takes a while, probably about a day. So leave it running and go and make a soufflé, or chop some wood or something.
Extract lexicon
The interesting files are in model/, particularly lex.e2f and lex.f2e. Each line holds a pair of tokens followed by a probability:
$ head model/lex.e2f
*planlovens plánalága<N> 0.1666667
*mantallsorganiseringa *SVLa 0.0666667
*mantallsorganiseringa *MagnhildMathisen 0.1428571
*mantallsorganiseringa sámediggejoavku<N> 0.0020040
*mantallsorganiseringa áirras<N> 0.0005531
*mantallsorganiseringa guolástus+ráddi+addit<V><TV> 0.2500000
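If you want a quick look at the lexicon before pruning it properly, a short script is enough. The sketch below assumes the three-column layout seen in the sample (two tokens and a probability, separated by whitespace). The function names, the 0.1 threshold and the option to skip unknown words (those marked with *) are illustrative assumptions, not part of Moses or of the scripts used on this page.

```python
from collections import defaultdict


def parse_lex(lines):
    """Yield (first, second, probability) from lines of a lex.e2f-style file."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        first, second, prob = fields
        yield first, second, float(prob)


def candidates(lines, min_prob=0.1, skip_unknown=False):
    """Group entries by the first token, keeping those at or above min_prob."""
    table = defaultdict(list)
    for first, second, prob in parse_lex(lines):
        if prob < min_prob:
            continue
        if skip_unknown and (first.startswith("*") or second.startswith("*")):
            continue
        table[first].append((prob, second))
    # Most probable candidates first.
    return {word: sorted(pairs, reverse=True) for word, pairs in table.items()}
```

With a real file you would call it as candidates(open("model/lex.e2f", encoding="utf-8")). Setting skip_unknown=True drops every pair in which either side was not recognised by the analyser.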
Sort the e2f file:
$ cat model/lex.e2f | sort > model/lex.e2f.sorted
Prune with relative frequency
Use the relative-freq.py script to generate relative frequency lists for the in-domain and out-of-domain corpora.
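The invocation of relative-freq.py is not given here, so as a stand-in the following sketch shows what a relative frequency list is: each token's count divided by the total number of tokens in the corpus. It assumes whitespace-separated tokens, one sentence per line. The function name and the return format are assumptions and need not match what relative-freq.py does.

```python
from collections import Counter


def frequencies(lines):
    """Return (counts, relative frequencies) for whitespace-separated tokens."""
    counts = Counter(token for line in lines for token in line.split())
    total = sum(counts.values())
    relative = {token: n / total for token, n in counts.items()} if total else {}
    return counts, relative
```

Running it once over the in-domain corpus and once over an out-of-domain corpus gives the two lists that the next step compares.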
Use the extract-candidate-terms.py script to assign a confidence to each alignment based on the in-domain and out-of-domain relative frequencies, and sort the result with the most confident candidates first:
$ python extract-candidate-terms.py model/lex.e2f.sorted forvaltningsordbok.relfreq.nob no.crp.txt.relfreq forvaltningsordbok.freq.nob no.crp.txt.freq | sort -gr
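The scoring formula used by extract-candidate-terms.py is not described on this page. Purely as an illustration of the idea, the sketch below uses one common approach: weight each alignment probability by how much more frequent the word is in the in-domain corpus than in the out-of-domain corpus. The ratio, the floor value for unseen words and all function names are assumptions, not the script's actual method.

```python
def specificity(word, in_rel, out_rel, floor=1e-9):
    """Ratio of in-domain to out-of-domain relative frequency (floored to avoid dividing by zero)."""
    return in_rel.get(word, 0.0) / max(out_rel.get(word, 0.0), floor)


def rank(entries, in_rel, out_rel):
    """Score (first, second, prob) entries and return them best first, like sort -gr."""
    scored = [
        (prob * specificity(first, in_rel, out_rel), first, second)
        for first, second, prob in entries
    ]
    return sorted(scored, reverse=True)
```

Words that are typical of the domain then rise to the top of the list, while general vocabulary with the same alignment probability sinks.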