Extracting bilingual dictionaries with Giza++
{{TOCD}}
This page explains how to use a parallel corpus and Giza++ to extract bilingual dictionaries for Apertium.
==You need==

* Giza++
* Moses
* A morphological analyser/disambiguator for each language
* Some scripts
==Get your corpus==

Let's take as an example the <code>forvaltningsordbok</code> Norwegian--North Sámi corpus. It has two files:

* <code>forvaltningsordbok.nob</code>: a list of sentences in Norwegian
* <code>forvaltningsordbok.sme</code>: translations of those sentences into North Sámi

Check that the files are the same length:
|||
<pre>
$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme
 161837 forvaltningsordbok.nob
 161837 forvaltningsordbok.sme
 323674 total
</pre>
If the files are not the same length, go back and check your sentence alignment.
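To locate a divergence quickly, a short script can report the first line where the two files fall out of step, for instance where one side is empty while the other is not, a common symptom of a broken alignment. This is just an illustrative sketch, not part of the standard toolchain:

```python
# Sketch: find likely alignment breaks in a sentence-aligned file pair.
# Assumes one sentence per line; flags the first line where exactly one
# side is empty, or where one file ends before the other.
import itertools

def check_alignment(path_a, path_b):
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        for n, (a, b) in enumerate(itertools.zip_longest(fa, fb), start=1):
            if a is None or b is None:
                print("Files differ in length at line", n)
                return False
            if bool(a.strip()) != bool(b.strip()):
                print("Possible alignment break at line", n)
                return False
    return True
```

Run it on the two corpus halves; a reported line number gives you a place to start repairing the alignment.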
|||
==Process corpus==

The raw corpus contains surface forms, but if your languages are morphologically complex you will get better alignments from lemmas and parts of speech. So the first step is to tag your corpus.

In Apertium the easiest way to do this is to run, e.g.
<pre>
$ cat forvaltningsordbok.nob | apertium-destxt | lt-proc -w -e nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin |\
  apertium-tagger -g nb-nn.prob > forvaltningsordbok.tagged.nob
$ cat forvaltningsordbok.sme | apertium-destxt | hfst-proc -w sme-nob.automorf.hfst | cg-proc sme-nob.rlx.bin |\
  apertium-tagger -g sme-nob.prob > forvaltningsordbok.tagged.sme
</pre>
|||
Then remove the unnecessary tags (for example number and tense) using the <code>process-tags.py</code> script, e.g.
<pre>
$ cat forvaltningsordbok.tagged.nob | python process-tags.py nob.process-relabel > forvaltningsordbok.tagged.clean.nob
$ cat forvaltningsordbok.tagged.sme | python process-tags.py sme.process-relabel > forvaltningsordbok.tagged.clean.sme
</pre>
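The exact behaviour of <code>process-tags.py</code> depends on the relabel file, which is not reproduced on this page. As a hypothetical sketch of the idea, assuming a simple relabel format of one tag per line to delete (e.g. <code>&lt;sg&gt;</code>), the core of such a filter might look like:

```python
# Hypothetical sketch of a tag-pruning filter in the spirit of process-tags.py.
# Assumption: the relabel file lists one tag per line to remove, e.g. "<sg>".
import re

def load_droplist(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def prune_line(line, droplist):
    # Tokens look like lemma<tag1><tag2>...; delete any tag in the droplist.
    for tag in re.findall(r"<[^<>]+>", line):
        if tag in droplist:
            line = line.replace(tag, "")
    return line
```

Feeding each corpus line through <code>prune_line</code> would collapse, say, <code>krone&lt;n&gt;&lt;m&gt;&lt;sg&gt;</code> to <code>krone&lt;n&gt;&lt;m&gt;</code>, which reduces data sparsity for the aligner.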
|||
You should end up with results looking something like this:
<pre>
$ head forvaltningsordbok.tagged.clean.nob
1<det><qnt> 558,4<det><qnt> million<n><m> krone<n><m> i<pr> *spillemidler til<pr> idrett+formål<n><nt> for<pr> 2010<det><qnt>
søk<n><nt> hos<pr> kulturdepartement<n><nt>
søk<n><nt> på<pr> hel<adj> regjering<n><m> *no
tips<n><nt> en<det><qnt> venn<n><m>
$ head forvaltningsordbok.tagged.clean.sme
1_558,4<Num> miljon<Num> ruvdno<N> spealat+ruhta<N> falástallan+ulbmil<N> 2010<Num> s<N><ABBR>
ohca<N> kulturdepartemeanta<N> siidu<N>
ohcat<V><TV> ollis<A> ráđđehus<N> no<Pcle> *as
cavgilit<V><TV> ustit<N>
</pre>
|||
Make sure your files are still the same length:

<pre>
$ wc -l forvaltningsordbok.tagged.clean.nob forvaltningsordbok.tagged.clean.sme
 159841 forvaltningsordbok.tagged.clean.nob
 159841 forvaltningsordbok.tagged.clean.sme
 319682 total
</pre>
|||
==Align corpus==

Now use Moses to align your corpus in order to get the probabilistic dictionaries:
|||
<pre>
$ nohup perl ~/local/bin/scripts-20120109-1229/training/train-model.perl -scripts-root-dir \
  /home/fran/local/bin/scripts-20120109-1229/ -root-dir . -corpus forvaltningsordbok.tagged.clean \
  -f sme -e nob -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &
</pre>
|||
Note: remember to change the paths in the above command to match your own system!

You'll need an LM file. You can copy one from a previous Moses installation; if you don't have one, create a file and put a few words in it. The language model won't actually be used for the alignment.

This takes a while, probably about a day, so leave it running and go and make a soufflé, or chop some wood or something.
|||
==Extract lexicon==

The interesting files are in <code>model/</code>, particularly <code>lex.e2f</code> and <code>lex.f2e</code>.
|||
<pre>
$ head model/lex.e2f
*planlovens plánalága<N> 0.1666667
*mantallsorganiseringa *SVLa 0.0666667
*mantallsorganiseringa *MagnhildMathisen 0.1428571
*mantallsorganiseringa sámediggejoavku<N> 0.0020040
*mantallsorganiseringa áirras<N> 0.0005531
*mantallsorganiseringa guolástus+ráddi+addit<V><TV> 0.2500000
</pre>
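Each line in these files holds a word pair and its alignment probability. As a sketch of how you might use them (not one of the standard scripts; the 0.4 threshold is an arbitrary example), you can parse the file and keep only pairs aligned with reasonable confidence:

```python
# Sketch: parse Moses lex file lines, where each line is
# "word_a word_b probability", and keep only pairs whose alignment
# probability is at least a chosen threshold.
def filter_lex(lines, threshold=0.4):
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed lines
        word_a, word_b, prob = fields[0], fields[1], float(fields[2])
        if prob >= threshold:
            pairs.append((word_a, word_b, prob))
    return pairs
```

In the sample output above, only <code>guolástus+ráddi+addit&lt;V&gt;&lt;TV&gt;</code> at 0.25 comes close; in practice you would tune the threshold against the size and noisiness of your corpus.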
|||
Sort the <code>e2f</code> file:

<pre>
$ cat model/lex.e2f | sort > model/lex.e2f.sorted
</pre>
|||
==Prune with relative frequency==

Use the <code>relative-freq.py</code> script to generate relative frequency lists from the in-domain and out-of-domain corpora.
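The script itself is not reproduced on this page; as an illustration of the idea, a relative frequency list simply maps each token to its count divided by the total number of tokens in the corpus:

```python
# Sketch: build a relative frequency list (token -> count / total tokens)
# from a whitespace-tokenised corpus, one sentence per line.
from collections import Counter

def relative_freq(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}
```

You would run this once over the in-domain corpus (here <code>forvaltningsordbok</code>) and once over a general out-of-domain corpus, giving the two lists that the pruning step compares.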
|||
Use the <code>extract-candidate-terms.py</code> script to assign a confidence to each alignment based on in-domain and out-of-domain relative frequency.

<pre>
$ python extract-candidate-terms.py lex.e2f.sorted forvaltningsordbok.relfreq.nob no.crp.txt.relfreq forvaltningsordbok.freq.nob no.crp.txt.freq | sort -gr
</pre>

While waiting: [[Generating_lexical-selection_rules_from_a_parallel_corpus|this page will shorten the waiting time]].
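The internals of <code>extract-candidate-terms.py</code> are not shown on this page. One common way to score such candidates, sketched here as an assumption rather than a description of the actual script, is to weight the alignment probability by how much more frequent the word is in the in-domain corpus than in the out-of-domain corpus:

```python
# Hypothetical sketch of a confidence score for an aligned word pair:
# the alignment probability weighted by the in-domain vs out-of-domain
# relative-frequency ratio (a standard "termhood" cue). Not the actual
# logic of extract-candidate-terms.py.
def candidate_confidence(align_prob, in_relfreq, out_relfreq, floor=1e-9):
    # Words common in domain text but rare in general text score higher;
    # the floor avoids division by zero for words unseen out of domain.
    domain_ratio = in_relfreq / max(out_relfreq, floor)
    return align_prob * domain_ratio
```

Sorting the scored pairs with <code>sort -gr</code>, as in the command above, then puts the most promising dictionary candidates at the top for manual review.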
|||
[[Category:Documentation]]
[[Category:Documentation in English]]
Latest revision as of 11:53, 26 September 2016