Difference between revisions of "Extracting bilingual dictionaries with Giza++"

Latest revision as of 11:53, 26 September 2016

You need

Giza++
Moses
Morphological analyser/disambiguator for each language
Some scripts

Get your corpus

Take the forvaltningsordbok Norwegian–North Sámi corpus as an example. It consists of two files:

forvaltningsordbok.nob: Sentences in Norwegian, one per line
forvaltningsordbok.sme: Translations of those sentences into North Sámi, in the same order

Check that both files have the same number of lines:

$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme 
  161837 forvaltningsordbok.nob
  161837 forvaltningsordbok.sme
  323674 total

If the line counts differ, go back and check your sentence alignment.
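
If you want a bit more than wc -l gives you, a small Python sketch like the following counts the lines on both sides and also flags line numbers where only one side is blank, which is often a sign that the alignment has slipped. The function name check_parallel is made up for this example and is not part of any of the scripts mentioned on this page.

```python
from itertools import zip_longest


def check_parallel(src_lines, trg_lines):
    """Return (n_src, n_trg, suspicious) for two iterables of lines.

    suspicious holds the 1-based numbers of lines where exactly one
    side is blank.
    """
    n_src = n_trg = 0
    suspicious = []
    pairs = zip_longest(src_lines, trg_lines)
    for i, (s, t) in enumerate(pairs, start=1):
        if s is not None:
            n_src += 1
        if t is not None:
            n_trg += 1
        # Only compare when both sides actually have a line i.
        if s is not None and t is not None:
            if bool(s.strip()) != bool(t.strip()):
                suspicious.append(i)
    return n_src, n_trg, suspicious


# Tiny in-memory demo; with real data, pass two open file objects instead.
nob = ["søk hos kulturdepartementet\n", "\n", "tips en venn\n"]
sme = ["oza kulturdepartemeantta siidduin\n", "ohcat\n", "cavgil ustibii\n"]
print(check_parallel(nob, sme))
```

With real files you would open both with encoding="utf-8" and pass the file objects directly, since they iterate line by line.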

Process corpus

The raw corpus contains surface forms, but if your languages are morphologically complex, it is better to align on lemma and part of speech. So the first step is to tag your corpus.

In Apertium the easiest way to do this is to run, e.g.

$ cat forvaltningsordbok.nob | apertium-destxt | lt-proc -w -e nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin |\
    apertium-tagger -g nb-nn.prob > forvaltningsordbok.tagged.nob
$ cat forvaltningsordbok.sme | apertium-destxt | hfst-proc -w sme-nob.automorf.hfst | cg-proc sme-nob.rlx.bin |\
    apertium-tagger -g sme-nob.prob > forvaltningsordbok.tagged.sme

Then use the process-tags.py script to remove the tags that are not needed for alignment (for example number and tense), e.g.

$ cat forvaltningsordbok.tagged.nob | python process-tags.py nob.process-relabel > forvaltningsordbok.tagged.clean.nob
$ cat forvaltningsordbok.tagged.sme | python process-tags.py sme.process-relabel > forvaltningsordbok.tagged.clean.sme

You want to end up with results looking something like this:

$ head forvaltningsordbok.tagged.clean.nob
1<det><qnt> 558,4<det><qnt> million<n><m> krone<n><m> i<pr> *spillemidler til<pr> idrett+formål<n><nt> for<pr> 2010<det><qnt>
søk<n><nt> hos<pr> kulturdepartement<n><nt>
søk<n><nt> på<pr> hel<adj> regjering<n><m> *no
tips<n><nt> en<det><qnt> venn<n><m>

$ head forvaltningsordbok.tagged.clean.sme
1_558,4<Num> miljon<Num> ruvdno<N> spealat+ruhta<N> falástallan+ulbmil<N> 2010<Num> s<N><ABBR>
ohca<N> kulturdepartemeanta<N> siidu<N>
ohcat<V><TV> ollis<A> ráđđehus<N> no<Pcle> *as
cavgilit<V><TV> ustit<N>
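
To give an idea of what the tag-cleaning step does, here is a minimal sketch that deletes a given set of tags from lemma<tag><tag> tokens. It is not the real process-tags.py and does not read the .process-relabel files; the function strip_tags, the DROP set and the tags on the input line are all assumptions made for this illustration.

```python
import re

# Matches one Apertium-style tag such as <n> or <sg>.
TAG = re.compile(r"<([^<>]+)>")


def strip_tags(line, drop):
    """Remove every <tag> whose name is in `drop`; keep everything else."""
    return TAG.sub(lambda m: "" if m.group(1) in drop else m.group(0), line)


# Hypothetical set of tags to discard (number, definiteness, tense).
DROP = {"sg", "pl", "def", "ind", "pres", "pret"}

# Hypothetical tagged input line, already split into lemma<tags> tokens.
line = "søk<n><nt><sg><ind> hos<pr> kulturdepartement<n><nt><sg><def>"
print(strip_tags(line, DROP))
# prints: søk<n><nt> hos<pr> kulturdepartement<n><nt>
```

Which tags to drop depends on the language pair: keep whatever helps distinguish translations (typically part of speech) and drop the rest.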

Make sure the cleaned files still have the same number of lines:

$ wc -l forvaltningsordbok.tagged.clean.nob forvaltningsordbok.tagged.clean.sme
  159841 forvaltningsordbok.tagged.clean.nob
  159841 forvaltningsordbok.tagged.clean.sme
  319682 total

Align corpus

Now use Moses to align your corpus in order to get the probabilistic dictionaries:

$ nohup perl ~/local/bin/scripts-20120109-1229/training/train-model.perl -scripts-root-dir \
 /home/fran/local/bin/scripts-20120109-1229/ -root-dir . -corpus forvaltningsordbok.tagged.clean \
 -f sme -e nob -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
 -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &

Note: Remember to change all the paths in the above command!

The command requires a language model (LM) file; you can copy one from a previous Moses installation. If you don't have one, create a new file and put a few words in it. The LM is not used for anything here anyway.

This takes a while, probably about a day. So leave it running and go and make a soufflé, or chop some wood or something.

Extract lexicon

The interesting files are in model/, particularly lex.e2f and lex.f2e.

$ head model/lex.e2f
*planlovens plánalága<N> 0.1666667
*mantallsorganiseringa *SVLa 0.0666667
*mantallsorganiseringa *MagnhildMathisen 0.1428571
*mantallsorganiseringa sámediggejoavku<N> 0.0020040
*mantallsorganiseringa áirras<N> 0.0005531
*mantallsorganiseringa guolástus+ráddi+addit<V><TV> 0.2500000
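
Each line of these files has three whitespace-separated fields: two tokens and a probability. As a rough sketch of how you might inspect them, the following reads such lines into a table and keeps only the most probable candidates for each first-column token. The names read_lex and best_candidates and the min_prob and top_n cut-offs are assumptions chosen for this example, not part of Moses or Giza++.

```python
from collections import defaultdict


def read_lex(lines):
    """Group 'token token probability' lines by their first token."""
    table = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not have exactly three fields
        first, second, prob = parts
        table[first].append((float(prob), second))
    return table


def best_candidates(table, min_prob=0.1, top_n=3):
    """Keep at most top_n candidates per token with probability >= min_prob."""
    result = {}
    for word, cands in table.items():
        kept = sorted((c for c in cands if c[0] >= min_prob), reverse=True)
        if kept:
            result[word] = kept[:top_n]
    return result


# Sample lines taken from the listing above.
sample = [
    "*planlovens plánalága<N> 0.1666667",
    "*mantallsorganiseringa *SVLa 0.0666667",
    "*mantallsorganiseringa *MagnhildMathisen 0.1428571",
    "*mantallsorganiseringa sámediggejoavku<N> 0.0020040",
    "*mantallsorganiseringa áirras<N> 0.0005531",
    "*mantallsorganiseringa guolástus+ráddi+addit<V><TV> 0.2500000",
]
for word, cands in best_candidates(read_lex(sample)).items():
    for prob, other in cands:
        print(word, other, prob)
```

A fixed probability cut-off like this is crude: rare words easily get high probabilities by chance, which is why the pruning step further down also looks at frequencies.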

Sort the e2f file:

$ cat model/lex.e2f | sort > model/lex.e2f.sorted

Prune with relative frequency

Use the relative-freq.py script to generate relative frequency lists for the in-domain and out-of-domain corpora.
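
The relative frequency of a token is just its count divided by the total number of tokens in the corpus. The sketch below shows that calculation on whitespace-separated tokens. It is not the relative-freq.py script itself, whose options and output format are not described here, and the function name relative_frequencies is made up for the example.

```python
from collections import Counter


def relative_frequencies(lines):
    """Map each whitespace-separated token to count / total tokens."""
    counts = Counter(tok for line in lines for tok in line.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in counts.items()}


# Tiny demo on two cleaned lines from the example corpus.
corpus = [
    "søk<n><nt> hos<pr> kulturdepartement<n><nt>",
    "søk<n><nt> på<pr> hel<adj> regjering<n><m> *no",
]
relfreq = relative_frequencies(corpus)
for tok, rf in sorted(relfreq.items(), key=lambda kv: -kv[1]):
    print(f"{rf:.4f}\t{tok}")
```

Run the same calculation once on the in-domain corpus and once on a large general corpus to get the two lists to compare.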

Use the extract-candidate-terms.py script to assign a confidence to each alignment based on its in-domain and out-of-domain relative frequency.

$ python extract-candidate-terms.py lex.e2f.sorted forvaltningsordbok.relfreq.nob no.crp.txt.relfreq forvaltningsordbok.freq.nob no.crp.txt.freq | sort -gr
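
The idea behind this kind of pruning is that words which are much more frequent in the in-domain corpus than in a general corpus are likely to be domain terms, so their alignments deserve more trust. One plausible way to turn that into a score is sketched below. This is only an illustration under that assumption: the scoring formula, the function names and the floor parameter are invented here and may well differ from what extract-candidate-terms.py actually does.

```python
def domain_ratio(word, rel_in, rel_out, floor=1e-9):
    """In-domain relative frequency divided by out-of-domain relative frequency.

    floor avoids division by zero for words unseen out of domain.
    """
    return rel_in.get(word, 0.0) / max(rel_out.get(word, 0.0), floor)


def score_alignments(entries, rel_in, rel_out):
    """Weight each (source, target, probability) entry by the domain ratio."""
    scored = [
        (prob * domain_ratio(src, rel_in, rel_out), src, trg)
        for src, trg, prob in entries
    ]
    return sorted(scored, reverse=True)


# Hypothetical numbers, for illustration only.
rel_in = {"skatt<n>": 0.01, "og<cnjcoo>": 0.05}
rel_out = {"skatt<n>": 0.001, "og<cnjcoo>": 0.05}
entries = [("skatt<n>", "vearru<N>", 0.5), ("og<cnjcoo>", "ja<CC>", 0.8)]
for score, src, trg in score_alignments(entries, rel_in, rel_out):
    print(f"{score:.3f}\t{src}\t{trg}")
```

Here the domain word ends up ranked above the function word even though its alignment probability is lower, which is the behaviour you want when hunting for terminology.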
