Extracting bilingual dictionaries with Giza++
Revision as of 13:45, 7 February 2012
This is a placeholder for documentation on how to use a parallel corpus and Giza++ to extract bilingual dictionaries for Apertium.
Get your corpus
Let's take, for example, the forvaltningsordbok Norwegian--North Sámi corpus. It will have two files:
forvaltningsordbok.nob
: A list of sentences in Norwegian
forvaltningsordbok.sme
: Translations of the previous sentences in North Sámi
Check to see if the files are the same length:
$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme
161837 forvaltningsordbok.nob
161837 forvaltningsordbok.sme
323674 total
If the files are not the same length, then you need to go back and check your sentence alignment.
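The length check can also be scripted so that later steps refuse to run on a misaligned corpus. The following is only a minimal sketch: the function name `same_length` is hypothetical and not part of Apertium or Giza++.

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
same_length() {
    # tr strips the padding that some wc implementations add around the count
    a=$(wc -l < "$1" | tr -d ' ')
    b=$(wc -l < "$2" | tr -d ' ')
    [ "$a" -eq "$b" ]
}
```

It could be used as `same_length forvaltningsordbok.nob forvaltningsordbok.sme || echo "check your sentence alignment"`.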
Process corpus
The raw corpus contains surface forms, but if your languages are morphologically complex, it is better to align on lemma and part of speech. So the first step is to tag your corpus.
In Apertium, the easiest way to do this is to run, for example:
$ cat forvaltningsordbok.nob | apertium-destxt | lt-proc -w -e nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin |\
  apertium-tagger -g nb-nn.prob > forvaltningsordbok.tagged.nob
$ cat forvaltningsordbok.sme | apertium-destxt | hfst-proc -w sme-nob.automorf.hfst | cg-proc sme-nob.rlx.bin |\
  apertium-tagger -g sme-nob.prob > forvaltningsordbok.tagged.sme
Then remove the tags that are unnecessary for alignment (for example, number and tense).
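The page does not give a command for this step. One possible approach is a `sed` filter like the sketch below. The function name `strip_tags` is hypothetical, and the default tag list (`sg`, `pl`, `def`, `ind`, `pres`, `pret`) is only an illustration; choose the tags to drop by inspecting your own tagged output. The sketch also assumes the tagger output wraps each unit as `^...$` and removes those delimiters. It does not attempt anything else the cleaning step may need.

```shell
# Hypothetical helper: delete selected tags and the ^...$ unit delimiters.
# $1 is an optional |-separated list of tag names; the default is illustrative only.
strip_tags() {
    tags=${1:-'sg|pl|def|ind|pres|pret'}
    sed -E -e "s/<($tags)>//g" -e 's/\^//g' -e 's/\$//g'
}

# Small demonstration on a made-up tagged line
printf '%s\n' '^krone<n><m><pl><ind>$ ^i<pr>$' | strip_tags
```

On real data this might be run as `strip_tags < forvaltningsordbok.tagged.nob > forvaltningsordbok.tagged.clean.nob`, with a different tag list passed as the first argument for the North Sámi file.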
You want to end up with results looking something like this:
$ head forvaltningsordbok.tagged.clean.nob
1<det><qnt> 558,4<det><qnt> million<n><m> krone<n><m> i<pr> *spillemidler til<pr> idrett+formål<n><nt> for<pr> 2010<det><qnt>
søk<n><nt> hos<pr> kulturdepartement<n><nt>
søk<n><nt> på<pr> hel<adj> regjering<n><m> *no
tips<n><nt> en<det><qnt> venn<n><m>
$ head forvaltningsordbok.tagged.clean.sme
1_558,4<Num> miljon<Num> ruvdno<N> spealat+ruhta<N> falástallan+ulbmil<N> 2010<Num> s<N><ABBR>
ohca<N> kulturdepartemeanta<N> siidu<N>
ohcat<V><TV> ollis<A> ráđđehus<N> no<Pcle> *as
cavgilit<V><TV> ustit<N>
Make sure your files are the same length:
$ wc -l forvaltningsordbok.tagged.clean.nob forvaltningsordbok.tagged.clean.sme
159841 forvaltningsordbok.tagged.clean.nob
159841 forvaltningsordbok.tagged.clean.sme
319682 total
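If lines have to be dropped during cleaning (for instance lines that end up empty), they must be dropped from both files at once, or the corpus stops being parallel. The sketch below shows one way to do that. The function name `drop_empty_pairs` and its argument order are hypothetical, and it assumes the sentences contain no tab characters.

```shell
# Hypothetical helper: keep only sentence pairs where both sides are non-empty.
# Arguments: source file, target file, output source file, output target file.
drop_empty_pairs() {
    # paste joins line N of both files with a tab; awk writes each side back out
    paste "$1" "$2" |
        awk -F '\t' -v src="$3" -v trg="$4" \
            '$1 != "" && $2 != "" { print $1 > src; print $2 > trg }'
}
```

Because each pair is written or skipped as a unit, the two output files always have the same number of lines.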
Align corpus
Extract lexicon
Prune with relative frequency
While waiting: This page will shorten the waiting time