Difference between revisions of "Extracting bilingual dictionaries with Giza++"

Revision as of 13:45, 7 February 2012

Get your corpus

Take, for example, the forvaltningsordbok Norwegian–North Sámi corpus. It consists of two files:

forvaltningsordbok.nob: A list of sentences in Norwegian, one per line
forvaltningsordbok.sme: Translations of those sentences into North Sámi, in the same order

Check that both files have the same number of lines:

$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme 
  161837 forvaltningsordbok.nob
  161837 forvaltningsordbok.sme
  323674 total

If the line counts differ, go back and check your sentence alignment.
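The same check can be scripted so that a mismatch stops a processing script early. The following is only a minimal sketch, assuming a POSIX shell; the function name same_length is hypothetical and not part of Giza++ or Apertium.

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
same_length() {
    # The arithmetic expansion strips the padding some wc implementations add.
    left=$(( $(wc -l < "$1") ))
    right=$(( $(wc -l < "$2") ))
    if [ "$left" -ne "$right" ]; then
        echo "Line counts differ: $1 has $left, $2 has $right" >&2
        return 1
    fi
}
```

It could be used as `same_length forvaltningsordbok.nob forvaltningsordbok.sme || exit 1` at the top of a script.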

Process corpus

The raw corpus contains surface forms. If your languages are morphologically complex, it is better to align on lemma and part of speech, so the first step is to tag your corpus.

In Apertium, the easiest way to do this is to run, for example:

$ cat forvaltningsordbok.nob | apertium-destxt | lt-proc -w -e nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin |\
    apertium-tagger -g nb-nn.prob > forvaltningsordbok.tagged.nob
$ cat forvaltningsordbok.sme | apertium-destxt | hfst-proc -w sme-nob.automorf.hfst | cg-proc sme-nob.rlx.bin |\
    apertium-tagger -g sme-nob.prob > forvaltningsordbok.tagged.sme

Then remove the tags that are unnecessary for alignment (for example, number and tense).
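One possible way to do the tag removal is with sed. This is a sketch under assumptions rather than the method used to produce the files on this page: the function name drop_tags is hypothetical, and the tag names passed to it (sg and pl in the usage line) are placeholders that need to be checked against what your taggers actually emit. It only deletes the listed tags and does not touch anything else in the tagger output.

```shell
# Hypothetical helper: delete the listed tags (given without angle brackets)
# from tagged text on standard input, leaving lemmas and other tags in place.
# Tag names are assumed to contain no characters special to sed (such as /).
drop_tags() {
    script=""
    for tag in "$@"; do
        # One substitution per tag, e.g. s/<pl>//g
        script="${script}s/<${tag}>//g;"
    done
    sed -e "$script"
}
```

For example, `drop_tags sg pl < forvaltningsordbok.tagged.nob > forvaltningsordbok.tagged.clean.nob` would turn a token such as `krone<n><m><pl>` into `krone<n><m>`.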

The result should look something like this:

$ head forvaltningsordbok.tagged.clean.nob
1<det><qnt> 558,4<det><qnt> million<n><m> krone<n><m> i<pr> *spillemidler til<pr> idrett+formål<n><nt> for<pr> 2010<det><qnt>
søk<n><nt> hos<pr> kulturdepartement<n><nt>
søk<n><nt> på<pr> hel<adj> regjering<n><m> *no
tips<n><nt> en<det><qnt> venn<n><m>

$ head forvaltningsordbok.tagged.clean.sme
1_558,4<Num> miljon<Num> ruvdno<N> spealat+ruhta<N> falástallan+ulbmil<N> 2010<Num> s<N><ABBR>
ohca<N> kulturdepartemeanta<N> siidu<N>
ohcat<V><TV> ollis<A> ráđđehus<N> no<Pcle> *as
cavgilit<V><TV> ustit<N>

Make sure the cleaned files still have the same number of lines:

$ wc -l forvaltningsordbok.tagged.clean.nob forvaltningsordbok.tagged.clean.sme
  159841 forvaltningsordbok.tagged.clean.nob
  159841 forvaltningsordbok.tagged.clean.sme
  319682 total
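If cleaning leaves some lines empty on one side, deleting them from one file alone would break the line-by-line correspondence. A sketch of one way to drop such pairs from both files at once is given below. It is an illustration only, not necessarily how the files above were produced; the function name drop_empty_pairs is hypothetical, and it assumes the text contains no tab characters.

```shell
# Hypothetical helper: write to OUT1/OUT2 only those line pairs of IN1/IN2
# in which both sides contain at least one non-space character.
# Usage: drop_empty_pairs IN1 IN2 OUT1 OUT2
drop_empty_pairs() {
    paste "$1" "$2" | awk -F '\t' -v out1="$3" -v out2="$4" '
        BEGIN { printf "" > out1; printf "" > out2 }   # create or truncate outputs
        $1 ~ /[^ ]/ && $2 ~ /[^ ]/ { print $1 > out1; print $2 > out2 }
    '
}
```

Because paste pads the shorter input with empty fields, any unpaired trailing lines are dropped as well, so the two output files always have the same number of lines.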

Align corpus

Extract lexicon

Prune with relative frequency

While waiting: This page will shorten the waiting time
