Extracting bilingual dictionaries with Giza++
Revision as of 13:39, 7 February 2012 by Francis Tyers
This is a placeholder for documentation on how to use a parallel corpus and Giza++ to extract bilingual dictionaries for Apertium.
Get your corpus
Let's take the forvaltningsordbok Norwegian--North Sámi corpus as an example. It consists of two files:
forvaltningsordbok.nob
: A list of sentences in Norwegian
forvaltningsordbok.sme
: Translations of the previous sentences in North Sámi
Check that both files have the same number of lines:
$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme
  161837 forvaltningsordbok.nob
  161837 forvaltningsordbok.sme
  323674 total
If the line counts differ, go back and check your sentence alignment: line N of one file must be the translation of line N of the other.
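The line-count check can also be scripted, so that a mismatch is caught before any later step runs. This is only a minimal sketch: the function name same_length is hypothetical, the file names are those of the example corpus, and it assumes a POSIX shell with wc and tr available.

```shell
#!/bin/sh
# Hypothetical helper, not part of Giza++ or Apertium.
# Returns 0 if both files have the same number of lines,
# 1 if the counts differ, 2 if either file cannot be read.
same_length() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # Some wc implementations pad the count with spaces; strip them.
    left=$(wc -l < "$1" | tr -d ' ')
    right=$(wc -l < "$2" | tr -d ' ')
    [ "$left" -eq "$right" ]
}

# File names taken from the example corpus.
status=0
same_length forvaltningsordbok.nob forvaltningsordbok.sme || status=$?
case $status in
    0) echo "Same number of lines" ;;
    1) echo "Line counts differ: check the sentence alignment" ;;
    *) echo "Could not read one of the files" ;;
esac
```

Equal line counts are necessary but not sufficient: the files can have the same length and still be misaligned, so it is worth spot-checking a few line pairs by eye as well.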
Process corpus
Align corpus
Extract lexicon
Prune with relative frequency
While waiting: This page will shorten the waiting time