Difference between revisions of "Extracting bilingual dictionaries with Giza++"


Revision as of 13:39, 7 February 2012

This is a placeholder for documentation on how to use a parallel corpus and Giza++ to extract bilingual dictionaries for Apertium.

Get your corpus

As an example, take the forvaltningsordbok Norwegian (Bokmål)–North Sámi corpus. It consists of two files:

  • forvaltningsordbok.nob: Norwegian sentences, one per line
  • forvaltningsordbok.sme: the North Sámi translation of each sentence, on the same line number as its Norwegian counterpart

Check that both files have the same number of lines:

$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme 
  161837 forvaltningsordbok.nob
  161837 forvaltningsordbok.sme
  323674 total

If the line counts differ, the two files are no longer parallel: go back and fix the sentence alignment before continuing.
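
The line-count check above can also be scripted so that later steps refuse to run on a misaligned corpus. The following is a minimal sketch, not part of any Apertium or Giza++ tooling: the function name check_parallel is hypothetical, and it only compares line counts, so it cannot detect sentences that are paired with the wrong translation.

```shell
# check_parallel SRC TGT  (hypothetical helper, not an Apertium or Giza++ tool)
# Succeeds only if both files are readable and have the same number of lines.
check_parallel() {
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "Cannot read $1 or $2" >&2
        return 2
    fi
    # Arithmetic expansion strips the padding some wc implementations add.
    src_lines=$(( $(wc -l < "$1") ))
    tgt_lines=$(( $(wc -l < "$2") ))
    if [ "$src_lines" -ne "$tgt_lines" ]; then
        echo "Mismatch: $1 has $src_lines lines, $2 has $tgt_lines lines" >&2
        return 1
    fi
    echo "OK: $src_lines sentence pairs"
}
```

It would be called as check_parallel forvaltningsordbok.nob forvaltningsordbok.sme, and its exit status can gate the rest of a processing script.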

Process corpus

Align corpus

Extract lexicon

Prune with relative frequency

While waiting: This page will shorten the waiting time