Ideas for Google Summer of Code/Improved bilingual dictionary induction

From Apertium
Revision as of 10:13, 13 March 2013 by Francis Tyers

Improved bilingual dictionary induction. Use case: you have two morphological analysers and a parallel corpus, but no bilingual dictionary, for example for Romanian-French. You can analyse the corpus and run a word aligner (e.g. GIZA++) to get word alignments, but you cannot make bidix entries directly from those alignments. The user will have to specify models (templates) for bidix entries, each mapping an SL paradigm to a TL paradigm. When the bilingual dictionary is built, any alignment for which there is no template pairing the SL word's paradigm with the TL word's paradigm is discarded. E.g.

fr:

   <e lm="temps">temps<par n="mois__n"/></e>

ro:

   <e lm="timp" a="mioara">timp<par n="timp__n"/></e>
   <e lm="vreme" r="LR">vrem<par n="vrem/e__n"/></e>

Let's suppose we find in the alignments:

temps:timp
temps:vreme

We will need patterns that match forms in mois__n to forms in timp__n, and forms in mois__n to forms in vrem/e__n.
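The filtering step could be sketched roughly as follows. This is only an illustration: the lookup tables, the template set and the function name filter_alignments are hypothetical, each lemma is assumed to have exactly one paradigm, and the temps:cu alignment is invented noise added to show a rejected pair.

```python
# Hypothetical lemma -> paradigm tables, as they might be read from the
# two monolingual dictionaries (assumption: one paradigm per lemma).
SL_PARADIGMS = {"temps": "mois__n"}
TL_PARADIGMS = {"timp": "timp__n", "vreme": "vrem/e__n", "cu": "cu__pr"}

# Paradigm pairs for which the user has written a bidix template
# (assumed to be stored as a simple set of pairs).
TEMPLATES = {("mois__n", "timp__n"), ("mois__n", "vrem/e__n")}


def filter_alignments(alignments, sl_paradigms, tl_paradigms, templates):
    """Keep only aligned lemma pairs whose paradigm pair has a template."""
    kept = []
    for sl_lemma, tl_lemma in alignments:
        pair = (sl_paradigms.get(sl_lemma), tl_paradigms.get(tl_lemma))
        # Unknown lemmas give None and therefore never match a template.
        if pair in templates:
            kept.append((sl_lemma, tl_lemma))
    return kept


# The two alignments from the example, plus one invented noisy alignment.
alignments = [("temps", "timp"), ("temps", "vreme"), ("temps", "cu")]
kept = filter_alignments(alignments, SL_PARADIGMS, TL_PARADIGMS, TEMPLATES)
print(kept)
```

Here temps:cu is dropped because no template pairs mois__n with cu__pr, while the two noun-to-noun alignments survive.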

A script will extract the most frequent SL-TL paradigm combinations from the alignments, so the user can prioritise which templates to write. The bidix would thus be generated incrementally. Much of the noise from the alignment process can be filtered out by disallowing word combinations for which no paradigm-to-paradigm model exists (e.g. mois__n to cu__pr).
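A minimal sketch of what such a frequency script might look like is given below. The input format (a list of aligned lemma pairs), the lemma-to-paradigm tables and the function name paradigm_pair_counts are assumptions made for illustration, and the alignment counts are made up.

```python
from collections import Counter


def paradigm_pair_counts(alignments, sl_paradigms, tl_paradigms):
    """Count how often each (SL paradigm, TL paradigm) pair is aligned."""
    counts = Counter()
    for sl_lemma, tl_lemma in alignments:
        sl_par = sl_paradigms.get(sl_lemma)
        tl_par = tl_paradigms.get(tl_lemma)
        # Skip alignments involving lemmas missing from either analyser.
        if sl_par is None or tl_par is None:
            continue
        counts[(sl_par, tl_par)] += 1
    return counts


# Hypothetical lemma -> paradigm tables (assumption: one paradigm per lemma).
sl_paradigms = {"temps": "mois__n"}
tl_paradigms = {"timp": "timp__n", "vreme": "vrem/e__n", "cu": "cu__pr"}

# Invented alignment data: one pair per aligned token occurrence.
alignments = [
    ("temps", "timp"),
    ("temps", "timp"),
    ("temps", "vreme"),
    ("temps", "cu"),
]

counts = paradigm_pair_counts(alignments, sl_paradigms, tl_paradigms)

# List paradigm pairs from most to least frequent, so that templates
# for the commonest pairs can be written first.
for (sl_par, tl_par), n in counts.most_common():
    print(f"{n}\t{sl_par}\t{tl_par}")
```

With this toy data the mois__n to timp__n pair comes out on top, so its template would be the first one worth writing.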

Tasks

Coding challenge

Frequently asked questions

See also