Difference between revisions of "Ideas for Google Summer of Code/Improved bilingual dictionary induction"

From Apertium

Revision as of 10:13, 13 March 2013

Imagine you have two morphological analysers and a parallel corpus, but no bilingual dictionary, for example for Romanian-French. You can analyse the corpus and use a word aligner (e.g. GIZA++) to get word alignments, but you can't make the bidix entries directly from those alignments. The user will have to specify templates for bidix entries, each of which maps an SL paradigm to a TL paradigm. When building the bilingual dictionary, any alignment for which there is no template pairing the SL word's paradigm with the TL word's paradigm will be discarded. E.g.

fr:

   <e lm="temps">temps<par n="mois__n"/></e>

ro:

   <e lm="timp" a="mioara">timp<par n="timp__n"/></e>
   <e lm="vreme" r="LR">vrem<par n="vrem/e__n"/></e>

Let's suppose we find in the alignments:

temps:timp
temps:vreme

We will need two templates: one that matches forms in mois__n to forms in timp__n, and one that matches forms in mois__n to forms in vrem/e__n.
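The filtering step described above could be sketched roughly as follows. This is only an illustration, not an existing Apertium tool: the lookup tables, the `TEMPLATES` set, the function name `filter_alignments`, and the extra noisy alignment `temps:cu` are all assumptions made up for the example. In practice the paradigm of each lemma would be read from the monolingual dictionaries.

```python
# Hypothetical lookup tables: lemma -> paradigm name, as it would be read
# from the <par n="..."/> of each entry in the monolingual dictionaries.
SL_PARADIGM = {"temps": "mois__n"}
TL_PARADIGM = {"timp": "timp__n", "vreme": "vrem/e__n", "cu": "cu__pr"}

# Hypothetical set of (SL paradigm, TL paradigm) pairs for which the user
# has already written a bidix template.
TEMPLATES = {("mois__n", "timp__n"), ("mois__n", "vrem/e__n")}


def filter_alignments(alignments, sl_par, tl_par, templates):
    """Keep only alignments whose paradigm pair has a template."""
    kept = []
    for pair in alignments:
        sl, tl = pair.split(":", 1)
        # Unknown lemmas give None, which never matches a template.
        key = (sl_par.get(sl), tl_par.get(tl))
        if key in templates:
            kept.append((sl, tl))
    return kept


# "temps:cu" stands in for a noisy alignment (noun aligned to preposition).
alignments = ["temps:timp", "temps:vreme", "temps:cu"]
kept = filter_alignments(alignments, SL_PARADIGM, TL_PARADIGM, TEMPLATES)
print(kept)
```

Here the two noun-noun alignments survive, while `temps:cu` is dropped because no template exists for the pair mois__n and cu__pr.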

There will be a script to extract the most frequent SL-TL paradigm combinations, so the user can prioritise which templates to make. Generating the bidix would therefore be done incrementally. A lot of the noise from the alignment process can be filtered out by discarding word pairs for which no paradigm-paradigm template exists (e.g. mois__n to cu__pr).
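A minimal sketch of what such a frequency script might look like is given below. The function names, the input format (a list of already-split lemma pairs), and the sample data are assumptions for illustration only; the page does not specify how the script should be implemented.

```python
from collections import Counter


def paradigm_pair_counts(alignments, sl_par, tl_par):
    """Count how often each (SL paradigm, TL paradigm) pair is aligned."""
    counts = Counter()
    for sl, tl in alignments:
        # Skip words that are missing from either monolingual dictionary.
        if sl in sl_par and tl in tl_par:
            counts[(sl_par[sl], tl_par[tl])] += 1
    return counts


def missing_templates(counts, templates):
    """List paradigm pairs without a template, most frequent first."""
    return [(pair, n) for pair, n in counts.most_common()
            if pair not in templates]


# Hypothetical data: lemma -> paradigm, and aligned lemma pairs.
sl_par = {"temps": "mois__n"}
tl_par = {"timp": "timp__n", "vreme": "vrem/e__n", "cu": "cu__pr"}
aligned = [("temps", "timp"), ("temps", "timp"),
           ("temps", "vreme"), ("temps", "cu")]

counts = paradigm_pair_counts(aligned, sl_par, tl_par)
# Assume the user has written only one template so far.
todo = missing_templates(counts, {("mois__n", "timp__n")})
for (sl_p, tl_p), n in todo:
    print(n, sl_p, tl_p)
```

The user would then write templates for the pairs at the top of the list that make linguistic sense, rerun the extraction, and repeat. Pairs such as mois__n to cu__pr would simply never get a template.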

Tasks

Coding challenge

Frequently asked questions

See also