Ideas for Google Summer of Code/Improved bilingual dictionary induction
Imagine you have two morphological analysers but no bilingual dictionary. You do, however, have a parallel corpus, for example Romanian-French. You can analyse the corpus and use a word aligner (e.g. Giza++) to get word alignments, but you can't make the bidix entries directly from those alignments. The user will have to specify models (templates) for bidix entries, each of which maps an SL paradigm to a TL paradigm. When building the bilingual dictionary, any alignment for which there is no template pairing the SL word's paradigm with the TL word's paradigm will be discarded. E.g.
<e lm="temps">temps<par n="mois__n"/></e>
<e lm="timp" a="mioara">timp<par n="timp__n"/></e>
<e lm="vreme" r="LR">vrem<par n="vrem/e__n"/></e>
Let's suppose we find in the alignments:
We will need patterns to match forms in mois__n to forms in timp__n, and forms in mois__n to forms in vrem/e__n.
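The filtering step could be sketched roughly as follows. This is only an illustration, not an existing Apertium tool: the tuple layout, the `filter_alignments` helper and the sample alignments are all assumptions made for the example. Only the paradigm names come from this page.

```python
# Hypothetical input: one tuple per word alignment, in the form
# (SL lemma, SL paradigm, TL lemma, TL paradigm). The sample data is invented.
alignments = [
    ("temps", "mois__n", "timp", "timp__n"),
    ("temps", "mois__n", "vreme", "vrem/e__n"),
    ("temps", "mois__n", "cu", "cu__pr"),  # noisy alignment
]

# Paradigm pairs for which the user has already written a template.
templates = {
    ("mois__n", "timp__n"),
    ("mois__n", "vrem/e__n"),
}


def filter_alignments(alignments, templates):
    """Keep only alignments whose (SL paradigm, TL paradigm) pair has a template."""
    return [a for a in alignments if (a[1], a[3]) in templates]


kept = filter_alignments(alignments, templates)
for sl_lemma, sl_par, tl_lemma, tl_par in kept:
    print(sl_lemma, sl_par, "->", tl_lemma, tl_par)
```

Keeping the templates in a set makes each lookup cheap, so the filter stays linear in the number of alignments.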
There will be a script to extract the most frequent SL-TL paradigm combinations, so the user can prioritise which templates to make. Generating the bidix would therefore be done incrementally. A lot of the noise from the alignment process can be filtered out by discarding word pairs for which no paradigm-paradigm model exists (e.g. mois__n to cu__pr).
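A minimal sketch of what such a frequency script might look like is given below. It assumes the alignments have already been reduced to (SL lemma, SL paradigm, TL lemma, TL paradigm) tuples. The function name `rank_missing_templates`, the sample counts, and the choice to list only pairs that lack a template are assumptions for illustration.

```python
from collections import Counter

# Hypothetical input: (SL lemma, SL paradigm, TL lemma, TL paradigm) tuples.
# The counts are invented for the example.
alignments = (
    [("temps", "mois__n", "vreme", "vrem/e__n")] * 3
    + [("temps", "mois__n", "timp", "timp__n")] * 2
    + [("temps", "mois__n", "cu", "cu__pr")]
)

# Paradigm pairs that already have a template.
templates = {("mois__n", "timp__n")}


def rank_missing_templates(alignments, templates):
    """Count paradigm pairs and return those without a template, most frequent first."""
    counts = Counter((sl_par, tl_par) for _, sl_par, _, tl_par in alignments)
    return [(pair, n) for pair, n in counts.most_common() if pair not in templates]


ranked = rank_missing_templates(alignments, templates)
for (sl_par, tl_par), n in ranked:
    print(n, sl_par, tl_par)
```

Rerunning the script after each new template is added gives the incremental workflow described above: frequent, plausible pairs rise to the top of the list, while rare pairs such as mois__n to cu__pr can simply be left without a template.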