Difference between revisions of "Ideas for Google Summer of Code/Improved bilingual dictionary induction"

 
(5 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
The problem is that you can't make the bidix entries directly from that. The user will have to specify templates for bidix entries (see for example [[bilingual dictionary]]) which map SL-paradigm : TL-paradigm.
 
The problem is that you can't make the bidix entries directly from that. The user will have to specify templates for bidix entries (see for example [[bilingual dictionary]]) which map SL-paradigm : TL-paradigm.
   
When building the bilingual dictionary, any alignment for which the SL word's paradigm doesn't have a template with the TL word's paradigm will be discarded. E.g.
+
When building the bilingual dictionary, any alignment for which the SL word's paradigm doesn't have a template with the TL word's paradigm will be discarded.
   
 
;Example
 
;Example
Line 42: Line 42:
 
Let's suppose we find in the alignments:
 
Let's suppose we find in the alignments:
   
  +
<pre>
 
temps:timp
 
temps:timp
 
temps:vreme
 
temps:vreme
  +
</pre>
   
 
We will need patterns to match forms in <code>mois__n</code> to forms in <code>timp__n</code> and forms in <code>mois__n</code> to forms in <code>vrem/e__n</code> . For example:
 
We will need patterns to match forms in <code>mois__n</code> to forms in <code>timp__n</code> and forms in <code>mois__n</code> to forms in <code>vrem/e__n</code> . For example:
Line 61: Line 63:
   
   
There will be a script to extract the most frequent combinations of paradigms in SL-TL, so the user can prioritise which templates to make. So, generating the bidix would be done in an incremental fashion. A lot of the noise of the alignment process can be filtered out by disallowing combinations of words because of no existing paradigm-paradigm model (e.g. mois__n to cu__pr)
+
There will be a script to extract the most frequent combinations of paradigms in SL-TL, so the user can prioritise which templates to make. So, generating the bidix would be done in an incremental fashion. A lot of the noise of the alignment process can be filtered out by disallowing combinations of words because of no existing paradigm-paradigm model (e.g. <code>mois__n</code> to <code>cu__pr</code>)
   
 
==Tasks==
 
==Tasks==
  +
  +
<!--
  +
  +
* Adjust the scripts to allow words which are aligned to more than a certain number of words to be excluded as translations
  +
  +
-->
   
 
==Coding challenge==
 
==Coding challenge==
   
 
* Install [[Apertium]]
 
* Install [[Apertium]]
* Install [[GIZA++]]
+
* Install [[GIZA++]] (or other word alignment tool.. Robert Östling has something)
 
* Generate a word alignment model for a parallel corpus of your choice.
 
* Generate a word alignment model for a parallel corpus of your choice.
* Rewrite the script [https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-forms-server/scripts/generate-bidix-templates.py generate-bidix-templates.py] to use python3/ElementTree.
+
* <s>Rewrite the script [https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-forms-server/scripts/generate-bidix-templates.py generate-bidix-templates.py] to use python3/ElementTree.</s>
  +
** Done here: [https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/generate-bidix-templates.py generate-bidix-templates.py]
   
 
==Frequently asked questions==
 
==Frequently asked questions==
  +
* none yet, ''[[contact|ask us]] something!'' :)
   
 
==See also==
 
==See also==

Latest revision as of 11:05, 27 September 2016

Imagine you have two morphological analysers and a parallel corpus, but no bilingual dictionary. Take Romanian-French as an example: a French-Romanian parallel corpus can be generated from EuroParl. You can analyse the corpus with the morphological analysers and use a word aligner (e.g. GIZA++) to get word alignments.

The problem is that bidix entries can't be made directly from the word alignments. The user has to specify templates for bidix entries (see for example bilingual dictionary), each of which maps an SL paradigm to a TL paradigm.

When building the bilingual dictionary, any alignment for which the SL word's paradigm doesn't have a template with the TL word's paradigm will be discarded.
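
The discarding rule can be sketched in a few lines of Python. This is only an illustration, not the project's script: the function name filter_alignments, the dictionary and set layouts, and the sample data are assumptions, with the words and paradigm names borrowed from the example on this page.

```python
# Hypothetical lookup tables: the paradigm of each lemma, as it would be
# read from the SL (French) and TL (Romanian) monolingual dictionaries.
sl_paradigm = {"temps": "mois__n"}
tl_paradigm = {"timp": "timp__n", "vreme": "vrem/e__n", "cu": "cu__pr"}

# Paradigm pairs for which the user has written a bidix template (assumed).
templates = {("mois__n", "timp__n"), ("mois__n", "vrem/e__n")}

# Word alignments as (SL lemma, TL lemma) pairs; the last one is noise.
alignments = [("temps", "timp"), ("temps", "vreme"), ("temps", "cu")]


def filter_alignments(alignments, sl_paradigm, tl_paradigm, templates):
    """Keep only alignments whose paradigm pair has a template."""
    kept = []
    for sl, tl in alignments:
        # Unknown lemmas give None, which never matches a template.
        pair = (sl_paradigm.get(sl), tl_paradigm.get(tl))
        if pair in templates:
            kept.append((sl, tl))
    return kept


kept = filter_alignments(alignments, sl_paradigm, tl_paradigm, templates)
print(kept)
```

Here the alignment temps:cu is dropped because no template pairs mois__n with cu__pr.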

Example

French dictionary:

    <e lm="temps"><i>temps</i><par n="mois__n"/></e>

temps:temps<n><m><sp>

Romanian dictionary:

    <e lm="timp" a="mioara"><i>timp</i><par n="timp__n"/></e>
    <e lm="vreme" r="LR"><i>vrem</i><par n="vrem/e__n"/></e>

timp:timp<n><nt><sg><nom><ind>
timpul:timp<n><nt><sg><nom><def>
timp:timp<n><nt><sg><dg><ind>
timpului:timp<n><nt><sg><dg><def>
timpuri:timp<n><nt><pl><nom><ind>
timpurile:timp<n><nt><pl><nom><def>
timpuri:timp<n><nt><pl><dg><ind>
timpurilor:timp<n><nt><pl><dg><def>

vreme:vreme<n><f><sg><nom><ind>
vremea:vreme<n><f><sg><nom><def>
vremi:vreme<n><f><sg><dg><ind>
vremii:vreme<n><f><sg><dg><def>
vremi:vreme<n><f><pl><nom><ind>
vremile:vreme<n><f><pl><nom><def>
vremi:vreme<n><f><pl><dg><ind>
vremilor:vreme<n><f><pl><dg><def>
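
The paradigm of each lemma can be read from dictionary entries like the ones above. The following is a minimal sketch using Python's xml.etree.ElementTree. It assumes one <par> element per entry and one entry per lemma; the wrapping <section> element is there only so that the snippet parses as a single document, and the function name lemma_paradigms is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The two Romanian entries from above, wrapped in one root element.
MONODIX = """
<section>
  <e lm="timp" a="mioara"><i>timp</i><par n="timp__n"/></e>
  <e lm="vreme" r="LR"><i>vrem</i><par n="vrem/e__n"/></e>
</section>
"""


def lemma_paradigms(xml_text):
    """Map each lemma (the lm attribute) to the name of its first paradigm."""
    paradigms = {}
    for entry in ET.fromstring(xml_text).iter("e"):
        lemma = entry.get("lm")
        par = entry.find("par")
        # Skip entries without a lemma or without a paradigm reference.
        if lemma is not None and par is not None:
            paradigms[lemma] = par.get("n")
    return paradigms


ro_paradigms = lemma_paradigms(MONODIX)
print(ro_paradigms)
```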

Let's suppose we find in the alignments:

temps:timp
temps:vreme

We will need patterns to match forms in mois__n to forms in timp__n, and forms in mois__n to forms in vrem/e__n. For example:


<e r="LR"><p><l>temps<s n="n"/><s n="m"/><s n="sp"/></l><r>timp<s n="n"/><s n="nt"/><s n="ND"/></r></p></e>
<e r="RL"><p><l>temps<s n="n"/><s n="m"/><s n="sp"/></l><r>timp<s n="n"/><s n="nt"/><s n="sg"/></r></p></e>
<e r="RL"><p><l>temps<s n="n"/><s n="m"/><s n="sp"/></l><r>timp<s n="n"/><s n="nt"/><s n="pl"/></r></p></e>


<e r="LR"><p><l>temps<s n="n"/><s n="m"/><s n="sp"/></l><r>vreme<s n="n"/><s n="f"/><s n="ND"/></r></p></e>
<e r="RL"><p><l>temps<s n="n"/><s n="m"/><s n="sp"/></l><r>vreme<s n="n"/><s n="f"/><s n="sg"/></r></p></e>
<e r="RL"><p><l>temps<s n="n"/><s n="m"/><s n="sp"/></l><r>vreme<s n="n"/><s n="f"/><s n="pl"/></r></p></e>
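
Entries like these could be produced by filling a paradigm-pair template with an aligned lemma pair. The sketch below is one possible way to do that with ElementTree. The template representation (a list of direction, SL tags, TL tags) and the names TEMPLATE_MOIS_TIMP, side and build_entries are assumptions made for this illustration, not the format used by generate-bidix-templates.py.

```python
import xml.etree.ElementTree as ET

# Assumed template for the paradigm pair mois__n : timp__n, mirroring the
# three entries above: (direction restriction, SL symbols, TL symbols).
TEMPLATE_MOIS_TIMP = [
    ("LR", ["n", "m", "sp"], ["n", "nt", "ND"]),
    ("RL", ["n", "m", "sp"], ["n", "nt", "sg"]),
    ("RL", ["n", "m", "sp"], ["n", "nt", "pl"]),
]


def side(tag, lemma, symbols):
    """Build one side (<l> or <r>): the lemma followed by its <s> symbols."""
    elem = ET.Element(tag)
    elem.text = lemma
    for name in symbols:
        ET.SubElement(elem, "s", n=name)
    return elem


def build_entries(sl_lemma, tl_lemma, template):
    """Return one serialised bidix <e> element per row of the template."""
    entries = []
    for direction, sl_tags, tl_tags in template:
        e = ET.Element("e", r=direction)
        p = ET.SubElement(e, "p")
        p.append(side("l", sl_lemma, sl_tags))
        p.append(side("r", tl_lemma, tl_tags))
        entries.append(ET.tostring(e, encoding="unicode"))
    return entries


entries = build_entries("temps", "timp", TEMPLATE_MOIS_TIMP)
for line in entries:
    print(line)
```

The same template rows with the TL gender changed to f would give the vreme entries. Whitespace inside the serialised tags may differ slightly from the hand-written entries.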


There will be a script to extract the most frequent SL-TL paradigm combinations, so the user can prioritise which templates to make. Generating the bidix would thus be done incrementally. Much of the noise from the alignment process can be filtered out by discarding word pairs for which no paradigm-to-paradigm template exists (e.g. mois__n to cu__pr).
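
Counting paradigm combinations could look roughly like the sketch below, which uses collections.Counter. The function name paradigm_pair_counts and the sample alignments (including their frequencies) are invented for illustration. Only the paradigm names come from this page.

```python
from collections import Counter

# Hypothetical lemma-to-paradigm tables and aligned lemma pairs.
sl_paradigm = {"temps": "mois__n"}
tl_paradigm = {"timp": "timp__n", "vreme": "vrem/e__n", "cu": "cu__pr"}
alignments = [
    ("temps", "timp"),
    ("temps", "timp"),
    ("temps", "vreme"),
    ("temps", "cu"),
]


def paradigm_pair_counts(alignments, sl_paradigm, tl_paradigm):
    """Count how often each (SL paradigm, TL paradigm) pair is aligned."""
    counts = Counter()
    for sl, tl in alignments:
        # Alignments with a lemma missing from either dictionary are skipped.
        if sl in sl_paradigm and tl in tl_paradigm:
            counts[(sl_paradigm[sl], tl_paradigm[tl])] += 1
    return counts


counts = paradigm_pair_counts(alignments, sl_paradigm, tl_paradigm)
# Most frequent combinations first: candidates for the next template.
for (sl_par, tl_par), n in counts.most_common():
    print(n, sl_par, tl_par)
```

Pairs at the top of the list are the ones worth writing templates for first. Rare pairs such as mois__n with cu__pr are likely alignment noise.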

Tasks

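Coding challenge

  • Install Apertium
  • Install GIZA++ (or another word alignment tool; Robert Östling has something)
  • Generate a word alignment model for a parallel corpus of your choice.
  • Rewrite the script generate-bidix-templates.py to use python3/ElementTree. (Done here: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/generate-bidix-templates.py)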

Frequently asked questions

  • none yet, ask us something! :)

See also