Anaphora resolution
Apertium has a problem with anaphora resolution.
For example:
- If you have "el seu" in Catalan and are translating to French, it could be "son" (third-person singular) or "leur" (third-person plural). If you are translating to English or Russian, you also need to know the gender of the possessor (его, ее, их).
- If you are generating subject pronouns for a language, you often need to know which gender the pronoun should have, e.g. "ha arribat" could be "He has arrived" or "She has arrived". In this case the usual thing to do is to use the masculine pronoun, but that just relies on masculine pronouns being used more frequently (see the counts below).
Usually this kind of thing is done over parse trees, but Apertium doesn't have parse trees, so we'd need to find another way to do it.
Masculine and feminine subject pronouns in English Wikipedia:
5682787 he
3469648 He
1508156 she
839442 She
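To put the counts above in perspective, the following sketch works out how often a "always masculine" default would be right. It assumes that the text being translated has roughly the same pronoun distribution as English Wikipedia, which may well not hold for other domains.

```python
# Counts copied from the figures above (English Wikipedia).
counts = {"he": 5682787, "He": 3469648, "she": 1508156, "She": 839442}

masculine = counts["he"] + counts["He"]
feminine = counts["she"] + counts["She"]

# Share of masculine forms among all he/she subject pronouns.
masculine_share = masculine / (masculine + feminine)
ratio = masculine / feminine

print(f"masculine share: {masculine_share:.1%}")      # masculine share: 79.6%
print(f"masculine : feminine = {ratio:.1f} : 1")      # masculine : feminine = 3.9 : 1
```

So the masculine default is a reasonable baseline on this kind of text, but it would still pick the wrong pronoun in roughly one case out of five.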
Ideas
Integration with the pipeline
One idea is to allow an extra lexical unit (LU) to be passed into transfer by extending the biltrans format. Take the following sentence:
- Els grups del Parlament han mostrat aquest dimarts el seu suport al batle d'Alaró, Guillem Balboa, que va denunciar que uns desconeguts havien llançat un xai mort al pati de casa seva.
Current output of biltrans:
^El<det><def><m><pl>/The<det><def><m><pl>$ ^grup<n><m><pl>/group<n><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^Parlament<n><m><sg>/Parliament<n><sg>$ ^haver<vbhaver><pri><p3><pl>/have<vbhaver><pri><p3><pl>$ ^mostrar<vblex><pp><m><sg>/show<vblex><pp><m><sg>/display<vblex><pp><m><sg>$ ^aquest<det><dem><m><sg>/this<det><dem><m><sg>$ ^dimarts<n><m><sp>/Tuesday<n><ND>$ ^el seu<det><pos><m><sg>/his<det><pos><m><sg>$ ^suport<n><m><sg>/support<n><sg>$ ^a<pr>/at<pr>/in<pr>/to<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^*batle/*batle$ ^de<pr>/of<pr>/from<pr>$ ^*Alaró/*Alaró$^.<sent>/.<sent>$
Proposed output:
^El<det><def><m><pl>/The<det><def><m><pl>/$ ^grup<n><m><pl>/group<n><pl>/$ ^de<pr>/of<pr>/from<pr>/$ ^el<det><def><m><sg>/the<det><def><m><sg>/$ ^Parlament<n><m><sg>/Parliament<n><sg>/$ ^haver<vbhaver><pri><p3><pl>/have<vbhaver><pri><p3><pl>/$ ^mostrar<vblex><pp><m><sg>/show<vblex><pp><m><sg>/display<vblex><pp><m><sg>/$ ^aquest<det><dem><m><sg>/this<det><dem><m><sg>/$ ^dimarts<n><m><sp>/Tuesday<n><ND>/$ ^el seu<det><pos><m><sg>/his<det><pos><m><sg>/group<n><pl>$ ^suport<n><m><sg>/support<n><sg>/$ ^a<pr>/at<pr>/in<pr>/to<pr>/$ ^el<det><def><m><sg>/the<det><def><m><sg>/$ ^*batle/*batle/$ ^de<pr>/of<pr>/from<pr>/$ ^*Alaró/*Alaró/$^.<sent>/.<sent>/$
Note that an extra / has been added to each lexical unit. Whatever comes after this / is the "extra information", which could be the anaphoric referent. For example, at the moment the translation is:
The groups of the Parliament have shown this Tuesday his support at the *batle of *Alaró...
But if we add the information about the referent:
^el seu<det><pos><m><sg>/his<det><pos><m><sg>/grup<n><m><pl>$
We could get "their" instead of "his". This extra information could be addressed in transfer using something like:
<clip pos="1" side="ref" part="a_gen"/> <clip pos="1" side="ref" part="a_nbr"/>
This would provide minimal disturbance to the existing transfer syntax:
<choose>
  <when>
    <test><equal><clip pos="1" side="ref" part="a_nbr"/><lit-tag v="pl"/></equal></test>
    <let><clip pos="1" side="tl" part="lem"/><lit v="their"/></let>
  </when>
  <when>
    <!-- Should also check for animacy: el cotxe ... el seu ... → the car ... its ... -->
    <test><equal><clip pos="1" side="ref" part="a_gen"/><lit-tag v="m"/></equal></test>
    <let><clip pos="1" side="tl" part="lem"/><lit v="his"/></let>
  </when>
  <when>
    <!-- Should also check for animacy: la màquina ... el seu ... → the machine ... its ... -->
    <test><equal><clip pos="1" side="ref" part="a_gen"/><lit-tag v="f"/></equal></test>
    <let><clip pos="1" side="tl" part="lem"/><lit v="her"/></let>
  </when>
</choose>
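For experimenting with the decision logic outside the transfer module, the rule above can be mirrored in a few lines of Python. This is only a sketch: the function names `tags` and `possessive_for` are made up for illustration, the referent is assumed to arrive as a plain string such as "grup<n><m><pl>", and animacy ("its") is ignored just as in the rule above.

```python
import re

def tags(lu_field):
    """Return the tags of a field such as 'grup<n><m><pl>' as a list."""
    return re.findall(r"<([^<>]+)>", lu_field)

def possessive_for(referent):
    """Pick an English possessive from the referent's number and gender.

    Same order of tests as the <choose> above: plural first, then
    masculine, then feminine. Returns None when no branch matches.
    """
    t = tags(referent)
    if "pl" in t:
        return "their"
    if "m" in t:
        return "his"
    if "f" in t:
        return "her"
    return None

# "grup" is the referent from the example sentence; the two singular
# analyses below are hypothetical and only serve as illustrations.
print(possessive_for("grup<n><m><pl>"))   # their
print(possessive_for("noi<n><m><sg>"))    # his
print(possessive_for("noia<n><f><sg>"))   # her
```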
The actual information could be added either using CG or by developing another tool (e.g. using some kind of machine learning).
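Whichever tool ends up adding the information, it would have to read and write the extended format. The sketch below shows one way that could look, under two assumptions that are not part of any existing Apertium tool: the last /-separated field of a lexical unit is always the reference field (possibly empty), and the stream contains no escaped ^, $ or / characters. All function names are hypothetical.

```python
import itertools
import re

# A lexical unit is everything between ^ and $ (escapes are not handled).
LU = re.compile(r"\^(.*?)\$")

def add_reference_field(stream):
    """Append an empty reference field to every lexical unit."""
    return LU.sub(lambda m: "^" + m.group(1) + "/$", stream)

def set_referent(stream, index, referent):
    """Fill the reference field of the index-th lexical unit (0-based)."""
    position = itertools.count()

    def fill(match):
        body = match.group(1)
        if next(position) == index:
            # Drop the old (possibly empty) last field, then append the referent.
            body = body.rsplit("/", 1)[0] + "/" + referent
        return "^" + body + "$"

    return LU.sub(fill, stream)

def split_unit(body):
    """Split a unit body into (source, [targets...], reference)."""
    fields = body.split("/")
    return fields[0], fields[1:-1], fields[-1]

stream = ("^dimarts<n><m><sp>/Tuesday<n><ND>$ "
          "^el seu<det><pos><m><sg>/his<det><pos><m><sg>$ "
          "^suport<n><m><sg>/support<n><sg>$")

extended = set_referent(add_reference_field(stream), 1, "grup<n><m><pl>")
print(extended)
print(split_unit(LU.findall(extended)[1]))
```

Here the position of the pronoun and its referent are given by hand; deciding them is exactly the part that CG or a separate tool would have to supply.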
Machine learning component
We'd first need a way to mark possible NPs as antecedents. We could use a transfer-pattern-like file format for this, where we have e.g.:
<def-cat n="det">
  <cat-item tags="det.*"/>
</def-cat>
<def-cat n="adj">
  <cat-item tags="adj.*"/>
</def-cat>
<def-cat n="nom">
  <cat-item tags="n.*"/>
</def-cat>
...
<markable>
  <pattern>
    <pattern-item n="det"/>
    <pattern-item n="adj"/>
    <pattern-item n="nom" head="yes"/>
  </pattern>
</markable>
This would then match the sequence and store the "nom" (and its position) as the head and a potential antecedent.
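The matching step described above could be prototyped roughly as follows. This is a sketch under simplifying assumptions, not a proposal for the real implementation: categories are decided by the first tag only (standing in for "det.*", "adj.*" and "n.*"), there is a single hard-coded pattern, and the names `category` and `find_markables` as well as the sample analyses are made up for illustration.

```python
import re

# First tag of an analysis -> category name, standing in for <def-cat>.
FIRST_TAG_TO_CAT = {"det": "det", "adj": "adj", "n": "nom"}

# One markable pattern: (category, is_head) for each pattern item.
PATTERN = [("det", False), ("adj", False), ("nom", True)]

def category(lu):
    """Return the category of an analysis such as 'govern<n><m><sg>'."""
    match = re.search(r"<([^<>]+)>", lu)
    return FIRST_TAG_TO_CAT.get(match.group(1)) if match else None

def find_markables(lus, pattern=PATTERN):
    """Return (head_position, head_lu) for every match of the pattern."""
    cats = [category(lu) for lu in lus]
    wanted = [cat for cat, _ in pattern]
    head_offset = next(i for i, (_, is_head) in enumerate(pattern) if is_head)
    found = []
    for start in range(len(lus) - len(pattern) + 1):
        if cats[start:start + len(pattern)] == wanted:
            head = start + head_offset
            found.append((head, lus[head]))
    return found

# Hypothetical analyses for "el nou govern ha arribat".
sentence = [
    "el<det><def><m><sg>",
    "nou<adj><m><sg>",
    "govern<n><m><sg>",
    "haver<vbhaver><pri><p3><sg>",
    "arribar<vblex><pp><m><sg>",
]
print(find_markables(sentence))   # [(2, 'govern<n><m><sg>')]
```

The stored position is what a later step would use to decide which of several candidate heads is the antecedent of a given pronoun.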
Further reading
- Ruslan Mitkov (1999) "Multilingual Anaphora Resolution". Machine Translation. Volume 14, Issue 3–4, pp 281–299