User:Khannatanmai/GSoC2019Report


Anaphora Resolution in Apertium

For this project, I coded a new module for the Apertium machine translation pipeline. The module performs anaphora resolution on the source text so that the pipeline can produce better translation output.

Documentation

I have prepared detailed documentation for this module. It explains what anaphora resolution means, its role in the Apertium pipeline, the algorithm used, how to use the module, and how to add the module to a new language pair.

You can find this here: Anaphora Resolution Module

Final Module

Here is the final released version of the Anaphora Module. Follow the instructions in the README to install this module on your system and run it.

Repository of Module: https://github.com/apertium/apertium-anaphora

Work Done during GSoC 2019

The above module is the final version of the Anaphora Resolution module, complete with a build system.

You can refer to the issues to see how features were implemented and discussed during GSoC: https://github.com/apertium/apertium-anaphora/issues

During GSoC, this repository was used: https://github.com/khannatanmai/apertium-anaphora (it is linked here only to highlight the peripheral work done during GSoC).

Links to Work Done

Code for the module

Main Repository: https://github.com/apertium/apertium-anaphora

Changes made to Apertium Code

These changes accommodate the new module in the pipeline.

Pull Request: https://github.com/apertium/apertium/pull/55

Changes made to Apertium Spanish-English Language Pair

These changes were made to test the new module on the Spanish-English language pair.

Pull Request: https://github.com/apertium/apertium-eng-spa/pull/13

List of Commits made during GSoC to the above repositories

https://apertium.projectjj.com/gsoc2019/khannatanmai/khannatanmai.html

Evaluation

The Anaphora Resolution module was tested on multiple language pairs with some basic indicators. The results below come from a manual evaluation.

Spanish - English

Spanish has a possessive determiner, su, which can translate to his/her/its in English, so we need to resolve it as an anaphor.

The Anaphora Resolution module was run on a corpus of 1000 sentences from Europarl, using this arx file.

Out of these 1000 sentences, 258 sentences had at least one possessive determiner. The translations of these sentences with and without the Anaphora Resolution module in the pipeline were evaluated comparatively. The results are as follows:

Results

  • No Change, Correct: Anaphora Resolution didn't change the anaphor, and it is correct.
  • No Change, Incorrect: Anaphora Resolution didn't change the anaphor, and it is incorrect, i.e. it should have changed.
  • Change, Correct: Anaphora Resolution changed the anaphor, and it is now correct (was incorrect earlier).
  • Change, Incorrect: Anaphora Resolution changed the anaphor, and it is now incorrect (was correct earlier).

No Change, Correct   No Change, Incorrect   Change, Correct   Change, Incorrect
33                   53                     32                2

Number of anaphors translated correctly without the Anaphora Resolution module and with:

Total 3rd Person Anaphors   Correct Without Anaphora Resolution   Correct With Anaphora Resolution
120                         35                                    65

Accuracy of Anaphora Resolution with the module on Spa-Eng: 54.17%

Accuracy of Anaphora Resolution without the module on Spa-Eng: 29.17%


Note: Out of 258 sentences, 120 sentences had third-person pronouns. The rest had first- or second-person pronouns, which were already being translated correctly and are largely outside the scope of this module.
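As a check on the arithmetic, the two accuracy figures can be derived from the four result categories. The sketch below is only an illustration: the function name accuracy_summary and its dictionary keys are hypothetical and are not part of the module. It assumes that an anaphor counted as "Change, Incorrect" was correct before the module ran, as the category definitions state.

```python
def accuracy_summary(no_change_correct, no_change_incorrect,
                     change_correct, change_incorrect):
    """Derive with/without-module accuracy from the four manual categories."""
    total = (no_change_correct + no_change_incorrect
             + change_correct + change_incorrect)
    # Correct without the module: anaphors left alone and correct,
    # plus anaphors that were correct until the module changed them.
    without_correct = no_change_correct + change_incorrect
    # Correct with the module: anaphors left alone and correct,
    # plus anaphors the module changed to the correct form.
    with_correct = no_change_correct + change_correct
    return {
        "total": total,
        "without_correct": without_correct,
        "with_correct": with_correct,
        "without_accuracy": round(100 * without_correct / total, 2),
        "with_accuracy": round(100 * with_correct / total, 2),
    }


# Spanish-English counts from the results reported above.
spa_eng = accuracy_summary(33, 53, 32, 2)
print(spa_eng)
```

With the Spanish-English counts this gives 35 and 65 correct anaphors out of 120, i.e. 29.17% and 54.17%, matching the figures reported above.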

Observations

  • Many of the errors occur because the tagger assigns the singular tag to group nouns such as Parliament, Commission, Group. If this is fixed, the results should improve significantly.
  • Since the module only outputs his/her/their right now, none of the examples with its have been resolved. Adding support for its should improve the results as well.
  • The indicators one uses are corpus dependent. This corpus contains dialogue, so we added an impeding indicator to patterns such as <NP> <comma>, because that NP is usually the addressee.

For detailed observations, refer to the Complete Evaluation.

Catalan - Italian

A corpus was created from a freely available journal, and random paragraphs were analysed.

In total, 108 cases of anaphora involving the Catalan third-person possessive determiner were analysed when translating into Italian. What matters in this case is the number of the referent, not its gender. Without the Anaphora Resolution module, the referent is always chosen to be singular.

Results

  • No Change, Correct: Anaphora Resolution didn't change the anaphor, and it is correct.
  • No Change, Incorrect: Anaphora Resolution didn't change the anaphor, and it is incorrect, i.e. it should have changed.
  • Change, Correct: Anaphora Resolution changed the anaphor, and it is now correct (was incorrect earlier).
  • Change, Incorrect: Anaphora Resolution changed the anaphor, and it is now incorrect (was correct earlier).

No Change, Correct   No Change, Incorrect   Change, Correct   Change, Incorrect
76                   13                     5                 14

Number of anaphors translated correctly without the Anaphora Resolution module and with:

Total 3rd Person Anaphors   Correct Without Anaphora Resolution   Correct With Anaphora Resolution
108                         90                                    81
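The Catalan-Italian totals can be cross-checked against the four categories, and the same counts also show the net effect of the module's changes. This is a minimal sketch using only the counts reported above; the variable names are hypothetical, and the percentages are plain arithmetic on those counts rather than figures taken from the evaluation itself.

```python
# Counts from the Catalan-Italian results reported above.
counts = {
    "no_change_correct": 76,
    "no_change_incorrect": 13,
    "change_correct": 5,
    "change_incorrect": 14,
}

total = sum(counts.values())

# Baseline (always singular): correct cases the module left alone,
# plus cases that were correct before the module changed them.
baseline_correct = counts["no_change_correct"] + counts["change_incorrect"]

# With the module: correct cases left alone, plus cases it fixed.
module_correct = counts["no_change_correct"] + counts["change_correct"]

# Net effect of the changes the module made (fixes minus regressions).
net_change = counts["change_correct"] - counts["change_incorrect"]

baseline_accuracy = round(100 * baseline_correct / total, 2)
module_accuracy = round(100 * module_correct / total, 2)

print(total, baseline_correct, module_correct, net_change)
print(baseline_accuracy, module_accuracy)
```

On this corpus the module makes more harmful changes than helpful ones, which accounts for the drop from 90 to 81 correct anaphors.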

Observations

  • In this corpus, just choosing singular gives correct translations in 90/108 examples, so singular and plural referents aren't evenly distributed.
  • While the Anaphora Resolution module gives worse results here, the configuration can be tuned to give much better results for this corpus.

For detailed observations, refer to the Complete Evaluation and go to Catalan-Italian.

Future Ideas

There are several ideas for improving this module that I'll be trying my hand at after GSoC.

These are all listed in the issues of the module repository: https://github.com/apertium/apertium-anaphora/issues