Ideas for Google Summer of Code/automatic-postediting
Improving language pairs by mining MediaWiki Content Translation postedits
Implement a toolkit that mines existing machine translation postediting data in [MediaWiki Content Translation https://www.mediawiki.org/wiki/Content_translation] to generate (as automatically and as completely as possible) monodix and bidix entries that improve the performance of an Apertium language pair. Data is available from Wikimedia Content Translation through an [API https://www.mediawiki.org/wiki/Content_translation/Published_translations#API] or as [Dumps https://dumps.wikimedia.org/other/contenttranslation/] in JSON and TMX format. This project is rather experimental and involves some research in addition to coding.
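As a starting point for reading the TMX dumps, here is a minimal sketch that lists the language variants of each translation unit. It relies only on the generic TMX structure (tu, tuv and seg elements, with the language in xml:lang). The exact layout of the Content Translation dumps has not been checked here, so the sample document and the order of its variants (source, MT output, postedited text) are assumptions made for illustration, and read_tmx_units is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_units(tmx_text):
    """Return one list of (language, text) pairs per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        variants = []
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline markup.
            text = "".join(seg.itertext()).strip() if seg is not None else ""
            variants.append((tuv.get(XML_LANG, ""), text))
        units.append(variants)
    return units


# Hypothetical sample: the real dumps may order or label variants differently.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="es"><seg>La casa es grande</seg></tuv>
  <tuv xml:lang="en"><seg>The house is big</seg></tuv>
  <tuv xml:lang="en"><seg>The house is large</seg></tuv>
</tu>
</body></tmx>"""

if __name__ == "__main__":
    for unit in read_tmx_units(SAMPLE):
        print(unit)
```

For the full dumps, which can be large, a streaming parser such as ET.iterparse would probably be a better fit than loading the whole document at once.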
The first phase would produce a set of postediting operators (s, MT(s), t) from three files, each with one sentence per line: a source file (s), a machine-translated file (MT(s)) and a postedited file (t).
There is code for this in []; the process is described in the file rationale.md there.
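To make the triplet idea concrete, here is a minimal sketch of how (s, MT(s), t) triplets could be built from three line-aligned inputs, and how the word-level differences between MT(s) and t could be listed. This is not the existing code referred to above. It uses Python's difflib as a stand-in alignment method, and the names Triplet, make_triplets and word_edits are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    source: str      # s
    mt: str          # MT(s)
    postedited: str  # t


def make_triplets(src_lines: Iterable[str],
                  mt_lines: Iterable[str],
                  pe_lines: Iterable[str]) -> Iterator[Triplet]:
    """Pair up the three inputs line by line (open file objects work too).

    zip() stops at the shortest input, so files of different lengths
    are silently truncated; check the line counts beforehand.
    """
    for s, m, t in zip(src_lines, mt_lines, pe_lines):
        yield Triplet(s.strip(), m.strip(), t.strip())


def word_edits(mt: str, postedited: str) -> List[Tuple[str, str]]:
    """List the (MT fragment, postedited fragment) pairs that differ."""
    a, b = mt.split(), postedited.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # For insertions the MT side is empty; for deletions the other.
            edits.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return edits


if __name__ == "__main__":
    triplets = make_triplets(["tomó una decisión"],
                             ["he took a decision"],
                             ["he made a decision"])
    for tr in triplets:
        print(tr.source, "|", word_edits(tr.mt, tr.postedited))
```

Each extracted fragment pair, together with the source sentence it came from, is a candidate postediting operator that later steps can try to classify.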
Study and implement ways to turn these triplets into information that can be inserted into the Apertium language pair. There are various points where it can be inserted:
- *dictionaries*: the process may discover multi-word lexical units that need to be added to improve a literal translation
- *constraint grammar rules*: the process may discover words that have been incorrectly disambiguated, and they may be turned into constraint grammar rules
- *lexical selection rules*: the process may discover words that have been translated in the wrong sense, and they may be turned into lexical selection rules
The idea would be to identify safe postediting rules that can clearly be analysed as being examples of one of these cases and be turned into linguistic data to be inserted (more or less automatically) into the pair.
- Understand the problem and the code.
- Prepare source, Apertium MT-translated, and reference files (perhaps from an existing corpus) for the language pair of choice. Split them into a large training set and a smaller test set.
- Make the code in [] work to extract triplets.
- Apply these triplets to the test set, and see if there is improvement.
- This uses code by User:Pankajksharma from his GSoC project.
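The evaluation step in the list above could be sketched as follows: split the data, apply the extracted edits to the MT output of the test set, and compare word error rate against the reference before and after. The splitting scheme, the naive whole-word replacement and all function names are assumptions made for illustration; a real evaluation would more likely use an established metric implementation.

```python
def split_train_test(items, test_every=10):
    """Deterministic split: every test_every-th item goes to the test set."""
    train, test = [], []
    for i, item in enumerate(items):
        (test if i % test_every == 0 else train).append(item)
    return train, test


def apply_rules(mt, rules):
    """Naively replace whole-word MT fragments by their postedited form.

    rules maps an MT fragment to its replacement. Empty fragments are
    skipped, and immediately adjacent repeats of a fragment may be missed.
    """
    padded = f" {mt} "
    for old, new in rules.items():
        if old:
            padded = padded.replace(f" {old} ", f" {new} ")
    return " ".join(padded.split())


def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # reference word missing
                           cur[j - 1] + 1,           # extra hypothesis word
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    rules = {"took": "made"}  # hypothetical rule learned from training data
    mt, ref = "he took a decision", "he made a decision"
    print(word_error_rate(mt, ref))
    print(word_error_rate(apply_rules(mt, rules), ref))
```

A drop in the average word error rate over the test set after applying the rules would suggest that the extracted edits generalise beyond the training data.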