Ideas for Google Summer of Code/Backpropagation

From Apertium

This project is kind of like backpropagation, but each layer of the network/pipeline is different, and this probably needs to be taken into account.

You could probably get a lot of the information by stepping through the pipeline in reverse, but you also need to pay attention to what each module does.

Some thoughts:

  • For comparison, you'll need to analyze the reference sentence; if you get unknowns, that's a target monodix issue
  • If you have the right words in the wrong order, that's probably transfer
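  • If the correct word was present after bidix but not after lexical selection, then lexical selection is the issue
  • Similarly for morphological disambiguation: if the correct analysis was present before disambiguation but not after it, the disambiguator is the issue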
  • A gender issue on pronouns or subject agreement on verbs might be anaphora (might have to restrict this logic to only pairs that use apertium-anaphora and words with anaphora attached to them)
  • If it's structurally correct but has the wrong lemma, it's probably bidix
  • If there's an unknown word mark in the output, it's the analyzer

Also, problems may have more than one plausible solution!
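
The thoughts above could be sketched as a small rule table that maps observations about one error to the modules worth checking. This is only a rough illustration: the Evidence class, its field names, and the suspect_modules function are all hypothetical, and working out each observation from real pipeline output is the actual project. The function returns a list rather than a single answer because a problem may have more than one plausible solution.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical observations about one error; all field names are illustrative."""
    reference_has_unknowns: bool = False       # analyzing the reference gave unknowns
    output_has_unknown_mark: bool = False      # unknown word mark in the output
    right_words_wrong_order: bool = False
    lost_in_lexical_selection: bool = False    # present after bidix, gone after lexical selection
    lost_in_disambiguation: bool = False       # present before disambiguation, gone after
    agreement_mismatch: bool = False           # pronoun gender or subject agreement on verbs
    pair_uses_anaphora: bool = False           # pair uses apertium-anaphora on this word
    wrong_lemma_right_structure: bool = False


def suspect_modules(ev: Evidence) -> list[str]:
    """Return every module the heuristics point at (possibly more than one)."""
    rules = [
        (ev.reference_has_unknowns, "target monodix"),
        (ev.output_has_unknown_mark, "analyzer"),
        (ev.right_words_wrong_order, "transfer"),
        (ev.lost_in_lexical_selection, "lexical selection"),
        (ev.lost_in_disambiguation, "morphological disambiguation"),
        # Only blame anaphora when the pair actually uses it.
        (ev.agreement_mismatch and ev.pair_uses_anaphora, "anaphora"),
        (ev.wrong_lemma_right_structure, "bidix"),
    ]
    return [module for fired, module in rules if fired]


if __name__ == "__main__":
    ev = Evidence(agreement_mismatch=True, pair_uses_anaphora=True,
                  wrong_lemma_right_structure=True)
    print(suspect_modules(ev))
```

Keeping the rules as independent checks means a single error can point at several modules at once, which leaves the final decision to whoever (or whatever) ranks the suspects.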

Coding Challenge

Write a script that takes a bilingual corpus and runs it through the pipeline twice, once in full and once only up to bidix. It should identify words with lexical selection errors (that is, the desired lemma+POS was in the bidix output but not in the final output) and print them out with 3 words of context on either side.
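
As a starting point, here is a minimal sketch of the comparison step only. It assumes three strings are already available for one sentence, all in the Apertium stream format (^...$ units with readings separated by /): the bidix output, the full-pipeline output just before generation, and the analyzed reference. Running the pipeline to obtain them is left out. The function names and the toy sentence are made up for illustration, the parsing ignores escaped characters, and the matching is a plain bag-of-words check with no word alignment, so it will miss or misreport some cases. Context is printed as source lemmas rather than surface words.

```python
import re

UNIT = re.compile(r"\^(.*?)\$")  # one lexical unit: ^ ... $


def readings(unit):
    """Split one unit into (lemma, first tag) pairs, one per /-separated reading."""
    out = []
    for part in unit.split("/"):
        lemma, _, rest = part.partition("<")
        pos = rest.split(">", 1)[0] if rest else ""
        out.append((lemma, pos))
    return out


def units(stream):
    """Parse a whole stream into a list of units, each a list of readings."""
    return [readings(u) for u in UNIT.findall(stream)]


def lexsel_errors(biltrans, final, reference, width=3):
    """Find source words whose desired translation was in bidix output but not the final output."""
    bil = units(biltrans)
    produced = {r for u in units(final) for r in u}
    # In analyzer output the first part of a unit is the surface form, so skip it.
    desired = {r for u in units(reference) for r in u[1:]}
    source = [u[0][0] for u in bil]  # source lemma of each bidix unit
    errors = []
    for i, u in enumerate(bil):
        # u[0] is the source reading; the rest are target candidates.
        lost = [c for c in u[1:] if c in desired and c not in produced]
        if lost:
            left = source[max(0, i - width):i]
            right = source[i + 1:i + 1 + width]
            errors.append((left, source[i], right, lost))
    return errors


if __name__ == "__main__":
    # Toy, hand-written streams (not real pipeline output).
    biltrans = ("^the<det><def><sp>/el<det><def><GD><ND>$ "
                "^station<n><sg>/estación<n><f><sg>/emisora<n><f><sg>$")
    final = "^el<det><def><f><sg>$ ^emisora<n><f><sg>$"
    reference = "^la/el<det><def><f><sg>$ ^estación/estación<n><f><sg>$"
    for left, word, right, lost in lexsel_errors(biltrans, final, reference):
        print(" ".join(left), "[" + word + "]", " ".join(right), "wanted:", lost)
```

A real solution would also need to decide what to do when the same lemma+POS appears more than once in a sentence, since a set-based comparison cannot tell those occurrences apart.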