Ideas for Google Summer of Code/Backpropagation
This project is kind of like backpropagation, but each layer of the network/pipeline is different, and this probably needs to be taken into account.
You could probably get a lot of the information by stepping through the pipeline in reverse, but there's also some need to pay attention to what each module does.
- For comparison, you'll need to analyze the reference sentence; if you get unknowns, that's a target monodix issue
- If you have the right words in the wrong order, that's probably transfer
- If the correct word was present after bidix but not after lexical selection, then lexical selection is the issue
- Similarly for morphological disambiguation
- A gender issue on pronouns or subject agreement on verbs might be anaphora (might have to restrict this logic to only pairs that use apertium-anaphora and words with anaphora attached to them)
- If it's structurally correct but the wrong lemma, probably bidix
- An unknown word mark in the output? That's the analyzer
Also, problems may have more than one plausible solution!
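A few of the heuristics above can be sketched as a function that returns every module worth suspecting for one sentence. This is only a rough illustration, not a design for the project. The function name, the labels and the input shape are all hypothetical: each sentence is assumed to be already reduced to a list of (lemma, POS) pairs, with unknown words carrying a leading `*` on the lemma and an empty POS. Morphological disambiguation and anaphora are left out.

```python
def suspects(reference, output, bidix_options):
    """Return the modules worth checking for one sentence pair.

    reference     -- (lemma, pos) pairs from analyzing the reference sentence
    output        -- (lemma, pos) pairs from the pipeline, before generation
    bidix_options -- set of (lemma, pos) pairs offered by the bidix
    All three shapes are assumptions made for this sketch.
    """
    found = []
    # Unknowns in the analyzed reference point at the target monodix.
    if any(lemma.startswith("*") for lemma, _ in reference):
        found.append("target monodix")
    # Unknown word marks in the pipeline output point at the analyzer.
    if any(lemma.startswith("*") for lemma, _ in output):
        found.append("analyzer")
    known_ref = [u for u in reference if not u[0].startswith("*")]
    # Right words, wrong order: probably transfer.
    if sorted(known_ref) == sorted(output) and known_ref != output:
        found.append("transfer")
    # A wanted word that is missing: lexical selection if the bidix
    # offered it, otherwise the bidix itself.
    for unit in known_ref:
        if unit in output:
            continue
        label = "lexical selection" if unit in bidix_options else "bidix"
        if label not in found:
            found.append(label)
    return found


if __name__ == "__main__":
    ref = [("el", "det"), ("casa", "n"), ("blanco", "adj")]
    out = [("el", "det"), ("blanco", "adj"), ("casa", "n")]
    print(suspects(ref, out, set()))
```

Returning a list rather than a single answer keeps room for sentences with more than one plausible fix.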
Write a script that takes a bilingual corpus, runs it both through the full pipeline and just up to bidix, and identifies words with lexical selection errors (that is, the desired lemma+POS was in the bidix output but not the final output). Print them out with 3 words of context on either side.
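As a starting point, the comparison step might look like the sketch below. It assumes you already have three strings per sentence in Apertium stream format: the output up to bidix (`^source/target1/target2$`), the pipeline's final output before generation, and the analysis of the reference sentence. Running the pipeline to get those strings is left out, and so is handling of escaped characters and multiword units. All function names here are hypothetical.

```python
import re

LU_RE = re.compile(r"\^(.*?)\$")  # one lexical unit: ^...$


def lemma_pos(reading):
    """Return (lemma, first tag) for a reading such as 'bank<n><sg>'."""
    m = re.match(r"([^<]*)(?:<([^>]+)>)?", reading)
    return (m.group(1), m.group(2) or "")


def readings(stream):
    """Collect the (lemma, pos) of every reading in a stream."""
    return {lemma_pos(r) for body in LU_RE.findall(stream)
            for r in body.split("/")}


def lexsel_errors(biltrans, final, reference, context=3):
    """Find source words whose desired translation was offered by the
    bidix (and is in the reference) but is missing from the final output."""
    units = []
    for body in LU_RE.findall(biltrans):
        src, *targets = body.split("/")
        units.append((lemma_pos(src)[0], [lemma_pos(t) for t in targets]))
    wanted, produced = readings(reference), readings(final)
    words = [src for src, _ in units]
    errors = []
    for i, (src, targets) in enumerate(units):
        missed = [t for t in targets if t in wanted and t not in produced]
        if missed:
            left = words[max(0, i - context):i]
            right = words[i + 1:i + 1 + context]
            errors.append((left, src, right, missed))
    return errors


if __name__ == "__main__":
    bil = ("^the<det><def>/el<det><def>$ ^bank<n><sg>/banco<n><sg>/orilla<n><sg>$ "
           "^of<pr>/de<pr>$ ^the<det><def>/el<det><def>$ ^river<n><sg>/río<n><sg>$")
    fin = "^el<det><def>$ ^banco<n><sg>$ ^de<pr>$ ^el<det><def>$ ^río<n><sg>$"
    ref = "^el<det><def>$ ^orilla<n><f><sg>$ ^de<pr>$ ^el<det><def>$ ^río<n><m><sg>$"
    for left, src, right, missed in lexsel_errors(bil, fin, ref):
        wanted = ", ".join(f"{lemma}<{pos}>" for lemma, pos in missed)
        print(" ".join(left), f"[{src}]", " ".join(right), "-> wanted", wanted)
```

The context is printed from the source side, since that is where a lexical selection rule would be written. Matching on sets of readings ignores word position, so a sentence that uses the same lemma twice can hide an error.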