Multi-engine translation synthesiser

Principles

Make maximum use of the resources available for marginalised languages: parallel corpora, user feedback, other translation systems(?).
Integrate seamlessly into the Apertium pipeline.
Be modular: it should work with Moses, but any other source of translations or phrases should also be considered.

Ideas

Statistical post-edition

This idea is similar to the TMX support in Apertium, except that it goes at the end of the pipeline.

A first approximation would be to:
- Take a parallel corpus, e.g. Welsh--English, and run the Welsh side through Apertium. Pairing the resulting English(MT) output with the reference English side gives an English(MT)--English phrase table.
- Write a program that goes at the end of the pipeline and looks up n-gram segments (what SMT people call "phrases") in the phrase table.
- If it finds a matching phrase, it scores both the original segment and the alternative with a language model and chooses the one with the highest probability.
- This idea can be extended by incorporating user feedback. For example, when a user post-edits a phrase, the corrected phrase can be added to the phrase table with a given probability.
- It could also help by resolving some unknown words.
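The lookup-and-score steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `lookup_alternatives`, `BigramLM` and `best_translation` are hypothetical, the phrase table is a plain dictionary rather than a Moses table, the language model is a toy add-one-smoothed bigram model, and all candidates are enumerated exhaustively, which would not scale to long sentences.

```python
import itertools
import math
from collections import Counter


def lookup_alternatives(tokens, phrase_table, max_n=3):
    """Greedy longest match: split tokens into segments, each a list of candidates.

    The first candidate of every segment is the original MT output.
    """
    segments = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            src = " ".join(tokens[i:i + n])
            if src in phrase_table:
                alts = [t for t in phrase_table[src] if t != src]
                segments.append([src] + alts)
                i += n
                break
        else:
            # No n-gram starting here is in the table: keep the token as-is.
            segments.append([tokens[i]])
            i += 1
    return segments


class BigramLM:
    """Toy bigram language model with add-one smoothing (assumption, not a real LM)."""

    def __init__(self, sentences):
        self.history = Counter()
        self.bigrams = Counter()
        self.vocab = {"<s>", "</s>"}
        for sentence in sentences:
            toks = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(toks)
            self.history.update(toks[:-1])
            self.bigrams.update(zip(toks, toks[1:]))

    def logprob(self, sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        v = len(self.vocab)
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.history[a] + v))
            for a, b in zip(toks, toks[1:])
        )


def best_translation(segments, lm):
    """Try every combination of candidates and keep the most probable sentence."""
    candidates = (" ".join(choice) for choice in itertools.product(*segments))
    return max(candidates, key=lm.logprob)


# Hypothetical toy data, lowercased for simplicity.
table = {"automatic translation": ["machine translation"], "test": ["proof"]}
lm = BigramLM([
    "this is a test",
    "the machine translation system works",
    "machine translation is hard",
])
mt_output = "this is a test of the automatic translation system".split()
segments = lookup_alternatives(mt_output, table)
print(best_translation(segments, lm))
# → this is a test of the machine translation system
```

A real implementation would presumably also use the phrase-table probabilities when scoring, and a beam or lattice search instead of full enumeration.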

The format would be something like:

$ echo "This is a [<em>]test[<\/em>] of the automatic translation system" | apertium-phrase-lookup phrase-table.0-0,2 
This is a [<em>]{test|proof}[<\/em>] of the {automatic translation|machine translation} system.

$ echo "This is a [<em>]test[<\/em>] of the automatic translation system" | apertium-phrase-lookup phrase-table.0-0,2  | apertium-ranker en.blm
This is a [<em>]test[<\/em>] of the machine translation system.
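To make the intermediate format concrete, here is a minimal sketch of how a ranker might consume the `{a|b}` alternatives while leaving the superblanks (`[<em>]`, `[<\/em>]`) in place. The functions `expand` and `rank` and the toy scoring function are hypothetical, and the regular expressions assume that the text itself contains no escaped brackets or braces, which a real Apertium stream parser would have to handle.

```python
import itertools
import re

ALTERNATIVES = re.compile(r"\{([^{}]*)\}")   # matches {a|b|c}
SUPERBLANK = re.compile(r"\[[^\]]*\]")        # matches [<em>] and similar


def expand(annotated):
    """Yield every string obtained by choosing one option per {a|b} group."""
    # re.split with a capture group puts the group bodies at odd indices.
    parts = ALTERNATIVES.split(annotated)
    choices = [p.split("|") if i % 2 else [p] for i, p in enumerate(parts)]
    for combo in itertools.product(*choices):
        yield "".join(combo)


def rank(annotated, score):
    """Return the best expansion; superblanks are hidden from the scorer only."""
    return max(expand(annotated), key=lambda s: score(SUPERBLANK.sub("", s)))


def toy_score(text):
    """Stand-in for a language model score (assumption for illustration)."""
    return ("machine translation" in text) + (" test " in text)


line = (r"This is a [<em>]{test|proof}[<\/em>] of the "
        r"{automatic translation|machine translation} system.")
print(rank(line, toy_score))
# → This is a [<em>]test[<\/em>] of the machine translation system.
```

Keeping the superblanks in the output but stripping them before scoring means formatting survives the ranker without affecting which alternative is chosen.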

Issues:

Speed: language models and phrase tables are slow, but we can discard much of the phrase table.[1] Apertium translates at ~3,000 words/sec; we will want to see whether we can still translate at ~1,000 words/sec or more.
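One way to discard phrase pairs is a significance filter on co-occurrence counts. The sketch below is loosely modelled on the kind of pruning discussed in Johnson et al. (2007), using the upper tail of Fisher's exact test; the function names, the count layout and the threshold value are assumptions for illustration and may differ from the paper.

```python
import math


def neg_log_pvalue(c_st, c_s, c_t, n):
    """-log p of seeing at least c_st co-occurrences by chance.

    c_st: sentence pairs containing both source and target phrase
    c_s:  sentence pairs containing the source phrase
    c_t:  sentence pairs containing the target phrase
    n:    total number of sentence pairs
    """
    total = math.comb(n, c_t)
    tail = sum(
        math.comb(c_s, k) * math.comb(n - c_s, c_t - k)
        for k in range(c_st, min(c_s, c_t) + 1)
    )
    # Work with logs of the big integers to avoid floating-point underflow.
    return math.log(total) - math.log(tail)


def prune(counts, n, threshold):
    """Keep only phrase pairs whose association is significant enough."""
    return {
        pair: c for pair, c in counts.items()
        if neg_log_pvalue(*c, n) >= threshold
    }


# Hypothetical counts over a 20-sentence corpus: (c_st, c_s, c_t).
counts = {
    ("automatic translation", "machine translation"): (8, 9, 8),
    ("test", "proof"): (1, 5, 4),
}
kept = prune(counts, n=20, threshold=5.0)
print(list(kept))
# → [('automatic translation', 'machine translation')]
```

A smaller table is faster to load and query, which is the point of pruning here; the right threshold would have to be found empirically.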

Why this is good:

For languages like Welsh and Breton, where there is some parallel text but not a lot, we can take advantage of what exists. We can also benefit from user improvements without having to improve the engine directly.

Notes

[1] Johnson et al. (2007)

References

Johnson, J.H., Martin, J., Foster, G., and Kuhn, R. (2007) "Improving Translation Quality by Discarding Most of the Phrasetable". Proceedings of EMNLP 2007. NRC 49348.
