Difference between revisions of "Multi-engine translation synthesiser"

Revision as of 17:26, 29 March 2009

Principles

Make maximum use of the resources available for marginalised languages: parallel corpora, user feedback, other translation systems.
Integrate seamlessly into the Apertium pipeline.
Be modular: it should work with Moses, but any other source of translations or phrases should also be considered.

Ideas

Statistical post-edition

This idea is similar to the TMX support in Apertium, except that it runs at the end of the pipeline.

A first approximation would be:
- Take a parallel corpus, e.g. Welsh-English, then run the Welsh side through Apertium to get an English(MT)-English phrase table.
- Make a program that goes at the end of the pipeline and looks up n-gram segments (what SMT people call "phrases") in the phrase table.
- If it finds a matching phrase, it scores both the original and the alternative on a language model and chooses the one with the higher probability.
- The idea can be extended by incorporating user feedback. For example, a user post-edits a phrase, and that phrase is added to the phrase table at a given probability.
- It could also help by resolving some unknown words.
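The lookup and user-feedback steps above can be sketched roughly as follows. This is a minimal illustration, not an existing Apertium tool: the in-memory `PHRASE_TABLE`, the function names `annotate` and `add_user_phrase`, the default probability of 0.5 and the simplified handling of superblanks (anything in square brackets is passed through untouched) are all assumptions made for the example. Alternatives are marked inline as `{original|alternative}`.

```python
import re

# Hypothetical phrase table: MT phrase -> [(alternative, probability), ...].
PHRASE_TABLE = {
    "test": [("proof", 0.3)],
    "automatic translation": [("machine translation", 0.6)],
}

# Simplified superblank matcher: anything in square brackets is formatting.
SUPERBLANK = re.compile(r"(\[[^\]]*\])")


def add_user_phrase(table, mt_phrase, corrected, prob=0.5):
    """Record a user post-edit as a new alternative (prob is an assumed default)."""
    table.setdefault(mt_phrase, []).append((corrected, prob))


def annotate(line, table):
    """Mark every phrase found in the table as {original|alt1|alt2...}."""
    if not table:
        return line
    # Longest phrases first, so multi-word entries win over their sub-phrases.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")

    def mark(match):
        src = match.group(0)
        return "{" + "|".join([src] + [alt for alt, _ in table[src]]) + "}"

    # re.split with a capture group alternates text and superblanks;
    # odd-indexed parts are the superblanks, which are copied through.
    parts = SUPERBLANK.split(line)
    return "".join(p if i % 2 else pattern.sub(mark, p) for i, p in enumerate(parts))
```

A real implementation would read a Moses-style phrase table from disk and would need a faster lookup structure than a regular expression rebuilt on every call; the sketch only shows the shape of the step.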

The format would be something like:

$ echo "This is a [<em>]test[<\/em>] of the automatic translation system" | apertium-phrase-lookup phrase-table.0-0,2
This is a [<em>]{test|proof}[<\/em>] of the {automatic translation|machine translation} system

$ echo "This is a [<em>]test[<\/em>] of the automatic translation system" | apertium-phrase-lookup phrase-table.0-0,2 | apertium-ranker en.blm
This is a [<em>]test[<\/em>] of the machine translation system
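The ranking stage in the second command could look something like the sketch below. It is only an illustration under stated assumptions: the add-one-smoothed bigram model (`BigramLM`), the three-sentence training corpus and the function name `rank` are invented for the example and stand in for whatever a real `apertium-ranker` and `en.blm` would provide. Superblanks are stripped before scoring and kept in the output.

```python
import itertools
import math
import re
from collections import Counter

SUPERBLANK = re.compile(r"\[[^\]]*\]")  # simplified: anything in square brackets
CHOICE = re.compile(r"\{([^{}]*)\}")    # one {a|b|c} group of alternatives


class BigramLM:
    """Toy add-one-smoothed bigram language model (assumption for this sketch)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            words = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, text):
        words = ["<s>"] + text.lower().split() + ["</s>"]
        total = 0.0
        for prev, cur in zip(words, words[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.unigrams[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def rank(line, lm):
    """Return the combination of alternatives with the highest LM score."""
    options = [m.group(1).split("|") for m in CHOICE.finditer(line)]
    best, best_score = None, -math.inf
    for combo in itertools.product(*options):
        picks = iter(combo)
        # Fill each {..} group, left to right, with this combination's choice.
        candidate = CHOICE.sub(lambda m: next(picks), line)
        # Score the text without formatting superblanks.
        score = lm.logprob(SUPERBLANK.sub(" ", candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical training data, far too small for real use.
LM = BigramLM([
    "this is a test of the system",
    "the machine translation system is fast",
    "machine translation is a test",
])
```

Enumerating every combination grows exponentially with the number of marked phrases, so this only suits short segments; a practical ranker would more likely score each group in its local context or use a beam.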

Issues:

Speed: language models and phrase tables are slow, but we can discard a large part of the phrase table [1]. Apertium translates at ~3,000 words/sec, and we'll want to see how we can still translate at ~1,000 words/sec or more. We can probably make do with a much smaller language model than SMT requires, partly because we have more reason to expect that the output is already correct, regardless of the statistics. For example, the translation in the bilingual dictionary is already the most frequent or most general one.
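As a rough illustration of shrinking the phrase table, the sketch below keeps only the few most probable alternatives per phrase and drops anything below a probability floor. This is a plain threshold-and-top-k filter chosen for brevity, not the significance-based pruning of Johnson et al. (2007); the function name `prune`, the table layout and the default cut-offs are assumptions.

```python
def prune(table, min_prob=0.1, top_k=2):
    """Keep at most top_k alternatives per phrase, each with prob >= min_prob.

    table maps an MT phrase to a list of (alternative, probability) pairs.
    Phrases left with no alternatives are removed entirely.
    """
    pruned = {}
    for src, alternatives in table.items():
        kept = sorted(
            (alt for alt in alternatives if alt[1] >= min_prob),
            key=lambda alt: alt[1],
            reverse=True,
        )[:top_k]
        if kept:
            pruned[src] = kept
    return pruned
```

Since the rule-based output is expected to be right most of the time, fairly aggressive cut-offs may be acceptable here, but the right values would have to be found empirically.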

Why this is good:

For languages like Welsh and Breton where there isn't a lot of parallel text, but there is some, we can take advantage of it. We can also take advantage of user improvements without having to improve the engine directly.

Notes

[1] Johnson et al. (2007)

References

Johnson, J.H., Martin, J., Foster, G., and Kuhn, R. (2007). "Improving Translation Quality by Discarding Most of the Phrasetable". Proceedings of EMNLP. NRC 49348.

@@ Line 4: / Line 4: @@
 ==Principles==
-* Make maximum usage of available resources for marginalised languages. Parallel corpora, user-feedback, other translation systems(?)
+* Make maximum usage of available resources for marginalised languages. Parallel corpora, user-feedback, other translation systems.
 * Integrate seamlessly into the Apertium pipeline.
 * Be modular, it should work with Moses, but any other source of translations / phrases should be also considered.
