Multi-engine translation synthesiser

The idea of this project is to take advantage of all available resources when creating MT systems for marginalised languages, by combining the output of several MT systems into one "better" translation. The baseline would be to combine Apertium and Moses.

Ideas

Statistical post-edition

This idea is similar to the TMX support in Apertium, except that it goes at the end of the pipeline.

  • The first approximation would be to:
    • Take a parallel corpus, e.g. Welsh--English, and run the Welsh side through Apertium to get an English(MT)--English phrase table.
    • Make a program that goes at the end of the pipeline and looks up n-gram segments (what SMT people call "phrases") in the phrase table (a sketch of such a lookup stage follows this list).
    • If it finds a matching phrase, it scores both the original and the alternative on a language model and chooses the one with the highest probability.
    • This idea can be extended by incorporating user feedback. For example, a user "post-edits a phrase", and these edited phrases can be added to the phrase table with a given probability.
    • It could also help by resolving some unknown words.

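A minimal sketch, in Python, of what such a lookup stage could do. The phrase-table format ("source ||| target ||| p(t|s)" per line), the greedy longest-match strategy and all names here are assumptions for illustration, not the actual apertium-phrase-lookup implementation:

# Sketch of a phrase-lookup stage: find n-gram segments of the MT output
# in a phrase table and annotate them with alternatives as {original|alt}.
# Hypothetical and simplified: it ignores format markers such as [<em>],
# tokenisation and casing.

def load_phrase_table(path):
    """Read 'source ||| target ||| p(t|s)' lines into a dict."""
    table = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            src, tgt, prob = line.rstrip("\n").split(" ||| ")[:3]
            table.setdefault(src, []).append((tgt, float(prob)))
    return table

def annotate(sentence, table, max_n=4, max_alts=2):
    """Greedily annotate the longest matching n-grams, left to right."""
    words = sentence.split()
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_n, len(words) - i), 0, -1):
            segment = " ".join(words[i:i + n])
            if segment in table:
                alts = [t for t, p in sorted(table[segment], key=lambda x: -x[1])
                        if t != segment][:max_alts]
                out.append("{%s}" % "|".join([segment] + alts) if alts else segment)
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

With the phrase table held in memory this is one dictionary lookup per n-gram, which is what should keep the stage fast enough for the speeds discussed below.
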
Issues: Speed. Language models and phrase tables are slow, but most of the phrase table can be discarded.[1] Apertium translates at ~3,000 words/sec; we'll want to make sure the combined pipeline still translates at ~1,000 words/sec or more.

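Johnson et al. discard phrase pairs by significance filtering; a much cruder sketch of the same idea, keeping only the top few translations per source phrase above a probability threshold, is below. The table format and the thresholds are assumptions:

# Crude phrase-table pruning: keep at most top_k translations per source
# phrase, and only those above min_prob. (Johnson et al. 2007 use
# significance filtering; this is only a stand-in to keep the table small
# and lookup fast.)

def prune(in_path, out_path, top_k=4, min_prob=0.05):
    by_source = {}
    with open(in_path, encoding="utf-8") as fh:
        for line in fh:
            src, tgt, prob = line.rstrip("\n").split(" ||| ")[:3]
            by_source.setdefault(src, []).append((float(prob), tgt))
    with open(out_path, "w", encoding="utf-8") as out:
        for src, cands in by_source.items():
            cands.sort(reverse=True)
            for prob, tgt in cands[:top_k]:
                if prob >= min_prob:
                    out.write("%s ||| %s ||| %g\n" % (src, tgt, prob))
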
The format would be something like:

$ echo "This is a [<em>]test[<\/em>] of the automatic translation system" | apertium-phrase-lookup phrase-table.0-0,2 
This is a [<em>]{test|proof}[<\/em>] of the {automatic translation|machine translation} system.

$ echo "This is a [<em>]test[<\/em>] of the automatic translation system" | apertium-phrase-lookup phrase-table.0-0,2  | apertium-ranker en.blm
This is a [<em>]test[<\/em>] of the machine translation system.

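A sketch of what the ranking step could do with that format: expand each {a|b|...} group into full candidate sentences and keep the one the language model scores highest. The expansion code and the lm_score callback are illustrative assumptions; en.blm above stands for whatever binary language model the ranker would actually load:

import itertools
import re

ALT = re.compile(r"\{([^{}]+)\}")

def candidates(annotated):
    """Expand every {a|b|...} group into all candidate sentences."""
    groups = [m.group(1).split("|") for m in ALT.finditer(annotated)]
    parts = ALT.split(annotated)          # text, group1, text, group2, ...
    for combo in itertools.product(*groups):
        picks = iter(combo)
        yield "".join(p if i % 2 == 0 else next(picks)
                      for i, p in enumerate(parts))

def rank(annotated, lm_score):
    """Return the candidate with the highest language-model score."""
    return max(candidates(annotated), key=lm_score)

# e.g. rank("This is a {test|proof} of the {automatic translation|"
#           "machine translation} system.", lm_score=my_lm.score)
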
Notes

  1. Johnson et al. (2007)

References

  • Johnson, J.H., Martin, J., Foster, G., and Kuhn, R. (2007) "Improving Translation Quality by Discarding Most of the Phrasetable". Proceedings of EMNLP. 2007. NRC 49348.