Multi-engine translation synthesiser

From Apertium
Latest revision as of 07:04, 10 May 2012

This project aims to take advantage of all available resources when creating MT systems for marginalised languages. The general idea is to combine the output of several MT systems into one "better" translation. The "baseline" would be to use Apertium and Moses.

Code at https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-combine

Principles

  • Make maximum use of the resources available for marginalised languages: parallel corpora, user feedback, and other translation systems.
  • Integrate seamlessly into the Apertium pipeline.
  • Be modular: it should work with Moses, but any other source of translations or phrases should also be considered.

Ideas

Statistical post-edition

This idea is similar to the TMX support in Apertium, except that it goes at the end of the pipeline.

  • The first approximation would be to:
    • Take a parallel corpus, e.g. Welsh–English, and run the Welsh side through Apertium to get an English(MT)–English phrase table.
    • Write a program that sits at the end of the pipeline and looks up n-gram segments (what SMT people call "phrases") in the phrase table.
    • If it finds a matching phrase, it scores both the original and the replacement with a language model and keeps the more probable one.
  • This idea can be extended by incorporating user feedback. For example, when a user post-edits a phrase, the edited phrase can be added to the phrase table with a given probability.
  • It could also help by resolving some unknown words.
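
The lookup step described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name annotate, the in-memory dictionary standing in for the phrase table, and the greedy longest-match strategy are all hypothetical, and the sketch ignores format markup such as [<em>].

```python
# Hypothetical sketch of the phrase lookup step; not the actual
# apertium-phrase-lookup implementation.

def annotate(sentence, phrase_table, max_n=3):
    """Greedy longest-match lookup of n-grams in a phrase table.

    Matched n-grams are rewritten as {original|alternative|...};
    everything else is passed through unchanged.
    """
    tokens = sentence.split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first, then shorter ones.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            alternatives = phrase_table.get(phrase)
            if alternatives:
                out.append("{" + "|".join([phrase] + alternatives) + "}")
                i += n
                break
        else:
            # No n-gram starting at this token is in the table.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Toy phrase table (assumed data, for illustration only).
PHRASE_TABLE = {
    "test": ["proof"],
    "automatic translation": ["machine translation"],
}

print(annotate("This is a test of the automatic translation system", PHRASE_TABLE))
```

A real phrase table would be far too large for a plain dictionary of strings, so a compact on-disk structure would probably be needed; the sketch only shows the matching logic.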

The format would be something like:

$ echo "This is a [<em>]test[<\/em>] of the automatic translation system" | apertium-phrase-lookup phrase-table.0-0,2 
This is a [<em>]{test|proof}[<\/em>] of the {automatic translation|machine translation} system.

$ echo "This is a [<em>]test[<\/em>] of the automatic translation system" | apertium-phrase-lookup phrase-table.0-0,2  | apertium-ranker en.blm
This is a [<em>]test[<\/em>] of the machine translation system.
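
A ranker along these lines might expand every {a|b} group and keep the candidate that a language model scores highest. The sketch below is an assumption-laden illustration: expand, BigramModel and rank are hypothetical names, the model is a toy add-one-smoothed bigram model trained on two made-up sentences rather than a real binary language model such as en.blm, and format markup is not handled.

```python
# Hypothetical sketch of the ranking step; not the actual apertium-ranker.
import itertools
import math
import re
from collections import Counter


def expand(annotated):
    """Yield every plain sentence encoded by {a|b} alternative groups."""
    parts = re.split(r"(\{[^{}]*\})", annotated)
    choices = [p[1:-1].split("|") if p.startswith("{") else [p] for p in parts]
    for combo in itertools.product(*choices):
        yield "".join(combo)


class BigramModel:
    """Toy bigram language model with add-one smoothing."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for line in corpus:
            tokens = ["<s>"] + line.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab = len(self.unigrams)

    def logprob(self, sentence):
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.vocab))
            for a, b in zip(tokens, tokens[1:])
        )


def rank(annotated, model):
    """Return the expansion with the highest language-model score."""
    return max(expand(annotated), key=model.logprob)


# Made-up training data, for illustration only.
MODEL = BigramModel([
    "this is a test of the machine translation system",
    "machine translation is useful",
])

print(rank("This is a {test|proof} of the {automatic translation|machine translation} system", MODEL))
```

Enumerating all combinations grows exponentially with the number of alternative groups, so a real ranker would more likely use a beam or Viterbi-style search over the lattice.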

Issues:

Speed: language models and phrase tables are slow, but we can discard most of the phrase table.[1] Apertium translates at ~3,000 words/sec, and we'll want to see how we can keep translating at ~1,000 words/sec or more. We can probably[2] make do with a much smaller language model than SMT requires, partly because we have more reason to expect that the output is already correct, regardless of the statistics. For example, we can assume that the translation in the bilingual dictionary is already the most frequent or most general one.
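
As a rough illustration of how a phrase table might be shrunk, the sketch below keeps only the few most probable alternatives per phrase. Note that this is a simple threshold and top-k filter chosen for brevity, not the significance-based pruning of Johnson et al. (2007); the function name prune, the parameters and the probabilities are all made up for the example.

```python
# Hypothetical sketch of phrase-table pruning by probability threshold
# and top-k cut-off. This is NOT the method of Johnson et al. (2007).

def prune(phrase_table, min_prob=0.1, top_k=2):
    """Keep at most top_k alternatives per phrase, each with p >= min_prob."""
    pruned = {}
    for source, candidates in phrase_table.items():
        kept = sorted(
            ((target, p) for target, p in candidates.items() if p >= min_prob),
            key=lambda item: item[1],
            reverse=True,
        )[:top_k]
        if kept:  # drop phrases with no surviving alternatives
            pruned[source] = dict(kept)
    return pruned


# Made-up probabilities, for illustration only.
TABLE = {
    "automatic translation": {
        "machine translation": 0.7,
        "automated translation": 0.2,
        "auto translation": 0.05,
    },
    "test": {"proof": 0.04},
}

print(prune(TABLE))
```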

Why this is good:

For languages like Welsh and Breton, where there is some parallel text but not a lot, we can take advantage of what exists. We can also benefit from user improvements without having to change the language pair directly.

Multi-engine pipeline

This would add two or three extra programs to the pipeline. The first (apertium-collector) would probably sit after the deformatter and would 'collect' segments naïvely, clipping on <sent> and passing them on to the other MT engines. The second (apertium-combiner) would sit after post-generation, take the Apertium output together with the translations obtained via apertium-collector, and synthesise the best possible translation.
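
The naïve segment collection could look something like the sketch below. It is only an assumption about what "clipping on <sent>" might mean before morphological analysis has run: the function name collect_segments is hypothetical, and sentence boundaries are approximated by ., ! or ? followed by whitespace.

```python
# Hypothetical sketch of naive segment collection; not the actual
# apertium-collector.
import re


def collect_segments(text):
    """Split text into segments after sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(collect_segments("Bore da. Sut wyt ti? Da iawn!"))
```

Each collected segment would then be sent to the other engines (for example Moses) while the Apertium pipeline continues on the same text.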

Sample
  • Input sentence: Fe fyddai un o bob pedwar o rieni Cymru yn caniatáu i'w plant gael ffôn symudol cyn iddyn nhw fod yn 10 oed, yn ôl arolwg.
    • Apertium: one would be of each four parents Wales allowing to its children get mobile phone before to them be in 10 years old, according to survey. (69% WER)
    • Moses: i would be one in four of parents in wales to allow it to children before they get mobile phone them that is 10 years , ago arolwg. (75% WER)
  • Best synthesis: Would be one in four parents in Wales allowing to its children get mobile phone before them be 10 years old, according to survey.[3] (58% WER)
  • Reference: One in four parents in Wales would allow their children to get a mobile phone before the age of 10, according to a survey.
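
For reference, word error rate (WER) is the word-level edit distance between a hypothesis and the reference, divided by the number of reference words. The sketch below is a minimal version that assumes whitespace tokenisation, case-sensitive matching and a non-empty reference; the figures quoted above were calculated by hand and may rest on different tokenisation choices, so it is not claimed to reproduce them.

```python
# Minimal WER sketch: Levenshtein distance over whitespace-separated
# tokens, normalised by reference length (assumed non-empty).

def wer(hypothesis, reference):
    hyp = hypothesis.split()
    ref = reference.split()
    # previous[j] = edit distance between ref[:i-1] and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the hypothesis
                current[j - 1] + 1,      # extra word in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


print(wer("a x c", "a b c"))
```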

The format would be something like:

$ echo "Fe fyddai un o bob <em>pedwar</em> o rieni Cymru yn caniatáu i'w plant gael ffôn symudol cyn iddyn nhw fod yn 10 oed, yn ôl arolwg." | \
  apertium-destxt | apertium-collector -e moses | lt-proc ... | cg-proc ...rest of pipeline... | lt-proc -g ... | apertium-combiner | \
  apertium-ranker en.blm

Notes

  1. Johnson et al. (2007)
  2. This is an assumption that we should probably test
  3. Calculated by hand

References

  • Johnson, J.H., Martin, J., Foster, G., and Kuhn, R. (2007) "Improving Translation Quality by Discarding Most of the Phrasetable". Proceedings of EMNLP. 2007. NRC 49348.