Parallel corpus pruning

From Apertium
Revision as of 13:47, 18 March 2009 by Francis Tyers

This page describes some ideas for pruning parallel corpora so as to retain the most literally translated aligned segments. One of the obstacles to inducing effective transfer rules from parallel corpora is a lack of literalness in translation. For example, the two sides of an aligned segment may not have been translated from one another, but both from a single third language:

  • Una de las personas que recientemente han asesinado en Sri Lanka ha sido al Sr . Kumar Ponnambalam , quien hace pocos meses visitó el Parlamento Europeo .
  • Una delle vittime più recenti è stato Kumar Ponnambalam , che qualche mese fa era venuto in visita qui al Parlamento europeo .
  • Uma das pessoas recentemente assassinadas foi o senhor Kumar Ponnambalam, que ainda há poucos meses visitara o Parlamento Europeu.

When making rules, we are more interested in "literal" translations, for example:

  • fo: 1 Í upphavi skapti Gud himmal og jørð.
  • is: 1 Í upphafi skapaði Guð himin og jörð.
"In beginning-DEF created God heavens and earth"

This is good: there is a one-to-one correspondence and the words are in the same order. Things are less straightforward when words are moved around for stylistic reasons, for example:

  • fo: 7 Gud gjørdi tá hvølvið og skilti vatnið undir hvølvinum frá vatninum yvir hvølvinum. Og so varð.
God made then firmament-DEF and divided waters-DEF under firmament-DEF from waters-DEF above firmament-DEF: and so was.
  • is: 7 Þá gjörði Guð festinguna og greindi vötnin sem voru undir festingunni frá þeim vötnum sem voru yfir henni. Og það varð svo.
Then made God firmament-DEF and divided waters-DEF which were under firmament-DEF from those waters which were above it: And it was so.

The subject "Guð" is moved, and the Icelandic uses relative clauses ("sem voru") where the Faroese does not.

An equally valid, but more literal translation in Icelandic would be:

  • is: 7 Guð gjörði þá festinguna og greindi vötnin undir festingunni frá vötnum yfir henni. Og svo varð.

This has exactly the same word order as the Faroese.
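The two properties used informally above (one-to-one correspondence and identical word order) can be checked mechanically once a word alignment is available. The following is only a minimal sketch: it assumes an alignment is given as a list of (source index, target index) pairs, and the function names and the hand-written links for the two verses are illustrative assumptions, not output from any real aligner.

```python
from itertools import combinations


def is_one_to_one(links, src_len, tgt_len):
    """True if every source and every target word is linked exactly once."""
    src_seen = sorted(i for i, _ in links)
    tgt_seen = sorted(j for _, j in links)
    return src_seen == list(range(src_len)) and tgt_seen == list(range(tgt_len))


def count_crossings(links):
    """Number of pairs of links whose relative order differs on the two sides."""
    return sum(
        1
        for (i1, j1), (i2, j2) in combinations(links, 2)
        if (i1 - i2) * (j1 - j2) < 0
    )


# Hand-written links (an assumption for illustration), punctuation ignored.
# Verse 1: seven words on each side, word k aligned to word k.
verse_1 = [(k, k) for k in range(7)]
# Opening of verse 7: fo "Gud gjørdi tá" against is "Þá gjörði Guð".
verse_7_opening = [(0, 2), (1, 1), (2, 0)]

print(is_one_to_one(verse_1, 7, 7), count_crossings(verse_1))  # True 0
print(is_one_to_one(verse_7_opening, 3, 3), count_crossings(verse_7_opening))  # True 3
```

A segment with no crossings and a one-to-one alignment corresponds to the "good" case of verse 1; the reordered opening of verse 7 is still one-to-one but every pair of links crosses.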

Possible ways of pruning

  • One way would be to discard phrases which cannot be produced by the MT system (presumably judged in terms of lemma matches).
  • Look at the ratio of unaligned words. The higher the ratio, the more likely it is that the translation is "freer".
  • Look at the fertility (between closely related languages it should be lower).
  • ...
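The unaligned-word ratio and the fertility check listed above could be sketched roughly as follows. This is a hedged illustration rather than an existing Apertium tool: it assumes alignments are lists of (source index, target index) pairs, takes "fertility" in the usual word-alignment sense (the number of target words linked to one source word), and the function names and threshold values are hypothetical and would need tuning per language pair.

```python
from collections import Counter


def unaligned_ratio(links, src_len, tgt_len):
    """Fraction of words, counted over both sides, that take part in no link."""
    total = src_len + tgt_len
    if total == 0:
        return 0.0
    linked_src = {i for i, _ in links}
    linked_tgt = {j for _, j in links}
    return (total - len(linked_src) - len(linked_tgt)) / total


def fertilities(links, src_len):
    """For each source word, the number of distinct target words linked to it."""
    counts = Counter(i for i, _ in set(links))
    return [counts[i] for i in range(src_len)]


def looks_literal(links, src_len, tgt_len, max_unaligned=0.2, max_fertility=1):
    """Keep a segment only if few words are unaligned and fertility stays low.

    Both thresholds are made-up defaults for illustration.
    """
    return (
        unaligned_ratio(links, src_len, tgt_len) <= max_unaligned
        and max(fertilities(links, src_len), default=0) <= max_fertility
    )


# A fully aligned, one-to-one segment of seven words is kept.
print(looks_literal([(k, k) for k in range(7)], 7, 7))  # True
# A segment with unaligned words and a fertility of 2 is discarded.
print(looks_literal([(0, 0), (1, 1), (1, 2)], 3, 4))  # False
```

For closely related pairs such as Faroese and Icelandic, a strict fertility limit of 1 seems a plausible starting point; for more distant pairs the limits would presumably have to be relaxed.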