Parallel corpus pruning

From Apertium
Revision as of 13:13, 18 March 2009 by Francis Tyers (talk | contribs)

This page describes some ideas for pruning parallel corpora to retrieve the most literally translated aligned segments. One of the obstacles to inducing effective transfer rules from parallel corpora is a lack of literalness in translation. For example, the two sides of a given aligned segment may not have been translated from one another at all; both may instead have been translated from a single third source language.

  • Una de las personas que recientemente han asesinado en Sri Lanka ha sido al Sr . Kumar Ponnambalam , quien hace pocos meses visitó el Parlamento Europeo .
  • Una delle vittime più recenti è stato Kumar Ponnambalam , che qualche mese fa era venuto in visita qui al Parlamento europeo .
  • Uma das pessoas recentemente assassinadas foi o senhor Kumar Ponnambalam, que ainda há poucos meses visitara o Parlamento Europeu.

When making rules, we are more interested in "literal" translations. For example:

  • fo: 1 Í upphavi skapti Gud himmal og jørð.
  • is: 1 Í upphafi skapaði Guð himin og jörð.

This pair is good: there is a one-to-one correspondence between the words, and they appear in the same order. If words are moved around for stylistic reasons, for example

  • fo: 7 Gud gjørdi tá hvølvið og skilti vatnið undir hvølvinum frá vatninum yvir hvølvinum. Og so varð.
  • is: 7 Þá gjörði Guð festinguna og greindi vötnin sem voru undir festingunni frá þeim vötnum sem voru yfir henni. Og það varð svo.

where the subject "Guð" is moved (among other differences), the segment is less useful.
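The two properties mentioned above (one-to-one correspondence and identical word order) can be checked mechanically once word alignments are available. The following is only a minimal sketch: it assumes each segment comes with a list of (source index, target index) alignment links from some word aligner, and the function names `is_one_to_one` and `count_crossings` are hypothetical, not part of Apertium.

```python
from collections import Counter


def is_one_to_one(links):
    """True if no source or target position takes part in more than one link."""
    src_counts = Counter(s for s, _ in links)
    tgt_counts = Counter(t for _, t in links)
    return (all(c == 1 for c in src_counts.values())
            and all(c == 1 for c in tgt_counts.values()))


def count_crossings(links):
    """Count pairs of links that cross, i.e. whose order differs on the two sides."""
    ordered = sorted(links)
    crossings = 0
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            (s1, t1), (s2, t2) = ordered[i], ordered[j]
            if s1 < s2 and t1 > t2:
                crossings += 1
    return crossings


# Assumed alignment for the seven words of the first fo/is pair above:
# every word links to the word in the same position.
literal_links = [(i, i) for i in range(7)]

# A made-up alignment in which the first two words swap places.
reordered_links = [(0, 1), (1, 0), (2, 2)]
```

With these assumed links, `literal_links` is one-to-one with no crossings, while `reordered_links` has one crossing. A segment with many crossings, or with words linked more than once, would be a candidate for pruning.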

Possible ways of pruning

  • One way would be to discard phrases that the MT system cannot produce (presumably judged in terms of lemma matches).
  • Look at the ratio of unaligned words. The higher the ratio, the more likely it is that the translation is "free".
  • ...
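The unaligned-word idea in the list above could be sketched roughly as follows. This is an illustration under assumptions, not an existing Apertium tool: segments are assumed to be (source length, target length, alignment links) triples, the names `unaligned_ratio` and `prune` are hypothetical, and the threshold of 0.2 is an arbitrary placeholder that would need tuning on real data.

```python
def unaligned_ratio(src_len, tgt_len, links):
    """Fraction of words, counted over both sides, that take part in no link."""
    total = src_len + tgt_len
    if total == 0:
        return 0.0
    aligned_src = {s for s, _ in links}
    aligned_tgt = {t for _, t in links}
    unaligned = (src_len - len(aligned_src)) + (tgt_len - len(aligned_tgt))
    return unaligned / total


def prune(segments, max_ratio=0.2):
    """Keep segments whose unaligned ratio is at most max_ratio (assumed threshold)."""
    return [seg for seg in segments
            if unaligned_ratio(seg[0], seg[1], seg[2]) <= max_ratio]


# Made-up data: a fully aligned segment and one where half the words are unaligned.
segments = [
    (3, 3, [(0, 0), (1, 1), (2, 2)]),
    (4, 4, [(0, 0), (1, 1)]),
]
kept = prune(segments)
```

Here only the first, fully aligned segment survives. Counting unaligned words over both sides treats omissions and additions symmetrically; a per-side ratio would be an equally plausible choice.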