Talk:Parallel corpus pruning

From Apertium
Revision as of 16:03, 20 March 2009 by Jimregan (talk | contribs) (you want papers, you got papers)
  • One of the ways would be to discard phrases which can't be produced by the MT system (presumably in terms of lemma matches).

Surely the easiest way to determine this is to invoke the MT system?

Alternatively, apertium-transfer-tools has a mechanism for pruning alignments based on lemma matches (along with a means of specifying stop words).

Yeah, I think that is what Felipe was talking about. If you can write more about it, do... save me cracking out the papers. ;) - Francis Tyers 17:05, 18 March 2009 (UTC)
Felipe Sánchez-Martínez. Using unsupervised corpus-based methods to build rule-based machine translation systems. PhD thesis, June 2008, Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, Spain. PDF
Chapter 5, particularly 'TL Restrictions'
Felipe Sánchez-Martínez, Mikel L. Forcada. Automatic induction of shallow-transfer rules for open-source machine translation. In Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation (TMI 2007), pp. 181-190, September 7-9, 2007, Skövde, Sweden. PDF
See 'Filtering of the Alignment Templates' in Section 5.
Enough? -- Jimregan 16:03, 20 March 2009 (UTC)
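
For anyone who wants something concrete to poke at, here is a rough sketch of lemma-match pruning as discussed in this thread. It is not the apertium-transfer-tools implementation: the `bidix` dictionary (source lemma to a set of target lemmas), the function names, and the `threshold` value are all made up for illustration.

```python
def lemma_match_ratio(src_lemmas, tgt_lemmas, bidix, stop_words=frozenset()):
    """Fraction of non-stop-word source lemmas with a dictionary
    translation that appears among the target lemmas.

    bidix is a hypothetical stand-in for a bilingual dictionary:
    {source_lemma: {target_lemma, ...}}.
    """
    content = [lemma for lemma in src_lemmas if lemma not in stop_words]
    if not content:
        # Nothing to check, so there is no evidence against the pair.
        return 1.0
    tgt = set(tgt_lemmas)
    matched = sum(1 for lemma in content if bidix.get(lemma, set()) & tgt)
    return matched / len(content)


def keep_pair(src_lemmas, tgt_lemmas, bidix, stop_words=frozenset(), threshold=0.5):
    """Keep a phrase pair if enough of its source lemmas are matched.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return lemma_match_ratio(src_lemmas, tgt_lemmas, bidix, stop_words) >= threshold
```

With a toy dictionary `{"gato": {"cat"}, "negro": {"black"}}` and `"el"` as a stop word, the pair (el gato negro, the black cat) is kept and (el gato negro, the dog) is discarded.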
  • Look at the ratio of unaligned words. The higher the ratio, the more likely the translation is "freer".

It might be better to first consider P(e|f) of the POS alignments, before pruning based on ratio, to not discard lexicalised phrases.

For example: 'copula noun le+prn.obj' -> 'prn.subj verb' in Irish->English would align 1-0 2-0 3-0 4-1, but would have a very high frequency. -- Jimregan 16:14, 18 March 2009 (UTC)

It's a good idea to take into account frequency too. - Francis Tyers 17:05, 18 March 2009 (UTC)
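
A rough sketch of how the two suggestions in this thread might be combined: compute the unaligned-word ratio, but let a POS pattern through if its relative frequency P(e|f) is high enough. Everything here is an assumption for illustration: the input format (tuples of POS tags plus 0-based source-target link pairs, which differs from the notation in the example above), the function names, and both thresholds.

```python
from collections import Counter


def unaligned_ratio(n_src, n_tgt, links):
    """Share of words (both sides together) that take part in no link.

    links is a list of (src_index, tgt_index) pairs, 0-based.
    """
    total = n_src + n_tgt
    if total == 0:
        return 0.0
    aligned_src = {s for s, _ in links}
    aligned_tgt = {t for _, t in links}
    return (total - len(aligned_src) - len(aligned_tgt)) / total


def prune(pairs, max_ratio=0.5, min_prob=0.5):
    """Filter (src_pos, tgt_pos, links) triples.

    A pair survives if P(tgt_pos | src_pos), estimated by relative
    frequency over the input, is at least min_prob, or failing that
    if its unaligned ratio is at most max_ratio.
    """
    pair_counts = Counter((s, t) for s, t, _ in pairs)
    src_counts = Counter(s for s, _, _ in pairs)
    kept = []
    for s, t, links in pairs:
        prob = pair_counts[(s, t)] / src_counts[s]
        if prob >= min_prob or unaligned_ratio(len(s), len(t), links) <= max_ratio:
            kept.append((s, t, links))
    return kept
```

Note that a source pattern seen only once always gets P(e|f) = 1 under this estimate, so in practice a minimum count would probably be needed as well.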