Difference between revisions of "Talk:Parallel corpus pruning"

From Apertium

Revision as of 16:14, 18 March 2009

  • One of the ways would be to discard phrases which can't be produced by the MT system (presumably in terms of lemma matches).

Surely the easiest way to determine this is to invoke the MT system?

Alternatively, apertium-transfer-tools has a mechanism for pruning alignments based on lemma matches, along with a means of specifying stop words.
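
To make the idea of a lemma-match check concrete, here is a minimal sketch. It is not the apertium-transfer-tools implementation: the function name `lemma_match`, the dictionary layout `bidix` (source lemma mapped to a set of target lemmas), and the toy Spanish-English data are all assumptions made for illustration.

```python
def lemma_match(src_lemmas, tgt_lemmas, bidix, stop_words=frozenset()):
    """Return True if every non-stop-word source lemma has at least one
    dictionary translation among the target lemmas.

    bidix is a hypothetical mapping {source lemma: set of target lemmas}.
    """
    tgt = set(tgt_lemmas)
    for lemma in src_lemmas:
        if lemma in stop_words:
            continue  # stop words are ignored when matching
        if not (bidix.get(lemma, set()) & tgt):
            return False  # no known translation appears on the target side
    return True


# Toy data, invented for this sketch.
bidix = {"gato": {"cat"}, "negro": {"black"}}
stop_words = {"el"}

kept = lemma_match(["el", "gato", "negro"], ["the", "black", "cat"], bidix, stop_words)
dropped = lemma_match(["el", "gato", "negro"], ["the", "dog"], bidix, stop_words)
```

Here `kept` is true and `dropped` is false. A stricter or looser criterion (for example, requiring only a fraction of the lemmas to match) would be an equally plausible design.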

  • Look at the ratio of unaligned words. The higher the ratio, the more likely the translation is "freer".

It might be better to first consider P(e|f) of the POS alignments before pruning on the ratio, so as not to discard lexicalised phrases.

For example: 'copula noun le+prn.obj' -> 'prn.subj verb' in Irish->English would align as 1-0 2-0 3-0 4-1, leaving most words unaligned, yet the pattern would have a very high frequency. -- Jimregan 16:14, 18 March 2009 (UTC)
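
The suggestion above could be sketched roughly as follows: estimate P(e|f) over POS patterns by relative frequency, and only fall back to the unaligned-word ratio when the pattern is rare. This is an illustration under stated assumptions, not an existing Apertium tool: the function names, the thresholds `max_ratio` and `min_prob`, and the toy counts are invented, the alignment string is assumed to be 1-based with target index 0 meaning "unaligned", and the ratio is computed on the source side only.

```python
from collections import Counter


def unaligned_ratio(alignment):
    """Fraction of source positions linked only to 0 (assumed to mean NULL)."""
    links = [tuple(map(int, link.split("-"))) for link in alignment.split()]
    src_positions = {s for s, _ in links}
    aligned = {s for s, t in links if t != 0}
    return 1 - len(aligned) / len(src_positions)


def pos_translation_probs(pairs):
    """Relative-frequency estimate of P(e|f) from a list of
    (source POS pattern, target POS pattern) tuples."""
    pair_counts = Counter(pairs)
    src_counts = Counter(f for f, _ in pairs)
    return {(f, e): c / src_counts[f] for (f, e), c in pair_counts.items()}


def keep(f_pos, e_pos, alignment, probs, max_ratio=0.5, min_prob=0.5):
    """Keep frequent POS patterns regardless of ratio; otherwise prune on ratio."""
    if probs.get((f_pos, e_pos), 0.0) >= min_prob:
        return True
    return unaligned_ratio(alignment) <= max_ratio


# Toy counts, invented for this sketch: the pattern from the example
# above is seen 9 times out of 10 for this source POS sequence.
F = "copula noun le+prn.obj"
E = "prn.subj verb"
probs = pos_translation_probs([(F, E)] * 9 + [(F, "verb")])

ratio = unaligned_ratio("1-0 2-0 3-0 4-1")
kept = keep(F, E, "1-0 2-0 3-0 4-1", probs)
dropped = keep(F, "verb", "1-0 2-0 3-0 4-1", probs)
```

With these toy numbers the unaligned ratio is 0.75, so ratio-based pruning alone would discard the pair, while checking P(e|f) first keeps it.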