Talk:Why we trim

From Apertium
Revision as of 12:58, 12 April 2013 by Unhammer

A possible way of keeping surface forms when splitting multiword expressions (MWEs): put the full surface form on the first part and an empty surface form on the rest. Example, assuming "magasin" is missing from the bidix:

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>/vann<n><nt><sg><ind><cmp>+magasin<n><nt><pl><ind>$
$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>$

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer
^vannmagasin/vann<n><nt><sg><ind><cmp>$ ^/magasin<n><nt><sg><ind>$
# Currently, apertium-pretransfer outputs ^magasin<n><nt><sg><ind>$ for the second part; we'd need it to output an empty surface form here instead
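
The split proposed here can be sketched in a few lines of Python. This is only an illustration of the idea, not how apertium-pretransfer is implemented: the function name `split_compound` is made up, and it assumes a single disambiguated analysis per lexical unit with no escaped `/` or `+` characters.

```python
import re
from typing import List

# Matches one lexical unit of the form ^surface/analysis$ (assumed format,
# one analysis only, no escaped characters).
LU_RE = re.compile(r"^\^(?P<surface>[^/]*)/(?P<analysis>.*)\$$")


def split_compound(lu: str) -> List[str]:
    """Split a lexical unit on '+', keeping the surface form on the first part only."""
    m = LU_RE.match(lu)
    if m is None:
        raise ValueError("not a lexical unit: %r" % lu)
    surface, analysis = m.group("surface"), m.group("analysis")
    if "/" in analysis:
        raise ValueError("expected a single (disambiguated) analysis: %r" % lu)
    parts = analysis.split("+")
    # The first part carries the whole original surface form; the rest get an empty one.
    return ["^%s/%s$" % (surface if i == 0 else "", part)
            for i, part in enumerate(parts)]


print(" ".join(split_compound(
    "^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>$")))
# prints: ^vannmagasin/vann<n><nt><sg><ind><cmp>$ ^/magasin<n><nt><sg><ind>$
```

A unit without `+` comes back unchanged as a one-element list, so the sketch only affects compounds.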

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin
^*vannmagasin/vann<n><nt><sg><ind><cmp>/vatn<n><nt><sg><ind><cmp>$ ^/magasin<n><nt><sg><ind>/@magasin<n><nt><sg><ind>$
# Currently, lt-proc -o only _accepts_ surface forms; it doesn't output them (nor does it output the @analysis correctly)

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin | apertium-transfer -o apertium-nn-nb.nb-nn.t1x nb-nn.t1x.bin 
^n_n<n><nt><sg><ind>{^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$}$
# apertium-transfer would need an -o option that is able to pass through the surface form

# Interchunk should need no change, since it doesn't deal with the insides of the chunk. 
# Postchunk might need a slight change to notice and output the original surface form.

# But at generation you have a problem. Say you end up with
# ^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$
# What do you output? 
# - If you generate "vatn", you can't generate "magasin" (that's just a lemma, and outputting plain 
#   lemmas can lead to altering or even negating meaning).
# - Ideally you want to generate the original surface form "vannmagasin", but how do you know that? 
#   Transfer could have moved about the @-analysis, and in any case, how many lexical units are we 
#   talking about here? An MWE can be of unlimited length.
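
To make the dilemma concrete, here is a sketch of one naive policy at generation: if any part of the group is untranslated (marked with `@`), give up on generating the parts and output the original surface form as an unknown word. Everything here is an assumption for illustration: `generate_chunk` and `toy_generate` are made-up names, the `*` prefix is just one way to flag the result, and the policy only works if the parts are still adjacent and in order, which (as noted above) transfer does not guarantee.

```python
import re

# Finds every ^surface/analysis$ unit in a string (assumed format, no escapes).
LU_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")


def generate_chunk(chunk, generate):
    """Generate a group of units, falling back to the source surface form
    if any unit in the group is an untranslated @-analysis."""
    units = LU_RE.findall(chunk)
    if any(analysis.startswith("@") for _, analysis in units):
        # Only the first unit has a non-empty surface form under this proposal,
        # so joining them recovers the original word.
        return "*" + "".join(surface for surface, _ in units)
    return "".join(generate(analysis) for _, analysis in units)


def toy_generate(analysis):
    # Stand-in for a real generator; it only tags the lemma so the output is visible.
    return "[gen:" + analysis.split("<", 1)[0] + "]"


print(generate_chunk(
    "^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$",
    toy_generate))
# prints: *vannmagasin
```

The sketch treats the whole input string as one group, which sidesteps the open question of how many lexical units belong to the same MWE.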

Difficult example from Kazakh:

килмәгәнмен: кил.1SG.NEG + мен, "I didn't arrive/show up", is both a compound and a morphological negative. So, if кил had no translation, outputting the lemma alone would severely alter the meaning.