Talk:Why we trim



Possible method for keeping surface forms and avoiding trimming

A possible way to keep surface forms when splitting multiword expressions (MWEs): put the full surface form on the first part and no surface form on the rest. For example, assuming "magasin" is missing from bidix:

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>/vann<n><nt><sg><ind><cmp>+magasin<n><nt><pl><ind>$
$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>$
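To make the notation above concrete, here is a minimal Python sketch that takes one lexical unit in the stream format and returns the surface form plus the compound parts of each reading. The name parse_unit is hypothetical, and the sketch ignores escaped characters and superblanks, so it is only meant for reading the examples on this page.

```python
import re

def parse_unit(unit: str):
    """Split '^surface/reading1/reading2$' into (surface, analyses).

    Each analysis is a list of (lemma, tags) pairs, one per '+'-joined part.
    Simplification: escaped characters and '#' queues are not handled.
    """
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError(f"not a lexical unit: {unit!r}")
    surface, *readings = body[1:-1].split("/")
    analyses = []
    for reading in readings:
        parts = []
        for part in reading.split("+"):
            lemma = part.split("<", 1)[0]
            tags = re.findall(r"<([^>]+)>", part)
            parts.append((lemma, tags))
        analyses.append(parts)
    return surface, analyses
```

Applied to the untagged analyser output above, this gives the surface form "vannmagasin" and two readings that differ only in the number tag of the second part.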

From here on there are two possibilities: try to keep the surface form until bidix, or try to keep it all the way until generation.

  • If we keep it until bidix, we can output the surface form whenever a word is not in bidix, and so avoid trimming monodix to bidix. We can still end up with generation errors where bidix contains words that are not in generation.dix, but it's better than nothing.
  • If we could keep it until generation, we could output the original surface form even if none of the dictionaries were trimmed.


Keeping the surface form until bidix

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer -o
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>$
# apertium-pretransfer would currently output this as two units, needs a slight change
$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer -o | lt-proc -o nb-nn.autobil.bin
^vannmagasin/@vannmagasin$
$ echo vannkoker   | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer -o | lt-proc -o nb-nn.autobil.bin
^vann<n><nt><sg><ind><cmp>/vatn<n><nt><sg><ind><cmp>$ ^koker<n><m><sg><ind>/kokar<n><m><sg><ind>$

This would avoid the need to trim monodix to bidix, and we'd still have surface forms for words that are not in bidix, or even partially not in bidix. We'd also avoid outputting lemmas.

Generation errors would still be a problem, but since the generator is built from a monodix, it will in general have larger coverage than bidix anyway.


Currently pretransfer does this to multiwords with both + and #:

^sf/lm<v>+lm1<n>#queue$ → ^sf/lm#queue<v>$ ^lm1<n>$
^sf/lm<v>#queue+lm1<n>$ → ^sf/lm#queue<v>$ ^lm1<n>$

We would want it to do this:

^sf/lm<v>+lm1<n>#queue$ → ^sf/lm#queue<v>+lm1<n>$
^sf/lm<v>#queue+lm1<n>$ → ^sf/lm#queue<v>+lm1<n>$

and then let lt-proc output ^sf/@sf$ if either lm#queue<v> or lm1<n> is unknown, and otherwise ^lm#queue<v>/lmf#queuef<v>$ ^lm1<n>/lm1f<n>$
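The proposed behaviour can be sketched in Python as two steps: move the queue onto the first lemma while keeping the parts joined, then look up every part and fall back to the surface form if any part is missing. The names move_queue, lookup_joined and TOY_BIDIX are hypothetical. The dictionary is a toy stand-in that matches whole analyses exactly, whereas a real bidix lookup matches a lemma plus a prefix of its tags. This is not how pretransfer or lt-proc are implemented.

```python
import re

def move_queue(analysis: str) -> str:
    """Attach a '#queue' to the first lemma, leaving the '+' joins in place."""
    m = re.search(r"#[^<+]*", analysis)
    if m is None:
        return analysis
    queue = m.group(0)
    rest = analysis[:m.start()] + analysis[m.end():]
    first_tag = rest.find("<")
    if first_tag == -1:
        return rest + queue
    return rest[:first_tag] + queue + rest[first_tag:]

# Toy bidix: whole source analysis -> whole target analysis (assumption).
TOY_BIDIX = {
    "vann<n><nt><sg><ind><cmp>": "vatn<n><nt><sg><ind><cmp>",
    "koker<n><m><sg><ind>": "kokar<n><m><sg><ind>",
}

def lookup_joined(surface: str, analysis: str, bidix: dict) -> str:
    """Translate all '+'-joined parts, or emit ^sf/@sf$ if any part is unknown."""
    parts = analysis.split("+")
    if any(part not in bidix for part in parts):
        return f"^{surface}/@{surface}$"
    return " ".join(f"^{part}/{bidix[part]}$" for part in parts)
```

With this toy dictionary, "vannkoker" comes out as two translated units and "vannmagasin" comes out as ^vannmagasin/@vannmagasin$, matching the two outputs shown earlier in this section.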

Keeping the surface form until generation

Say you do:

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer
^vannmagasin/vann<n><nt><sg><ind><cmp>$ ^/magasin<n><nt><sg><ind>$
# Currently, apertium-pretransfer outputs ^magasin<n><nt><sg><ind>$, we'd need it to ensure an empty surface form here
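The split wanted here (full surface form on the first part, empty surface form on the rest) could look roughly like the following Python sketch. The function name split_keep_surface is hypothetical, and the sketch assumes a single tagged reading with no queue and no escaped characters.

```python
def split_keep_surface(unit: str) -> list[str]:
    """Split a '+'-joined unit; only the first part keeps the surface form."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError(f"not a lexical unit: {unit!r}")
    surface, analysis = body[1:-1].split("/", 1)
    parts = analysis.split("+")
    first = f"^{surface}/{parts[0]}$"
    rest = [f"^/{part}$" for part in parts[1:]]  # empty surface form
    return [first] + rest
```

Joining the result with spaces reproduces the output shown above for "vannmagasin".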

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin
^*vannmagasin/vann<n><nt><sg><ind><cmp>/vatn<n><nt><sg><ind><cmp>$ ^/magasin<n><nt><sg><ind>/@magasin<n><nt><sg><ind>$
# Currently, lt-proc -o only _accepts_ surface forms; it doesn't output them (nor does it output @analysis correctly)

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin | apertium-transfer -o apertium-nn-nb.nb-nn.t1x nb-nn.t1x.bin 
^n_n<n><nt><sg><ind>{^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$}$
# apertium-transfer would need an -o option that is able to pass through the surface form

# Interchunk should need no change, since it doesn't deal with the insides of the chunk. 
# Postchunk might need a slight change to notice and output the original surface form.

# But by generation you have a problem. Say you end up with
# ^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$
# What do you output? 
# - If you generate "vatn", you can't generate "magasin" (that's just a lemma, and outputting plain 
#   lemmas can lead to altering or even negating meaning).
# - Ideally you want to generate the original surface form "vannmagasin", but how do you know that? 
#   Transfer could have moved about the @-analysis, and in any case, how many lexical units are we 
#   talking about here? An MWE can be of unlimited length.
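One naive policy, sketched in Python below, is to treat a unit with a surface form as the start of a group, attach the following empty-surface units to it, and output the group's surface form if any member is an @-analysis. This is only an illustration of the problem, not a proposal that is known to work. The names generate_with_fallback and lemma_stub are hypothetical, and lemma_stub just returns the lemma in place of a real generator.

```python
import re

# One lexical unit: ^surface/analysis$ (surface may be empty).
UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")

def lemma_stub(analysis: str) -> str:
    """Stand-in for a real generator: returns everything before the first tag."""
    return analysis.split("<", 1)[0]

def generate_with_fallback(stream: str, generate=lemma_stub) -> str:
    groups = []  # each group is [surface, [analysis, ...]]
    for surface, analysis in UNIT.findall(stream):
        if surface or not groups:
            groups.append([surface, [analysis]])
        else:
            groups[-1][1].append(analysis)
    out = []
    for surface, analyses in groups:
        unknown = any(a.startswith("@") for a in analyses)
        if unknown and surface:
            out.append(surface)  # fall back to the original surface form
        else:
            # No surface form to fall back on, or nothing unknown.
            out.append("".join(generate(a) for a in analyses))
    return " ".join(out)
```

On the stream above this outputs "vannmagasin". It breaks as soon as transfer reorders the units: if the @-analysis ends up before the unit that carries the surface form, the grouping no longer recovers which surface form it belonged to.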

A difficult example from Kazakh:

килмәгәнмен: кил.1SG.NEG + мен, "I didn't arrive/show up", is both a compound and a morphological negative. So, if кил had no translation, outputting the lemma alone would drop the negation and severely alter the meaning.