Talk:Why we trim
A possible way of keeping surface forms when splitting MWEs: put the full surface form on the first part and no surface form on the rest. Example, assuming "magasin" is missing from the bidix:
$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>/vann<n><nt><sg><ind><cmp>+magasin<n><nt><pl><ind>$

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>$

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer
^vannmagasin/vann<n><nt><sg><ind><cmp>$ ^/magasin<n><nt><sg><ind>$
# Currently, apertium-pretransfer outputs ^magasin<n><nt><sg><ind>$, we'd need it to ensure an empty surface form here

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin
^*vannmagasin/vann<n><nt><sg><ind><cmp>/vatn<n><nt><sg><ind><cmp>$ ^/magasin<n><nt><sg><ind>/@magasin<n><nt><sg><ind>$
# Currently, lt-proc -o only _accepts_ surface forms, it doesn't output them (nor does it output @analysis correctly)

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin | apertium-transfer -o apertium-nn-nb.nb-nn.t1x nb-nn.t1x.bin
^n_n<n><nt><sg><ind>{^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$}$
# apertium-transfer would need an -o option that is able to pass through the surface form
# Interchunk should need no change, since it doesn't deal with the insides of the chunk.
# Postchunk might need a slight change to notice and output the original surface form.
# But by generation you have a problem. Say you end up with
# ^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$
# What do you output?
# - If you generate "vatn", you can't generate "magasin" (that's just a lemma, and outputting plain
#   lemmas can lead to altering or even negating meaning).
# - Ideally you want to generate the original surface form "vannmagasin", but how do you know that?
#   Transfer could have moved about the @-analysis, and in any case, how many lexical units are we
#   talking about here? An mwe can be of unlimited length
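To make the proposed splitting step concrete (full surface form on the first part, empty surface form on the rest), here is a minimal Python sketch. It is not how apertium-pretransfer is implemented; the function name `split_compound` is hypothetical, and it assumes a single disambiguated analysis whose parts are joined by a "+" that directly follows a closing ">" tag.

```python
import re


def split_compound(lexical_unit: str) -> str:
    """Split '^surface/part1<tags>+part2<tags>$' into separate lexical units.

    Hypothetical helper: the first part keeps the full surface form,
    every later part gets an empty surface form.
    """
    match = re.fullmatch(r"\^([^/]*)/(.*)\$", lexical_unit)
    if match is None:
        raise ValueError(f"not a lexical unit: {lexical_unit!r}")
    surface, analysis = match.groups()
    if "/" in analysis:
        # Assumption: the tagger has already picked exactly one analysis.
        raise ValueError("expected exactly one analysis (run the tagger first)")

    # Assumption: only a '+' right after a closing tag joins compound parts,
    # so a '+' inside a lemma is left alone.
    parts = re.split(r"(?<=>)\+", analysis)

    units = [f"^{surface}/{parts[0]}$"]
    units += [f"^/{part}$" for part in parts[1:]]
    return " ".join(units)


if __name__ == "__main__":
    print(split_compound(
        "^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>$"
    ))
```

For the tagger output shown above this yields the two units from the pretransfer step, with "vannmagasin" kept only on the first one. It does nothing about the generation problem: once transfer has reordered the units, the empty surface forms no longer say which unit the original surface belonged to.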
Difficult example from Kazakh:
килмәгәнмен: кил.1SG.NEG + мен, "I didn't show up", is both a compound and a morphological negative (so outputting the lemma "come" alone severely alters the meaning)