Difference between revisions of "Talk:Why we trim"
Revision as of 16:21, 12 April 2013
Possible method for keeping surface forms and avoiding trimming
A possible way of keeping surface forms when splitting mwe's: put the full surface form on the first part and no surface form on the rest. For example, assuming "magasin" is missing from bidix:
$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>/vann<n><nt><sg><ind><cmp>+magasin<n><nt><pl><ind>$
$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>$
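The "full surface on the first part, no surface on the rest" idea could be sketched roughly as below. This is only an illustration in Python, not how apertium-pretransfer is implemented; the function name split_compound is made up, and it assumes one already-disambiguated analysis with no escaped "/", "+", "^" or "$" characters.

```python
from typing import List


def split_compound(lexical_unit: str) -> List[str]:
    """Split '^surface/part1+part2$' into one lexical unit per part.

    The first part keeps the full surface form; every later part gets an
    empty surface form. Hypothetical helper, not part of any Apertium tool.
    """
    body = lexical_unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError(f"not a lexical unit: {lexical_unit!r}")
    surface, analysis = body[1:-1].split("/", 1)
    if "/" in analysis:
        # More than one analysis left: the tagger has not been run yet.
        raise ValueError("expected exactly one analysis")
    parts = analysis.split("+")
    return [
        f"^{surface if index == 0 else ''}/{part}$"
        for index, part in enumerate(parts)
    ]


if __name__ == "__main__":
    tagged = "^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>$"
    print(" ".join(split_compound(tagged)))
```

Run on the tagger output above, this gives one unit carrying "vannmagasin" and one unit with an empty surface form.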
From here on there are two possibilities: try to keep the surface form until bidix, or try to keep it all the way until generation.
- If we keep it until bidix, we can output the surface form when the word is not in bidix, and avoid trimming monodix to bidix. We can still end up with generation errors where bidix contains words not in generation.dix, but it's better than nothing.
- If we could keep it until generation, we could output the original surface form even if none of the dictionaries were trimmed.
Keeping the surface form until bidix
We skip apertium-pretransfer, and do the job of pretransfer in lt-proc. Given a multiword, it has to output the whole multiword with an @ if any one part of it is unknown; otherwise it outputs each analysis split like pretransfer would.
$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin
^vannmagasin/@vannmagasin$
$ echo vannkoker | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin
^vann<n><nt><sg><ind><cmp>/vatn<n><nt><sg><ind><cmp>$ ^koker<n><m><sg><ind>/kokar<n><m><sg><ind>$
(Perhaps a version of apertium-pretransfer should still run to deal with <g> / #, just not <j> / + mwe's.)
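The all-or-nothing lookup described in this section might look something like the following Python sketch. TOY_BIDIX and lookup_multiword are invented for illustration, and the dictionary is keyed on the full lemma-plus-tags string, which is a simplification of how a real bidix matches entries.

```python
from typing import Dict

# Toy bilingual dictionary (hypothetical data): "magasin" is deliberately
# missing, as in the example above.
TOY_BIDIX: Dict[str, str] = {
    "vann<n><nt><sg><ind><cmp>": "vatn<n><nt><sg><ind><cmp>",
    "koker<n><m><sg><ind>": "kokar<n><m><sg><ind>",
}


def lookup_multiword(surface: str, analysis: str, bidix: Dict[str, str]) -> str:
    """Translate a '+'-joined multiword analysis part by part.

    If any part is unknown, give up on the whole multiword and return the
    original surface form marked with '@'. Otherwise return one lexical
    unit per part, the way pretransfer would have split them.
    """
    parts = analysis.split("+")
    if any(part not in bidix for part in parts):
        return f"^{surface}/@{surface}$"
    return " ".join(f"^{part}/{bidix[part]}$" for part in parts)


if __name__ == "__main__":
    print(lookup_multiword(
        "vannmagasin",
        "vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>",
        TOY_BIDIX,
    ))
    print(lookup_multiword(
        "vannkoker",
        "vann<n><nt><sg><ind><cmp>+koker<n><m><sg><ind>",
        TOY_BIDIX,
    ))
```

With the toy dictionary, "vannmagasin" comes out as a single @-marked unit and "vannkoker" as two translated units, matching the two outputs shown above.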
This would avoid the need to trim monodix to bidix, and we'd still have surface forms for words that are not in bidix, or even partially not in bidix. We'd also avoid outputting lemmas.
Generation errors would still be a problem, but since the generator is built from a monodix, it will in general be larger than bidix anyway.
Keeping the surface form until generation
Say you do:
$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer
^vannmagasin/vann<n><nt><sg><ind><cmp>$ ^/magasin<n><nt><sg><ind>$
# Currently, apertium-pretransfer outputs ^magasin<n><nt><sg><ind>$, we'd need it to ensure an empty surface form here

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin
^*vannmagasin/vann<n><nt><sg><ind><cmp>/vatn<n><nt><sg><ind><cmp>$ ^/magasin<n><nt><sg><ind>/@magasin<n><nt><sg><ind>$
# Currently, lt-proc -o only _accepts_ surface forms, it doesn't output them (nor does it output @analysis correctly)

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin | apertium-transfer -o apertium-nn-nb.nb-nn.t1x nb-nn.t1x.bin
^n_n<n><nt><sg><ind>{^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$}$
# apertium-transfer would need an -o option that is able to pass through the surface form
# Interchunk should need no change, since it doesn't deal with the insides of the chunk.
# Postchunk might need a slight change to notice and output the original surface form.

# But by generation you have a problem. Say you end up with
#   ^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$
# What do you output?
# - If you generate "vatn", you can't generate "magasin" (that's just a lemma, and outputting plain
#   lemmas can lead to altering or even negating meaning).
# - Ideally you want to generate the original surface form "vannmagasin", but how do you know that?
#   Transfer could have moved about the @-analysis, and in any case, how many lexical units are we
#   talking about here? An mwe can be of unlimited length.
A difficult example from Kazakh:
килмәгәнмен: кил.1SG.NEG + мен, "I didn't arrive/show up", is both a compound and a morphological negative. So, if кил had no translation, outputting the lemma alone would severely alter the meaning.
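To make the generation problem concrete, here is a naive Python sketch of one possible fallback: attach every empty-surface unit to the nearest preceding unit that carries a surface form, and emit that surface form if any unit in the group is @-marked. The helper names are invented, and the approach assumes transfer kept the parts adjacent and in order, which is exactly what cannot be guaranteed.

```python
import re
from typing import List, Tuple

# Matches ^surface/target$ where the surface form may be empty.
LU_PATTERN = re.compile(r"\^([^/$]*)/([^$]*)\$")


def parse_units(stream: str) -> List[Tuple[str, str]]:
    """Return (surface, target) pairs for each lexical unit in the stream."""
    return LU_PATTERN.findall(stream)


def group_by_surface(units: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Attach each empty-surface unit to the preceding surface-bearing unit."""
    groups: List[Tuple[str, List[str]]] = []
    for surface, target in units:
        if surface or not groups:
            groups.append((surface, [target]))
        else:
            groups[-1][1].append(target)
    return groups


def choose_output(stream: str) -> List[str]:
    """Emit the surface form for any group with an @-marked part;
    otherwise pass the target analyses on for generation."""
    output: List[str] = []
    for surface, targets in group_by_surface(parse_units(stream)):
        if surface and any(target.startswith("@") for target in targets):
            output.append(surface)
        else:
            output.extend(f"^{target}$" for target in targets)
    return output


if __name__ == "__main__":
    in_order = "^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$"
    print(choose_output(in_order))
    # Same units after a hypothetical reordering by transfer.
    reordered = "^/@magasin<n><nt><sg><ind>$^vannmagasin/vatn<n><nt><sg><ind><cmp>$"
    print(choose_output(reordered))
```

In the first call the group is recovered and "vannmagasin" is emitted. In the second, the @-marked unit has lost its link to the surface form, so the sketch falls back to a bare @-analysis plus a generated "vatn", which is the failure described above.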