Talk:Why we trim



Possible method for keeping surface forms and avoiding trimming

A possible way of keeping surface forms when splitting MWEs: put the full surface form on the first part and no surface form on the rest. Example, assuming "magasin" is missing from bidix:

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>/vann<n><nt><sg><ind><cmp>+magasin<n><nt><pl><ind>$
$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>$

From here on there are two possibilities: try to keep the surface form until bidix, or try to keep it all the way until generation.

  • If we keep it until bidix, we can output the surface form if it's not in bidix and avoid trimming monodix to bidix. We can still end up with generation errors where bidix contains words not in the generation dictionary, but it's better than nothing.
  • If we could keep it until generation, we could output the original surface form even if none of the dictionaries were trimmed.


Keeping the surface form until bidix

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer -o
^vannmagasin/vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>$
# apertium-pretransfer currently outputs this as two units; it would need a slight change
$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer -o | lt-proc -o nb-nn.autobil.bin
^vannmagasin/@vannmagasin$
$ echo vannkoker   | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer -o | lt-proc -o nb-nn.autobil.bin
^vann<n><nt><sg><ind><cmp>/vatn<n><nt><sg><ind><cmp>$ ^koker<n><m><sg><ind>/kokar<n><m><sg><ind>$

This would avoid the need to trim monodix to bidix, and we'd still have surface forms for words that are not in bidix, or only partially in bidix. We'd also avoid outputting bare lemmas.

Generation errors would still be a problem, but since the generator is a monodix it will in general be larger anyway.


Currently pretransfer does this to multiwords with both + and #:

^sf/lm<v>+lm1<n>#queue$ → ^sf/lm#queue<v>$ ^lm1<n>$
^sf/lm<v>#queue+lm1<n>$ → ^sf/lm#queue<v>$ ^lm1<n>$

We would want it to do this:

^sf/lm<v>+lm1<n>#queue$ → ^sf/lm#queue<v>+lm1<n>$
^sf/lm<v>#queue+lm1<n>$ → ^sf/lm#queue<v>+lm1<n>$

and then let lt-proc output ^sf/@sf$ if either lm#queue<v> or lm1<n> is unknown, and otherwise output ^lm#queue<v>/lmf#queuef<v>$ ^lm1<n>/lm1f<n>$
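
A minimal sketch of that pretransfer change, in Python rather than in the actual apertium-pretransfer C++ code; the function name and the assumption of a single analysis per lexical unit are mine, for illustration only:

import re

def move_queue_to_head(lu):
    """Move the #queue next to the head lemma, but keep the +-joined parts
    in a single lexical unit with the original surface form, e.g.
      ^sf/lm<v>+lm1<n>#queue$  ->  ^sf/lm#queue<v>+lm1<n>$
      ^sf/lm<v>#queue+lm1<n>$  ->  ^sf/lm#queue<v>+lm1<n>$
    Assumes one analysis per unit."""
    surface, analysis = lu.strip("^$").split("/", 1)
    m = re.search(r"#[^<+$]+", analysis)       # the queue, wherever it sits
    if not m:
        return lu
    queue = m.group(0)
    analysis = analysis[:m.start()] + analysis[m.end():]
    head_end = analysis.index("<")             # end of the head lemma
    analysis = analysis[:head_end] + queue + analysis[head_end:]
    return "^" + surface + "/" + analysis + "$"

assert move_queue_to_head("^sf/lm<v>+lm1<n>#queue$") == "^sf/lm#queue<v>+lm1<n>$"
assert move_queue_to_head("^sf/lm<v>#queue+lm1<n>$") == "^sf/lm#queue<v>+lm1<n>$"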

Keeping the surface form until generation

Say you do:

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer
^vannmagasin/vann<n><nt><sg><ind><cmp>$ ^/magasin<n><nt><sg><ind>$
# Currently, apertium-pretransfer outputs ^magasin<n><nt><sg><ind>$; we'd need it to ensure an empty surface form here

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin
^*vannmagasin/vann<n><nt><sg><ind><cmp>/vatn<n><nt><sg><ind><cmp>$ ^/magasin<n><nt><sg><ind>/@magasin<n><nt><sg><ind>$
# Currently, lt-proc -o only _accepts_ surface forms; it doesn't output them (nor does it output @analysis correctly)

$ echo vannmagasin | lt-proc -we nb-nn.automorf.bin | apertium-tagger -gp nb-nn.prob | apertium-pretransfer | lt-proc -o nb-nn.autobil.bin | apertium-transfer -o apertium-nn-nb.nb-nn.t1x nb-nn.t1x.bin 
^n_n<n><nt><sg><ind>{^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$}$
# apertium-transfer would need an -o option that is able to pass through the surface form

# Interchunk should need no change, since it doesn't deal with the insides of the chunk. 
# Postchunk might need a slight change to notice and output the original surface form.

# But by generation you have a problem. Say you end up with
# ^vannmagasin/vatn<n><nt><sg><ind><cmp>$^/@magasin<n><nt><sg><ind>$
# What do you output? 
# - If you generate "vatn", you can't generate "magasin" (that's just a lemma, and outputting plain 
#   lemmas can lead to altering or even negating meaning).
# - Ideally you want to generate the original surface form "vannmagasin", but how do you know that? 
#   Transfer rules could have moved the @-analysis around, and in any case, how many lexical units are we 
#   talking about here? An MWE can be of unlimited length.

A difficult example from Kazakh:

килмәгәнмен: кил.1SG.NEG + мен, "I didn't arrive/show up", is both a compound and a morphological negative. So, if кил had no translation, outputting the lemma alone would severely alter the meaning.


Discussion on compounding, possible tagging solution

[08:23:00] <Unhammer> khannatanmai,  ls modes/@*
[08:23:22] <Unhammer> the debug modes starting with @ in them refer to an untrimmed analyser
[08:23:51] <Unhammer> you may have to provide the analyser yourself, but that's just copying it from the monolingual dir
[08:26:46] <Unhammer> https://i.imgur.com/4XrftuJ.png
[08:28:02] <Unhammer> https://i.imgur.com/gfMO0s2.png
[08:31:45] <Unhammer> (As the examples show, compounds are a challenge. If both analyser and bidix has boazoguohtun and kommišuvdna, but only analyser has boazoguohtunkommišuvdna, then trimmed will translate it as parts and give an actual translation, whereas untrimmed will give @sourcelemma)
[08:37:57] <khannatanmai> Unhammer yeah compounds is the only part I'm still not sure about. One way is to still let partially unknown compounds be trimmed
[08:39:26] <khannatanmai> maybe the partially known part of a compound can be useful?
[08:45:54] <khannatanmai> > (As the examples show, compounds are a challenge. If both analyser and bidix has boazoguohtun and kommišuvdna, but only analyser has boazoguohtunkommišuvdna, then trimmed will translate it as parts and give an actual translation, whereas untrimmed will give @sourcelemma)
[08:45:54] <khannatanmai> In this case,  there wont be any trimming right?
[08:46:41] <khannatanmai> boazoguohtunkommišuvdna will be analysed as boazoguohtun+kommišuvdna, since it's in the monodix, and since both the parts are in the bidix, it wont be trimmed and will translate as parts
[08:46:52] <khannatanmai> I don't see how trimming changes anything here
[09:32:07] <Unhammer> khannatanmai, without trimming, boazoguohtunkommišuvdna will be analysed as ^boazoguohtunkommišuvdna<n>$
[09:33:08] <Unhammer> Maybe there's a smart way to work around it, I haven't looked deeply into it, but it seems like a real challenge.
[09:36:44] <khannatanmai> wouldnt the monodix identify it as a compound?
[09:36:53] <Unhammer> no
[09:37:30] <khannatanmai> I thought boazoguohtunkommišuvdna is analysed as boazoguohtun<xyz>+kommišuvdna<xyz>
[09:37:51] <khannatanmai> (whatever the lemmas are)
[09:38:06] <Unhammer> in the untrimmed, you get boazoguohtunkommišuvdna<n><sem_org><sg><nom>
[09:38:20] <Unhammer> in the trimmed, you get boazoguohtun<n><cmp_sggen><cmp>+kommišuvdna<n><sem_org><sg><nom>
[09:39:30] <khannatanmai> Alright but then: boazoguohtunkommišuvdna<n><sem_org><sg><nom> these sort of entries shouldnt be in the monodix anyway right? If the word is a compound why is there an entry analysing it as one LU
[09:41:38] <khannatanmai> In the untrimmed dictionary we have both:
[09:41:39] <khannatanmai> boazoguohtunkommišuvdna -> boazoguohtunkommišuvdna<n><sem_org><sg><nom>
[09:41:39] <khannatanmai> boazoguohtunkommišuvdna -> boazoguohtun<n><cmp_sggen><cmp>+kommišuvdna<n><sem_org><sg><nom>
[09:41:41] <khannatanmai> right?
[09:42:01] <khannatanmai> trimming trims the first one since the whole compound wont be in the bidix
[09:42:10] <Unhammer> Dynamic compound analyses are generally more unsafe, so only used if we can't find a full analysis
[09:42:19] <Unhammer> That's why we include both.
[09:43:04] <khannatanmai> ah ok I wasnt aware of that
[09:44:43] <Unhammer> (e.g. on the form "nyrestaurert" (newly restored), I once saw the analysis "nyre+staur+ert" (kidney+stick+pea), until I added newly-restored to the dixen (and turned off dynamic compounding for the rare word "staur", just in case))
[09:47:28] <khannatanmai> whats dynamic compounding again?
[09:51:54] <Unhammer> When a form doesn't get a regular analysis (using entries and paradigms etc), it would normally be output as *unknownform. But we can retry by splitting at all points of the string to see if it's analysable as two parts (where each part is analysable in the regular way, but the analyses must have certain special tags that say they're ok to be used in compounds)
[09:52:38] <Unhammer> If two parts doesn't work, we try three, four
[09:52:46] <khannatanmai> okay I didn't even know this existed :p
[09:53:02] <khannatanmai> that's interesting
[09:53:14] <khannatanmai> and now another problem for eliminating trimming
[09:55:20] <Unhammer> (at least that's how lt-proc does it; in hfst that system is normally encoded in the fst with an arc from final to initial and a flag diacritic to restrict analyses, and a higher weight so they're down-prioritised, the effect is the same)
[09:57:01] <khannatanmai> I was of the opinion that compounds are only analysed as such if the monodix says they're compounds. Didn't know that we try out possible compounds
[09:58:22] <khannatanmai> My question is, dynamic compounding happens on unknown words, and if a compound is known, it splits them. What then is the use of boazoguohtunkommišuvdna<n><sem_org><sg><nom> ?
[09:58:56] <khannatanmai> Because dynamic compounding could give an erroneous compound?
[10:06:55] <khannatanmai> Alright I see the problem. Will think about it
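
A toy illustration of the dynamic compounding fallback described above (plain Python, not the actual lt-proc implementation); the mini-lexicon and the literal <cmp> tag are invented for the example:

# Toy lexicon: surface form -> analyses. <cmp> marks analyses that may
# serve as a non-final part of a dynamic compound.
LEXICON = {
    "vann":    ["vann<n><nt><sg><ind><cmp>", "vann<n><nt><sg><ind>"],
    "magasin": ["magasin<n><nt><sg><ind>"],
}

def analyses(form, in_compound=False):
    """Regular lookup; inside a compound, non-final parts need <cmp>."""
    out = LEXICON.get(form, [])
    return [a for a in out if "<cmp>" in a] if in_compound else out

def compound_analyses(form, max_parts=4):
    """Fallback for unknown forms: try every split point, first into two
    parts, then three, and so on, as long as every part is analysable."""
    if max_parts < 2:
        return []
    results = []
    for i in range(1, len(form)):
        left, right = form[:i], form[i:]
        for left_a in analyses(left, in_compound=True):
            tails = analyses(right) or compound_analyses(right, max_parts - 1)
            for right_a in tails:
                results.append(left_a + "+" + right_a)
    return results

form = "vannmagasin"
print(analyses(form) or compound_analyses(form))
# ['vann<n><nt><sg><ind><cmp>+magasin<n><nt><sg><ind>']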

[10:55:44] <Unhammer> <+khannatanmai> Because dynamic compounding could give an erroneous compound?
[10:55:47] <Unhammer> Yes
[10:56:26] <Unhammer> But also, it might have a different translation than the compositional one
[10:57:18] <khannatanmai> but even if the compositional one is wrong, you would've already trimmed out the non-compositional one right
[10:57:38] <khannatanmai> Do you think there's a solution which involves keeping both of them and making a decision later in the pipeline
[10:58:09] <Unhammer> Could be
[10:58:43] <khannatanmai> I was trying hard to think of solutions which decouple the source from the target but I guess in cases like these we cant avoid a bidix dependency
[11:00:44] <Unhammer> yeah, the immediate issue is that we would want bidix info in order to make the least bad translation
[11:00:45] <Unhammer> but disambiguation happens before bidix
[11:03:10] <khannatanmai> yeah. Philosophically, disambiguation should handle these things and not the bidix, but even in the discussion with Francis we thought that using the bidix is far more practical than making the disambiguation better
[11:03:40] <khannatanmai> Cause ideally, what we can translate and what we can't, shouldn't affect source analysis decisions
[11:06:22] <Unhammer> Given some ambig analysis AB/A<tags1>+B<tags2>/AB<tags3>, if we pick AB<tags3> in cg, we might end up with an @AB (if AB is not in bidix) where A+B would have given an actual translation. But if AB *is* in bidix, and we pick A+B, we might get a worse translation than we could have (e.g. "fotballpøbel" turning into "football punk" instead of "hooligan") if we knew that AB was in bidix.
[11:07:18] <Unhammer> Could it be a possibility to *tag* all analyses that are (not) in bidix?
[11:07:40] <khannatanmai> Yeah Francis was talking about weighing analyses based on the bidix
[11:07:46] <Unhammer> exactly

[11:08:08] <khannatanmai> it's still a dependency, so philosophically similar to trimming, but I guess it's better than throwing the analyses away
[11:09:16] <Unhammer> The existing trimming apparatus could probably be fairly easily used for that – currently, it slices off that part of the analyser, so you have fst's UNTRIMMED and TRIMMED, you could do UNTRIMMED - TRIMMED = NOTINBIDIX, and put a tag/weight on the last one and then union. Or something like that, many ways to do that.
[11:11:00] <khannatanmai> yup that could work. in a few cases we can use the tags to effectively emulate trimming, and in other cases we can use the analyses from LUs that aren't in the bidix
[11:13:52] <khannatanmai> simultaneously we implement the stream modification to have secondary tags and surface form, and overall we can keep the benefits of trimming and remove the disadvantages
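
A toy illustration of the UNTRIMMED - TRIMMED = NOTINBIDIX idea from the log above: plain Python sets of (surface form, analysis) pairs stand in for the transducers, and the <notbil> tag name is invented; in practice this would be done on the FSTs themselves (e.g. with hfst-subtract and hfst-disjunct, or with the existing lttoolbox trimming code).

# The two analysers, reduced to sets of (surface form, analysis) pairs.
UNTRIMMED = {
    ("boazoguohtunkommišuvdna",
     "boazoguohtunkommišuvdna<n><sem_org><sg><nom>"),
    ("boazoguohtunkommišuvdna",
     "boazoguohtun<n><cmp_sggen><cmp>+kommišuvdna<n><sem_org><sg><nom>"),
}
TRIMMED = {
    ("boazoguohtunkommišuvdna",
     "boazoguohtun<n><cmp_sggen><cmp>+kommišuvdna<n><sem_org><sg><nom>"),
}

# UNTRIMMED - TRIMMED = NOTINBIDIX: analyses bidix cannot translate.
NOTINBIDIX = UNTRIMMED - TRIMMED

# Tag (or weight) those analyses and union everything back together, so
# disambiguation still sees every analysis but knows which ones bidix covers.
TAGGED = {(sf, ana + "<notbil>") for (sf, ana) in NOTINBIDIX}
ANALYSER = TRIMMED | TAGGED

for sf, ana in sorted(ANALYSER):
    print("^" + sf + "/" + ana + "$")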

http://tinodidriksen.com/pisg/freenode/logs/%23apertium/2020-05-03.log

https://sourceforge.net/p/apertium/tickets/88/