Difference between revisions of "User talk:Khannatanmai/GSoC2020Proposal Trimming"
Latest revision as of 10:04, 3 May 2020
28 March 2020
khannatanmai
i just cant think of a lot of uses of the surface form for a lot of the programs

TinoDidriksen
You should see how much secondary information VISL's streams have. Noun semantics, verb frames, dependency, markup tags, etc. Being able to carry any information along makes many things possible, often things you can't imagine because of current limitations.

khannatanmai (@khannatanmai:matrix.org)
yeah fair enough
alright in-stream it is
I wonder how many parsers will have to be modified for this

popcorndude
probably all of them

TinoDidriksen
Yup

khannatanmai (@khannatanmai:matrix.org)
maybe an interesting exercise will be to make the stream scalable in a sense so that we can add an arbitrary amount of information and the parsers wont have to be modified

TinoDidriksen
That is what I wanted out of the task. Dunno what tricks would be most useful. Tag prefixes? Delimiters?

TinoDidriksen
I would think prefixes would be most flexible.

popcorndude
^x<y>/blah<z>/anaphor<q>:arbitrary information$

khannatanmai (@khannatanmai:matrix.org)
or we keep adding more slashes :p

popcorndude
: being some separator character with arbitrary data after it (also separated by /?) which programs can examine or pass along unmodified as they see fit

khannatanmai (@khannatanmai:matrix.org)
also some way to identify what is the data

TinoDidriksen
No, it must be per-reading, because it's tied to each reading.

khannatanmai (@khannatanmai:matrix.org)
instead of just relying on order

popcorndude
{}

TinoDidriksen
Identification is why I say prefixes.

popcorndude
are reserved characters currently only used by t*x

khannatanmai (@khannatanmai:matrix.org)
how about completely overhaul the stream to forget the order and just read data type and data
sl:xyz, stag:abc, anaphor:ghy, sf:xyzabc
in a stream format of course
too much? :p

TinoDidriksen
Well, it has to be human editable.

popcorndude
^potato<n><sg>{case:aa/other-prefix:other-value}/patata<n><f><sg>{more:other}$

spectei
eww :P

khannatanmai (@khannatanmai:matrix.org)
^source:x<y>/target:blah<z>/anaphor:jky<q>/s_surface:yadayada/etc.$

TinoDidriksen
That could almost work, though I'd prefer ^potato<n><sg>{case:aa}{other-prefix:other-value}/patata<n><f><sg>{more:other}$ - one segment per datum.

khannatanmai (@khannatanmai:matrix.org)
how's this^

spectei
i'd prefer fixed semantics for slashes
the feat:val is a bit cumbersome

TinoDidriksen
I don't see how you're going to avoid prefixing in Apertium's stream format. We can certainly say that the initial tags are kept as-is, but need to know where that ends and secondary information begins. And the secondary information needs prefixing or some other identification.

khannatanmai (@khannatanmai:matrix.org)
i was thinking of a way to remove any distinction between primary and secondary information

TinoDidriksen
Prefixing everything is too verbose.

khannatanmai (@khannatanmai:matrix.org)
hmmm. well the arbit info we'll have to prefix anyway
so yeah we could have a mixture of syntax
^potato<n><sg>{case:aa}{other-prefix:other-value}/patata<n><f><sg>{more:other}$
this seems doable but looks ugly? :p

TinoDidriksen
Then do ^potato<n><sg><case:aa><other-prefix:other-value>/patata<n><f><sg><more:other>$ - no current tag has :, so tags with : must be secondary. And if they're always trailing, there's much less that can go wrong.
Then we can centrally namespace things, with first namespace being s: for surface forms.

khannatanmai (@khannatanmai:matrix.org)
what about multiwords

TinoDidriksen
And later on if we want t: for markup tags, that would just be passed through as-is.

khannatanmai (@khannatanmai:matrix.org)
the arbit info would apply to them as a whole, so if it's in a tag markup you would have to put it on one of the parts or both the parts?

TinoDidriksen
Both parts

khannatanmai (@khannatanmai:matrix.org)
it would give some choice to put arbit info on the parts as well actually

khannatanmai (@khannatanmai:matrix.org)
^potato<n><sg><case:aa><other-prefix:other-value>/patata<n><f><sg><more:other>$
this looks good to me. number of tags is already arbitrary so it works. a word in monodix with arbit info on it would still match the word in bidix which doesnt have the info

popcorndude
getting -recursive to handle that properly might be an adventure

khannatanmai (@khannatanmai:matrix.org)
wait riot did something i think

popcorndude
but not too terrible

TinoDidriksen
Riot probably just showed it raised because of ^ ?

khannatanmai (@khannatanmai:matrix.org)
it does go against the usual semantics of a tag markup but we can live with it
Riot changed it into a link or something

khannatanmai (@khannatanmai:matrix.org)
if this looks good I'll put it in the proposal and send a mail to the mailing list

spectei
firespeaker: not sure

khannatanmai (@khannatanmai:matrix.org)
spectei what do you think
spectei: firespeaker my current list of tasks
are there other priority tasks that i'm missing? I saw something about collab support, but i think that is already done

popcorndude
that also solves the issue of encoding rule numbers if we did try having a split mode for -recursive: add <rtx:17> to chunks

TinoDidriksen
It would solve so many issues... I was quite serious when I said this is the granddaddy of fundamental design constraints in Apertium.

khannatanmai (@khannatanmai:matrix.org)
well sounds fun. I'll send a mail to the mailing list and see what the people have to say
so the project will be overhauling the apertium stream format and eliminating trimming in the process lol

TinoDidriksen
Well, it would render trimming superfluous.

popcorndude
passing such tags through t*x will be troublesome
rtx too, but I think slightly less so

TinoDidriksen
I don't see why - they just get carried through.

khannatanmai (@khannatanmai:matrix.org)
doesnt transfer match the patterns if the tags are in there and ignore the extra tags?

popcorndude
I'm getting mixed up about a couple of different things in t*x
t*x is probably fine

TinoDidriksen
If the token is multiplied, the input tags are also. If deleted, ditto.

popcorndude
rtx usually doesn't just pass things through

khannatanmai (@khannatanmai:matrix.org)
wow you could actually do a LOT of things once you can have arbitrary info

TinoDidriksen
Yup, it's extremely important.

popcorndude
rtx actually builds new tokens by grabbing pieces of the input, so actually it's just a matter of having it grab the arbitrary tags too

popcorndude
it's largely a matter of tweaking parsing loops so that the code that checks for <> also checks for : and if found either looks for prefixes or just copies following tags into some buffer for output

khannatanmai (@khannatanmai:matrix.org)
yeah seems doable

TinoDidriksen
It's absolutely a lot of work, but it is doable.
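
Editor's note: the format the discussion converges on (ordinary tags first, then trailing tags that contain ":" as secondary information) can be illustrated with a small parser. This is only a sketch under stated assumptions, not how any Apertium program actually parses the stream: the function names `parse_reading` and `parse_lexical_unit` are made up for this example, and escaping, compounds joined with "+", and multiwords are ignored.

```python
import re

# Matches one <...> tag and captures its contents.
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_reading(reading):
    """Split one reading into (lemma, primary tags, secondary data).

    Assumption taken from the discussion: no existing tag contains ':',
    so any tag with ':' is treated as secondary prefix:value information.
    """
    first_tag = reading.find("<")
    lemma = reading if first_tag == -1 else reading[:first_tag]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            prefix, value = tag.split(":", 1)
            secondary[prefix] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary


def parse_lexical_unit(unit):
    """Parse '^reading/reading/...$' into a list of parsed readings."""
    body = unit.strip()
    if body.startswith("^") and body.endswith("$"):
        body = body[1:-1]
    # Simplification: a real parser would have to respect escaped slashes.
    return [parse_reading(r) for r in body.split("/")]


if __name__ == "__main__":
    unit = "^potato<n><sg><case:aa><other-prefix:other-value>/patata<n><f><sg><more:other>$"
    for lemma, primary, secondary in parse_lexical_unit(unit):
        print(lemma, primary, secondary)
```

A program that only cares about primary information can ignore the third element of each tuple, which is the sense in which the extra tags "just get carried through".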
29 March 2020
xavivars
khannatanmai I guess that if we don't allow adding data to the dix, we're constraining a lot the data that will be in the stream, having a super flexible format that works carries pretty much the same information that it carries now
The important piece here is to make sure it is clear that all this data is optional in the dix

khannatanmai (@khannatanmai:matrix.org)
yeah plus a lot of the secondary information can be information added manually, so not allowing data to the dix would make this considerably less powerful
yup
plus even with the modification of the parsers the current format should still work properly, which is why everything else will be secondary and optional

xavivars
Because that's what I think was Hèctor's main concern

khannatanmai (@khannatanmai:matrix.org)
yeah I hope I made that clear in the reply
I think what he was also saying was that I will add info in the dix for my task and that wont be relevant for everyone else who uses that dix, so there will be a lot of redundant info

TinoDidriksen
I wasn't thinking any of the secondary information goes in any dix files, but I guess it could...
Initially, nothing goes into any dixes.

khannatanmai (@khannatanmai:matrix.org)
yeah initially it's just about what the programs can put in the stream

xavivars
Definitely not for trimming

spectei
i think it might make sense to come up with a list of desiderata
before trying to come up with formats
e.g. should it be human editable / sed+awk+greppable

TinoDidriksen
See, this is where the current constraints affect thinking. I don't think Hèctor can imagine this isn't dynamic stream information, so he's worried all the dixes will change.

spectei
what kind of information might we want to include?

xavivars
But examples like the ones you mention about sentiment, etc... They'll need to come from "somewhere" (being "somewhere" a dix or a data file for a future module in the pipeline)

TinoDidriksen
Initially: Surface form, markup tags, dependency. Things that do not go into dixes.

khannatanmai (@khannatanmai:matrix.org)
yeah sentiment was a future example, but stuff like surface form is already in the dix
in my head the info can come either from data files or the programs

TinoDidriksen
Sure

xavivars
spectei, don't you think the proposed format is sed/awk/greppable?

TinoDidriksen
It's absolutely greppable.

khannatanmai (@khannatanmai:matrix.org)
in fact i was thinking about the "isn't" "ain't" example, and it might sound as not a big deal, but in the dix we can distinguish them with secondary information

xavivars
The thing I liked the most when I saw the proposal in the email was that it felt that "nothing really changes"

TinoDidriksen
Which is true - nothing really changes, until people want to make use of it. It's opt-in, but the power to do so is a major overhaul.

khannatanmai (@khannatanmai:matrix.org)
yep. plus if we're going to be modifying the parsers of all the programs to include surface form, it is a good idea to make it so that we can include arbitrary info
rather than modify all parsers tomorrow again

TinoDidriksen
Definitely

khannatanmai (@khannatanmai:matrix.org)
even when people modify the dix to include secondary information, it won't affect people who just use the primary info in their tasks

xavivars
Sorry, my point was that even if people use it, "it looks the same": same format for tags, etc. Except that the concept of secondary (and because of that, optional) information gives huge opportunities
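
Editor's note: the "nothing really changes" and "greppable" points can be made concrete with two regular expressions. The sketch below assumes the `<prefix:value>` tag form proposed on 28 March, with `s:` as the surface-form namespace mentioned there; the helper names `strip_secondary` and `secondary_values` and the sample stream are hypothetical, chosen only for illustration.

```python
import re

# A secondary tag is any <...> tag whose contents include a ':'.
SECONDARY_TAG = re.compile(r"<[^<>:]*:[^<>]*>")


def strip_secondary(stream):
    """Drop all secondary tags, leaving the stream as it looks today."""
    return SECONDARY_TAG.sub("", stream)


def secondary_values(stream, prefix):
    """Return every value stored under the given prefix, in stream order."""
    pattern = re.compile(r"<" + re.escape(prefix) + r":([^<>]*)>")
    return pattern.findall(stream)


if __name__ == "__main__":
    # Hypothetical stream with a case marker and a surface form attached.
    stream = "^potato<n><sg><case:aa><s:Potato>$"
    print(strip_secondary(stream))
    print(secondary_values(stream, "s"))
```

A tool that has not opted in could apply something like `strip_secondary` (or simply skip tags containing ":") and see exactly the stream it sees now, while a plain pattern search for `<s:` is enough to pull out the new information.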
Dynamic Compounding 3 May 2020
heheh :)
khannatanmai, without trimming, boazoguohtunkommišuvdna will be analysed as ^boazoguohtunkommišuvdna<n>$
Maybe there's a smart way to work around it, I haven't looked deeply into it, but it seems like a real challenge.

khannatanmai (@khannatanmai:matrix.org)
wouldnt the monodix identify it as a compound?

Unhammer
no

khannatanmai (@khannatanmai:matrix.org)
I thought boazoguohtunkommišuvdna is analysed as boazoguohtun<xyz>+kommišuvdna<xyz>
(whatever the lemmas are)

Unhammer
only in the untrimmed dix
sorry, trimmed
in the untrimmed, you get boazoguohtunkommišuvdna<n><sem_org><sg><nom>
in the trimmed, you get boazoguohtun<n><cmp_sggen><cmp>+kommišuvdna<n><sem_org><sg><nom>

khannatanmai (@khannatanmai:matrix.org)
Alright but then: boazoguohtunkommišuvdna<n><sem_org><sg><nom> these sort of entries shouldnt be in the monodix anyway right? If the word is a compound why is there an entry analysing it as one LU
In the untrimmed dictionary we have both:
boazoguohtunkommišuvdna -> boazoguohtunkommišuvdna<n><sem_org><sg><nom>
boazoguohtunkommišuvdna -> boazoguohtun<n><cmp_sggen><cmp>+kommišuvdna<n><sem_org><sg><nom>
right?
trimming trims the first one since the whole compound wont be in the bidix

Unhammer
Dynamic compound analyses are generally more unsafe, so only used if we can't find a full analysis
That's why we include both.

khannatanmai (@khannatanmai:matrix.org)
ah ok I wasnt aware of that

Unhammer
(e.g. on the form "nyrestaurert" (newly restored), I once saw the analysis "nyre+staur+ert" (kidney+stick+pea), until I added newly-restored to the dixen (and turned off dynamic compounding for the rare word "staur", just in case))

khannatanmai (@khannatanmai:matrix.org)
whats dynamic compounding again?

Unhammer
When a form doesn't get a regular analysis (using entries and paradigms etc), it would normally be output as *unknownform. But we can retry by splitting at all points of the string to see if it's analysable as two parts (where each part is analysable in the regular way, but the analyses must have certain special tags that say they're ok to be used in compounds)
If two parts doesn't work, we try three, four

khannatanmai (@khannatanmai:matrix.org)
okay I didn't even know this existed :p
that's interesting
and now another problem for eliminating trimming

Unhammer
(at least that's how lt-proc does it; in hfst that system is normally encoded in the fst with an arc from final to initial and a flag diacritic to restrict analyses, and a higher weight so they're down-prioritised, the effect is the same)
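
Editor's note: the fallback Unhammer describes (use a full analysis if there is one, otherwise try two parts, then three, four, and only then give up) can be sketched as a toy analyser. Everything here is an assumption made for illustration: the lexicon is a plain dictionary, the `ok_as_left` / `ok_as_right` flags stand in for the "special tags" that allow an analysis inside a compound, and `max_parts=4` is an arbitrary cap. It is not lt-proc's actual implementation.

```python
from itertools import combinations, product

# Toy lexicon (hypothetical): surface form -> [(analysis, ok_as_left, ok_as_right)]
TRIMMED = {
    "boazoguohtun": [("boazoguohtun<n><cmp_sggen><cmp>", True, False)],
    "kommišuvdna": [("kommišuvdna<n><sem_org><sg><nom>", False, True)],
}
# The untrimmed lexicon additionally lists the whole compound as one entry.
UNTRIMMED = {
    **TRIMMED,
    "boazoguohtunkommišuvdna": [
        ("boazoguohtunkommišuvdna<n><sem_org><sg><nom>", False, False)
    ],
}


def compound_analyses(form, lexicon, n_parts):
    """All analyses of `form` as exactly n_parts compound-permitted parts."""
    results = []
    for cuts in combinations(range(1, len(form)), n_parts - 1):
        bounds = (0, *cuts, len(form))
        parts = [form[a:b] for a, b in zip(bounds, bounds[1:])]
        options = []
        for i, part in enumerate(parts):
            is_last = i == len(parts) - 1
            # Non-final parts must be allowed on the left, the final part on the right.
            ok = [a for a, left, right in lexicon.get(part, [])
                  if (right if is_last else left)]
            if not ok:
                break
            options.append(ok)
        else:
            results.extend("+".join(combo) for combo in product(*options))
    return results


def analyse(form, lexicon, max_parts=4):
    """Full analysis if available, else the fewest-part compound, else *form."""
    full = [a for a, _, _ in lexicon.get(form, [])]
    if full:
        return full
    for n_parts in range(2, max_parts + 1):
        found = compound_analyses(form, lexicon, n_parts)
        if found:
            return found
    return ["*" + form]


if __name__ == "__main__":
    word = "boazoguohtunkommišuvdna"
    print(analyse(word, UNTRIMMED))
    print(analyse(word, TRIMMED))
```

With the untrimmed toy lexicon the single-unit analysis wins and the compound split is never tried; with the trimmed one the same word falls back to the two-part compound, which mirrors the behaviour described above.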