Difference between revisions of "User talk:Khannatanmai/GSoC2020Proposal Trimming"

Latest revision as of 10:04, 3 May 2020

28 March 2020

khannatanmai
i just cant think of a lot of uses of the surface form for a lot of the programs
TinoDidriksen
You should see how much secondary information VISL's streams have. Noun semantics, verb frames, dependency, markup tags, etc. Being able to carry any information along makes many things possible, often things you can't imagine because of current limitations.
khannatanmai (@khannatanmai:matrix.org)
yeah fair enough
alright in-stream it is
I wonder how many parsers will have to be modified for this
popcorndude
probably all of them
TinoDidriksen
Yup
khannatanmai (@khannatanmai:matrix.org)
maybe an interesting exercise will be to make the stream scalable in a sense so that we can add an arbitrary amount of information and the parsers wont have to be modified
TinoDidriksen
That is what I wanted out of the task.
Dunno what tricks would be most useful. Tag prefixes? Delimiters?
TinoDidriksen
I would think prefixes would be most flexible.
popcorndude
^x<y>/blah<z>/anaphor<q>:arbitrary information$
khannatanmai (@khannatanmai:matrix.org)
or we keep adding more slashes :p
popcorndude
: being some separator character
with arbitrary data after it (also separated by /?) which programs can examine or pass along unmodified as they see fit
khannatanmai (@khannatanmai:matrix.org)
also some way to identify what is the data
TinoDidriksen
No, it must be per-reading, because it's tied to each reading.
khannatanmai (@khannatanmai:matrix.org)
instead of just relying on order
popcorndude
{}
TinoDidriksen
Identification is why I say prefixes.
popcorndude
are reserved characters currently only used by t*x
khannatanmai (@khannatanmai:matrix.org)
how about completely overhaul the stream to forget the order and just read data type and data
sl:xyz, stag:abc, anaphor:ghy, sf:xyzabc
in a stream format of course
too much? :p
TinoDidriksen
Well, it has to be human editable.
popcorndude
^potato<n><sg>{case:aa/other-prefix:other-value}/patata<n><f><sg>{more:other}$
spectei
eww :P
khannatanmai (@khannatanmai:matrix.org)
^source:x<y>/target:blah<z>/anaphor:jky<q>/s_surface:yadayada/etc.$
TinoDidriksen
That could almost work, though I'd prefer ^potato<n><sg>{case:aa}{other-prefix:other-value}/patata<n><f><sg>{more:other}$ - one segment per datum.
khannatanmai (@khannatanmai:matrix.org)
how's this^
spectei
i'd prefer fixed semantics for slashes
the feat:val is a bit cumbersome
TinoDidriksen
I don't see how you're going to avoid prefixing in Apertium's stream format. We can certainly say that the initial tags are kept as-is, but need to know where that ends and secondary information begins.
And the secondary information needs prefixing or some other identification.
khannatanmai (@khannatanmai:matrix.org)
i was thinking of a way to remove any distinction between primary and secondary information
TinoDidriksen
Prefixing everything is too verbose.
khannatanmai (@khannatanmai:matrix.org)
hmmm. well the arbit info we'll have to prefix anyway so yeah we could have a mixture of syntax
^potato<n><sg>{case:aa}{other-prefix:other-value}/patata<n><f><sg>{more:other}$ this seems doable
but looks ugly? :p 
TinoDidriksen
Then do ^potato<n><sg><case:aa><other-prefix:other-value>/patata<n><f><sg><more:other>$ - no current tag has :, so tags with : must be secondary.
And if they're always trailing, there's much less that can go wrong.
Then we can centrally namespace things, with first namespace being s: for surface forms.
khannatanmai (@khannatanmai:matrix.org)
what about multiwords
TinoDidriksen
And later on if we want t: for markup tags, that would just be passed through as-is.
khannatanmai (@khannatanmai:matrix.org)
the arbit info would apply to them as a whole, so if it's in a tag markup you would have to put it on one of the parts
or both the parts?
TinoDidriksen
Both parts
khannatanmai (@khannatanmai:matrix.org)
it would give some choice to put arbit info on the parts as well actually
khannatanmai (@khannatanmai:matrix.org)
^potato<n><sg><case:aa><other-prefix:other-value>/patata<n><f><sg><more:other>$ this looks good to me. number of tags is already arbitrary so it works.
a word in monodix with arbit info on it would still match the word in bidix which doesnt have the info
popcorndude
getting -recursive to handle that properly might be an adventure
khannatanmai (@khannatanmai:matrix.org)
wait riot did something i think
popcorndude
but not too terrible
TinoDidriksen
Riot probably just showed it raised because of ^ ?
khannatanmai (@khannatanmai:matrix.org)
it does go against the usual semantics of a tag markup but we can live with it
Riot changed it into a link or something
khannatanmai (@khannatanmai:matrix.org)
if this looks good I'll put it in the proposal and send a mail to the mailing list
spectei
firespeaker: not sure
khannatanmai (@khannatanmai:matrix.org)
spectei what do you think
spectei: firespeaker my current list of tasks
are there other priority tasks that i'm missing?
I saw something about collab support, but i think that is already done
popcorndude
that also solves the issue of encoding rule numbers if we did try having a split mode for -recursive: add <rtx:17> to chunks
TinoDidriksen
It would solve so many issues...
I was quite serious when I said this is the granddaddy of fundamental design constraints in Apertium.
khannatanmai (@khannatanmai:matrix.org)
well sounds fun. I'll send a mail to the mailing list and see what the people have to say
so the project will be overhauling the apertium stream format and eliminating trimming in the process lol
TinoDidriksen
Well, it would render trimming superfluous.
popcorndude
passing such tags through t*x will be troublesome
rtx too, but I think slightly less so
TinoDidriksen
I don't see why - they just get carried through.
khannatanmai (@khannatanmai:matrix.org)
doesnt transfer match the patterns if the tags are in there and ignore the extra tags?
popcorndude
I'm getting mixed up about a couple of different things in t*x
t*x is probably fine
TinoDidriksen
If the token is multiplied, the input tags are also. If deleted, ditto.
popcorndude
rtx usually doesn't just pass things through
khannatanmai (@khannatanmai:matrix.org)
wow you could actually do a LOT of things once you can have arbitrary info
TinoDidriksen
Yup, it's extremely important.
popcorndude
rtx actually builds new tokens by grabbing pieces of the input, so actually it's just a matter of having it grab the arbitrary tags too
popcorndude
it's largely a matter of tweaking parsing loops so that the code that checks for <> also checks for : and if found either looks for prefixes or just copies following tags into some buffer for output
khannatanmai (@khannatanmai:matrix.org)
yeah seems doable
TinoDidriksen
It's absolutely a lot of work, but it is doable.
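
A rough sketch of the convention the discussion above converges on: ordinary tags come first, and any trailing tag containing ":" is treated as secondary prefix:value data. This is only an illustration in Python. The names parse_reading and parse_lu are hypothetical and not part of any Apertium tool, and the sketch assumes no escaped characters and no "/" inside secondary values.

```python
import re

# Matches one <...> tag and captures its contents.
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_reading(reading):
    """Split e.g. 'potato<n><sg><case:aa>' into (lemma, primary, secondary).

    Assumption from the chat: no existing tag contains ':', so any tag
    with ':' is secondary, and secondary tags are always trailing.
    """
    first = reading.find("<")
    lemma = reading if first == -1 else reading[:first]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            prefix, value = tag.split(":", 1)
            secondary[prefix] = value
        elif secondary:
            # A plain tag after a prefixed one breaks the "trailing" rule.
            raise ValueError("primary tag after secondary tag: " + tag)
        else:
            primary.append(tag)
    return lemma, primary, secondary


def parse_lu(lu):
    """Parse one '^...$' lexical unit into a list of readings."""
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError("not a lexical unit: " + lu)
    return [parse_reading(r) for r in lu[1:-1].split("/")]


if __name__ == "__main__":
    lu = "^potato<n><sg><case:aa><other-prefix:other-value>/patata<n><f><sg><more:other>$"
    for lemma, primary, secondary in parse_lu(lu):
        print(lemma, primary, secondary)
```

A program that does not care about the secondary data could copy the secondary dict through to its output unchanged, which is the "just gets carried through" behaviour described in the chat.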

29 March 2020

xavivars
khannatanmai I guess that if we don't allow adding data to the dix, we're heavily constraining the data that will be in the stream: we'd have a super flexible format that carries pretty much the same information that it carries now
The important piece here is to make sure it is clear that all of this data is optional in the dix
khannatanmai (@khannatanmai:matrix.org)
yeah plus a lot of the secondary information can be information added manually, so not allowing data to the dix would make this considerably less powerful
yup
plus even with the modification of the parsers the current format should still work properly, which is why everything else will be secondary and optional
xavivars
Because I think that was Hèctor's main concern
khannatanmai (@khannatanmai:matrix.org)
yeah I hope I made that clear in the reply
I think what he was also saying was that I will add info in the dix for my task and that wont be relevant for everyone else who uses that dix, so there will be a lot of redundant info
TinoDidriksen
I wasn't thinking any of the secondary information goes in any dix files, but I guess it could...
Initially, nothing goes into any dixes.
khannatanmai (@khannatanmai:matrix.org)
yeah initially it's just about what the programs can put in the stream
xavivars
Definitely not for trimming
spectei
i think it might make sense to come up with a list of desiderata
before trying to come up with formats
e.g. should it be human editable / sed+awk+greppable
TinoDidriksen
See, this is where the current constraints affect thinking. I don't think Hector can imagine this isn't dynamic stream information, so he's worried all the dixes will change.
spectei
what kind of information might we want to include?
xavivars
But examples like the ones you mention about sentiment, etc... They'll need to come from "somewhere" (being "somewhere" a dix or a data file for a future module in the pipeline)
TinoDidriksen
Initially: Surface form, markup tags, dependency. Things that do not go into dixes.
khannatanmai (@khannatanmai:matrix.org)
yeah sentiment was a future example, but stuff like surface form is already in the dix
in my head the info can come either from data files or the programs
TinoDidriksen
Sure
xavivars
spectei, don't you think the proposed format is sed/awk/greppable?
TinoDidriksen
It's absolutely greppable.
khannatanmai (@khannatanmai:matrix.org)
in fact i was thinking about the "isn't" "ain't" example, and it might not sound like a big deal, but in the dix we can distinguish them with secondary information
xavivars
The thing I liked the most when I saw the proposal in the email was that it felt that "nothing really changes"
TinoDidriksen
Which is true - nothing really changes, until people want to make use of it. It's opt-in, but the power to do so is a major overhaul.
khannatanmai (@khannatanmai:matrix.org)
yep. plus if we're going to be modifying the parsers of all the programs to include surface form, it is a good idea to make it so that we can include arbitrary info
rather than modify all parsers tomorrow again
TinoDidriksen
Definitely
khannatanmai (@khannatanmai:matrix.org)
even when people modify the dix to include secondary information, it won't affect people who just use the primary info in their tasks
xavivars
Sorry, my point was that even if people use it, "it looks the same": same format for tags, etc. Except that the concept of secondary (and because of that, optional) information gives huge opportunities
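
To make the "opt-in" and "greppable" points concrete, here is a small Python sketch, assuming the trailing <prefix:value> convention from the 28 March discussion. A single regular expression is enough to drop every secondary tag and get back a stream in the current format. The function name strip_secondary is hypothetical, and the sketch ignores escaping and any "<...:...>" text outside lexical units.

```python
import re

# A secondary tag is any <...> whose contents include a ':'.
SECONDARY_TAG = re.compile(r"<[^<>:]*:[^<>]*>")


def strip_secondary(stream):
    """Remove all secondary tags, leaving primary tags untouched."""
    return SECONDARY_TAG.sub("", stream)


if __name__ == "__main__":
    new_style = "^potato<n><sg><s:potatoes>/patata<n><f><sg><more:other>$ ^be<vbser><pres>$"
    print(strip_secondary(new_style))
```

A stream that never uses secondary tags passes through unchanged, which is the sense in which "nothing really changes" for people who only use the primary information.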

Dynamic Compounding 3 May 2020

heheh :)
khannatanmai, without trimming, boazoguohtunkommišuvdna will be analysed as ^boazoguohtunkommišuvdna<n>$
Maybe there's a smart way to work around it, I haven't looked deeply into it, but it seems like a real challenge.
khannatanmai (@khannatanmai:matrix.org)
wouldnt the monodix identify it as a compound?
Unhammer
no
khannatanmai (@khannatanmai:matrix.org)
I thought boazoguohtunkommišuvdna is analysed as boazoguohtun<xyz>+kommišuvdna<xyz>
(whatever the lemmas are)
Unhammer
only in the untrimmed dix
sorry, trimmed
in the untrimmed, you get boazoguohtunkommišuvdna<n><sem_org><sg><nom>
in the trimmed, you get boazoguohtun<n><cmp_sggen><cmp>+kommišuvdna<n><sem_org><sg><nom>
khannatanmai (@khannatanmai:matrix.org)
Alright but then: boazoguohtunkommišuvdna<n><sem_org><sg><nom> these sort of entries shouldnt be in the monodix anyway right? If the word is a compound why is there an entry analysing it as one LU
In the untrimmed dictionary we have both: 
boazoguohtunkommišuvdna -> boazoguohtunkommišuvdna<n><sem_org><sg><nom>

boazoguohtunkommišuvdna -> boazoguohtun<n><cmp_sggen><cmp>+kommišuvdna<n><sem_org><sg><nom>
right?
trimming trims the first one since the whole compound wont be in the bidix
Unhammer
Dynamic compound analyses are generally more unsafe, so only used if we can't find a full analysis
That's why we include both.
khannatanmai (@khannatanmai:matrix.org)
ah ok I wasnt aware of that
Unhammer
(e.g. on the form "nyrestaurert" (newly restored), I once saw the analysis "nyre+staur+ert" (kidney+stick+pea), until I added newly-restored to the dixen (and turned off dynamic compounding for the rare word "staur", just in case))
khannatanmai (@khannatanmai:matrix.org)
whats dynamic compounding again?
Unhammer
When a form doesn't get a regular analysis (using entries and paradigms etc), it would normally be output as *unknownform. But we can retry by splitting at all points of the string to see if it's analysable as two parts (where each part is analysable in the regular way, but the analyses must have certain special tags that say they're ok to be used in compounds)
If two parts don't work, we try three, four
khannatanmai (@khannatanmai:matrix.org)
okay I didn't even know this existed :p
that's interesting
and now another problem for eliminating trimming
Unhammer
(at least that's how lt-proc does it; in hfst that system is normally encoded in the fst with an arc from final to initial and a flag diacritic to restrict analyses, and a higher weight so they're down-prioritised, the effect is the same)
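
The fallback Unhammer describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not lt-proc's actual implementation: the lexicon, its entries and the two boolean flags (standing in for the "special tags that say they're ok to be used in compounds") are all made up for the example, and analyse is a hypothetical name.

```python
from itertools import combinations

# Toy lexicon (hypothetical data): surface form ->
# (analysis, may be a non-final compound part, may be the final part)
LEXICON = {
    "fotball": ("fotball<n>", True, True),
    "kamp": ("kamp<n>", True, True),
    "og": ("og<cnjcoo>", False, False),
}


def analyse(form, lexicon, max_parts=4):
    """Return a full analysis if one exists, else try dynamic compounding."""
    if form in lexicon:
        # A regular (full) analysis always wins over a compound split.
        return lexicon[form][0]
    for n_parts in range(2, max_parts + 1):  # two parts first, then three, ...
        for cuts in combinations(range(1, len(form)), n_parts - 1):
            bounds = (0, *cuts, len(form))
            parts = [form[a:b] for a, b in zip(bounds, bounds[1:])]
            if not all(p in lexicon for p in parts):
                continue
            non_final_ok = all(lexicon[p][1] for p in parts[:-1])
            final_ok = lexicon[parts[-1]][2]
            if non_final_ok and final_ok:
                return "+".join(lexicon[p][0] for p in parts)
    return "*" + form  # unknown form


if __name__ == "__main__":
    print(analyse("fotballkamp", LEXICON))
    print(analyse("ogkamp", LEXICON))
```

Adding a full entry for the compound to the lexicon makes the first branch fire, so the split is never attempted. That is why the untrimmed dictionary can safely contain both the whole-word entry and the compoundable parts, and why trimming away the whole-word entry changes the analysis.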