User talk:Khannatanmai/GSoC2020Proposal Trimming

khannatanmai
i just can't think of many uses of the surface form for a lot of the programs
TinoDidriksen
You should see how much secondary information VISL's streams have. Noun semantics, verb frames, dependency, markup tags, etc. Being able to carry any information along makes many things possible, often things you can't imagine because of current limitations.
khannatanmai (@khannatanmai:matrix.org)
yeah fair enough
alright in-stream it is
I wonder how many parsers will have to be modified for this
popcorndude
probably all of them
TinoDidriksen
Yup
khannatanmai (@khannatanmai:matrix.org)
maybe an interesting exercise would be to make the stream scalable, in the sense that we can add an arbitrary amount of information and the parsers won't have to be modified
TinoDidriksen
That is what I wanted out of the task.
Dunno what tricks would be most useful. Tag prefixes? Delimiters?
TinoDidriksen
I would think prefixes would be most flexible.
popcorndude
^x<y>/blah<z>/anaphor<q>:arbitrary information$
khannatanmai (@khannatanmai:matrix.org)
or we keep adding more slashes :p
popcorndude
: being some separator character
with arbitrary data after it (also separated by /?) which programs can examine or pass along unmodified as they see fit
khannatanmai (@khannatanmai:matrix.org)
also some way to identify what is the data
TinoDidriksen
No, it must be per-reading, because it's tied to each reading.
khannatanmai (@khannatanmai:matrix.org)
instead of just relying on order
popcorndude
{}
TinoDidriksen
Identification is why I say prefixes.
popcorndude
are reserved characters currently only used by t*x
khannatanmai (@khannatanmai:matrix.org)
how about completely overhaul the stream to forget the order and just read data type and data
sl:xyz, stag:abc, anaphor:ghy, sf:xyzabc
in a stream format of course
too much? :p
TinoDidriksen
Well, it has to be human editable.
popcorndude
^potato<n><sg>{case:aa/other-prefix:other-value}/patata<n><f><sg>{more:other}$
spectei
eww :P
khannatanmai (@khannatanmai:matrix.org)
^source:x<y>/target:blah<z>/anaphor:jky<q>/s_surface:yadayada/etc.$
TinoDidriksen
That could almost work, though I'd prefer ^potato<n><sg>{case:aa}{other-prefix:other-value}/patata<n><f><sg>{more:other}$ - one segment per datum.
khannatanmai (@khannatanmai:matrix.org)
how's this^
spectei
i'd prefer fixed semantics for slashes
the feat:val is a bit cumbersome
TinoDidriksen
I don't see how you're going to avoid prefixing in Apertium's stream format. We can certainly say that the initial tags are kept as-is, but need to know where that ends and secondary information begins.
And the secondary information needs prefixing or some other identification.
khannatanmai (@khannatanmai:matrix.org)
i was thinking of a way to remove any distinction between primary and secondary information
TinoDidriksen
Prefixing everything is too verbose.
khannatanmai (@khannatanmai:matrix.org)
hmmm. well the arbit info we'll have to prefix anyway so yeah we could have a mixture of syntax
^potato<n><sg>{case:aa}{other-prefix:other-value}/patata<n><f><sg>{more:other}$ this seems doable
but looks ugly? :p 
TinoDidriksen
Then do ^potato<n><sg><case:aa><other-prefix:other-value>/patata<n><f><sg><more:other>$ - no current tag has :, so tags with : must be secondary.
And if they're always trailing, there's much less that can go wrong.
Then we can centrally namespace things, with first namespace being s: for surface forms.
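A minimal sketch of the convention proposed above, assuming only what the messages state: a tag containing `:` is secondary, and everything else is a primary tag. The function name `split_reading` is hypothetical and is not part of any Apertium tool; escaping and repeated prefixes are deliberately not handled.

```python
import re

# Matches one <...> tag and captures its contents.
TAG_RE = re.compile(r"<([^<>]*)>")

def split_reading(reading):
    """Split one reading, e.g. 'potato<n><sg><case:aa>', into
    (lemma, primary_tags, secondary_info)."""
    first = reading.find("<")
    lemma = reading if first == -1 else reading[:first]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            # Assumption: the text before the first ':' is the namespace prefix.
            prefix, _, value = tag.partition(":")
            secondary[prefix] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary

print(split_reading("potato<n><sg><case:aa><other-prefix:other-value>"))
```

A program that does not care about a given prefix could ignore the dictionary entirely and write the tags back out unchanged.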
khannatanmai (@khannatanmai:matrix.org)
what about multiwords
TinoDidriksen
And later on if we want t: for markup tags, that would just be passed through as-is.
khannatanmai (@khannatanmai:matrix.org)
the arbit info would apply to them as a whole, so if it's in a tag markup you would have to put it on one of the parts
or both the parts?
TinoDidriksen
Both parts
khannatanmai (@khannatanmai:matrix.org)
it would give some choice to put arbit info on the parts as well actually
khannatanmai (@khannatanmai:matrix.org)
^potato<n><sg><case:aa><other-prefix:other-value>/patata<n><f><sg><more:other>$ this looks good to me. number of tags is already arbitrary so it works.
a word in monodix with arbit info on it would still match the word in bidix which doesnt have the info
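One way to picture the matching described here, as a rough sketch only: strip the `:` tags before the lookup and re-attach them afterwards. The dictionary `bidix` below is a plain Python dict standing in for a real bilingual dictionary, and `strip_secondary`, `secondary_tags` and `translate` are hypothetical names, not lttoolbox functions.

```python
import re

# A secondary tag is assumed to be any <...> tag that contains ':'.
SECONDARY_RE = re.compile(r"<[^<>:]*:[^<>]*>")

def strip_secondary(reading):
    """Remove secondary tags so the reading can match an unannotated entry."""
    return SECONDARY_RE.sub("", reading)

def secondary_tags(reading):
    """Return the secondary tags of a reading, concatenated in order."""
    return "".join(SECONDARY_RE.findall(reading))

def translate(reading, bidix):
    """Look up the reading without its secondary tags, then carry them over."""
    target = bidix.get(strip_secondary(reading))
    if target is None:
        return None
    return target + secondary_tags(reading)

bidix = {"potato<n><sg>": "patata<n><f><sg>"}  # toy stand-in for a bidix
print(translate("potato<n><sg><case:aa>", bidix))
```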
popcorndude
getting -recursive to handle that properly might be an adventure
khannatanmai (@khannatanmai:matrix.org)
wait riot did something i think
popcorndude
but not too terrible
TinoDidriksen
Riot probably just showed it raised because of ^ ?
khannatanmai (@khannatanmai:matrix.org)
it does go against the usual semantics of a tag markup but we can live with it
Riot changed it into a link or something
khannatanmai (@khannatanmai:matrix.org)
if this looks good I'll put it in the proposal and send a mail to the mailing list
spectei
firespeaker: not sure
khannatanmai (@khannatanmai:matrix.org)
spectei what do you think
spectei: firespeaker my current list of tasks
are there other priority tasks that i'm missing?
I saw something about collab support, but i think that is already done
popcorndude
that also solves the issue of encoding rule numbers if we did try having a split mode for -recursive: add <rtx:17> to chunks
TinoDidriksen
It would solve so many issues...
I was quite serious when I said this is the granddaddy of fundamental design constraints in Apertium.
khannatanmai (@khannatanmai:matrix.org)
well sounds fun. I'll send a mail to the mailing list and see what the people have to say
so the project will be overhauling the apertium stream format and eliminating trimming in the process lol
TinoDidriksen
Well, it would render trimming superfluous.
popcorndude
passing such tags through t*x will be troublesome
rtx too, but I think slightly less so
TinoDidriksen
I don't see why - they just get carried through.
khannatanmai (@khannatanmai:matrix.org)
doesnt transfer match the patterns if the tags are in there and ignore the extra tags?
popcorndude
I'm getting mixed up about a couple of different things in t*x
t*x is probably fine
TinoDidriksen
If the token is multiplied, the input tags are also. If deleted, ditto.
popcorndude
rtx usually doesn't just pass things through
khannatanmai (@khannatanmai:matrix.org)
wow you could actually do a LOT of things once you can have arbitrary info
TinoDidriksen
Yup, it's extremely important.
popcorndude
rtx actually builds new tokens by grabbing pieces of the input, so actually it's just a matter of having it grab the arbitrary tags too
popcorndude
it's largely a matter of tweaking parsing loops so that the code that checks for <> also checks for : and if found either looks for prefixes or just copies following tags into some buffer for output
khannatanmai (@khannatanmai:matrix.org)
yeah seems doable
TinoDidriksen
It's absolutely a lot of work, but it is doable.
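The parsing-loop tweak popcorndude describes can be sketched as a character loop that, on finding a tag, checks for `:` and copies such tags into a buffer for output. This is an illustration under stated assumptions, not code from t*x or rtx: `scan_tags` is a hypothetical name, backslash is assumed to escape the next character, and anything outside a tag is treated as part of the lemma.

```python
def scan_tags(reading):
    """Return (lemma, primary_tags, passthrough) for one reading.

    Tags containing ':' are not interpreted; they are copied verbatim
    into a pass-through buffer so they can be written back out."""
    lemma, primary, buffer = [], [], []
    i, n = 0, len(reading)
    while i < n:
        ch = reading[i]
        if ch == "\\" and i + 1 < n:
            # Keep an escaped character (e.g. '\<') as literal lemma text.
            lemma.append(reading[i:i + 2])
            i += 2
        elif ch == "<":
            end = reading.find(">", i)
            if end == -1:
                raise ValueError("unterminated tag in %r" % reading)
            tag = reading[i + 1:end]
            if ":" in tag:
                buffer.append("<" + tag + ">")  # secondary: carry along
            else:
                primary.append(tag)             # ordinary tag: parse as before
            i = end + 1
        else:
            lemma.append(ch)
            i += 1
    return "".join(lemma), primary, "".join(buffer)

lemma, tags, extra = scan_tags("potato<n><sg><rtx:17>")
# Rebuild the reading with the untouched secondary tags trailing.
print(lemma + "".join("<%s>" % t for t in tags) + extra)
```

Keeping the secondary tags as an opaque string means a program can carry prefixes it has never heard of, which is the property asked for earlier in the discussion.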