User talk:Khannatanmai/GSoC2020Proposal Trimming

From Apertium
Revision as of 07:47, 29 March 2020 by Khannatanmai

khannatanmai: i just can't think of a lot of uses of the surface form for a lot of the programs
TinoDidriksen: You should see how much secondary information VISL's streams have. Noun semantics, verb frames, dependency, markup tags, etc. Being able to carry any information along makes many things possible, often things you can't imagine because of current limitations.
khannatanmai: yeah, fair enough. alright, in-stream it is. I wonder how many parsers will have to be modified for this
popcorndude: probably all of them
TinoDidriksen: Yup
khannatanmai: maybe an interesting exercise will be to make the stream scalable in a sense, so that we can add an arbitrary amount of information and the parsers won't have to be modified
TinoDidriksen: That is what I wanted out of the task. Dunno what tricks would be most useful. Tag prefixes? Delimiters?
TinoDidriksen: I would think prefixes would be most flexible.
popcorndude: ^x<y>/blah<z>/anaphor:arbitrary information$
khannatanmai: or we keep adding more slashes :p
popcorndude: [...] being some separator character, with arbitrary data after it (also separated by /?), which programs can examine or pass along unmodified as they see fit
khannatanmai: also some way to identify what is the data
TinoDidriksen: No, it must be per-reading, because it's tied to each reading.
khannatanmai: instead of just relying on order
popcorndude: {}
TinoDidriksen: Identification is why I say prefixes.
popcorndude: are reserved characters currently only used by t*x
khannatanmai: how about completely overhaul the stream to forget the order and just read data type and data: sl:xyz, stag:abc, anaphor:ghy, sf:xyzabc, in a stream format of course. too much? :p
TinoDidriksen: Well, it has to be human editable.
popcorndude: ^potato<n><sg>{case:aa/other-prefix:other-value}/patata<n><f><sg>{more:other}$
spectei: eww :P
khannatanmai: ^source:x<y>/target:blah<z>/anaphor:jky/s_surface:yadayada/etc.$
TinoDidriksen: That could almost work, though I'd prefer ^potato<n><sg>{case:aa}{other-prefix:other-value}/patata<n><f><sg>{more:other}$ - one segment per datum.
khannatanmai: how's this^
spectei: i'd prefer fixed semantics for slashes. the feat:val is a bit cumbersome
TinoDidriksen: I don't see how you're going to avoid prefixing in Apertium's stream format. We can certainly say that the initial tags are kept as-is, but we need to know where that ends and secondary information begins. And the secondary information needs prefixing or some other identification.
khannatanmai: i was thinking of a way to remove any distinction between primary and secondary information
TinoDidriksen: Prefixing everything is too verbose.
khannatanmai: hmmm. well, the arbit info we'll have to prefix anyway, so yeah, we could have a mixture of syntax: ^potato<n><sg>{case:aa}{other-prefix:other-value}/patata<n><f><sg>{more:other}$ this seems doable but looks ugly?
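For illustration, the brace variant above (initial tags kept as-is, then one {prefix:value} segment per datum, per reading) could be read roughly as follows. This is only a sketch under stated assumptions: `parse_reading` and `parse_lexical_unit` are hypothetical names, not functions of any Apertium tool, and stream escaping and superblanks are ignored.

```python
import re

# Hypothetical sketch, not an Apertium API. Assumes: no escaped characters,
# ordinary <tags> come first, and each {prefix:value} segment holds one datum.
READING_RE = re.compile(
    r"^(?P<lemma>[^<{]*)(?P<tags>(?:<[^<>]*>)*)(?P<extra>(?:\{[^{}]*\})*)$"
)


def parse_reading(reading):
    """Split one reading into (lemma, primary tags, secondary data)."""
    m = READING_RE.match(reading)
    if m is None:
        raise ValueError("unexpected reading layout: " + reading)
    tags = re.findall(r"<([^<>]*)>", m.group("tags"))
    secondary = {}
    for segment in re.findall(r"\{([^{}]*)\}", m.group("extra")):
        # The prefix identifies the datum, so order does not matter.
        prefix, _, value = segment.partition(":")
        secondary[prefix] = value
    return m.group("lemma"), tags, secondary


def parse_lexical_unit(unit):
    """Parse ^reading/reading$ into a list of parsed readings."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("not a lexical unit: " + unit)
    # Naive split: only safe when '/' never appears inside a brace segment.
    return [parse_reading(r) for r in unit[1:-1].split("/")]
```

Because the data is attached to each reading rather than to the unit as a whole, the naive split on / only works with one segment per datum; the earlier {case:aa/other-prefix:other-value} form would need a smarter splitter.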
khannatanmai: :p
TinoDidriksen: Then do ^potato<n><sg><case:aa><other-prefix:other-value>/patata<n><f><sg><more:other>$ - no current tag has :, so tags with : must be secondary. And if they're always trailing, there's much less that can go wrong. Then we can centrally namespace things, with first namespace being s: for surface forms.
khannatanmai: what about multiwords
TinoDidriksen: And later on if we want t: for markup tags, that would just be passed through as-is.
khannatanmai: the arbit info would apply to them as a whole, so if it's in a tag markup you would have to put it on one of the parts or both the parts?
TinoDidriksen: Both parts
khannatanmai: it would give some choice to put arbit info on the parts as well, actually
khannatanmai: ^potato<n><sg>case:aaother-prefix:other-value/patata<n><f><sg>more:other$ this looks good to me. number of tags is already arbitrary, so it works. a word in monodix with arbit info on it would still match the word in bidix which doesn't have the info
popcorndude: getting -recursive to handle that properly might be an adventure
khannatanmai: wait, riot did something i think
popcorndude: but not too terrible
TinoDidriksen: Riot probably just showed it raised because of ^ ?
khannatanmai: it does go against the usual semantics of a tag markup, but we can live with it. Riot changed it into a link or something
khannatanmai: if this looks good I'll put it in the proposal and send a mail to the mailing list
spectei: firespeaker: not sure
khannatanmai: spectei what do you think
spectei: firespeaker my current list of tasks. are there other priority tasks that i'm missing? I saw something about collab support, but i think that is already done
popcorndude: that also solves the issue of encoding rule numbers if we did try having a split mode for -recursive: add <rtx:17> to chunks
TinoDidriksen: It would solve so many issues... I was quite serious when I said this is the granddaddy of fundamental design constraints in Apertium.
khannatanmai: well, sounds fun. I'll send a mail to the mailing list and see what the people have to say. so the project will be overhauling the apertium stream format and eliminating trimming in the process lol
TinoDidriksen: Well, it would render trimming superfluous.
popcorndude: passing such tags through t*x will be troublesome. rtx too, but I think slightly less so
TinoDidriksen: I don't see why - they just get carried through.
khannatanmai: doesn't transfer match the patterns if the tags are in there and ignore the extra tags?
popcorndude: I'm getting mixed up about a couple of different things in t*x. t*x is probably fine
TinoDidriksen: If the token is multiplied, the input tags are also. If deleted, ditto.
popcorndude: rtx usually doesn't just pass things through
khannatanmai: wow, you could actually do a LOT of things once you can have arbitrary info
TinoDidriksen: Yup, it's extremely important.
popcorndude: rtx actually builds new tokens by grabbing pieces of the input, so actually it's just a matter of having it grab the arbitrary tags too
popcorndude: it's largely a matter of tweaking parsing loops so that the code that checks for <> also checks for : and, if found, either looks for prefixes or just copies following tags into some buffer for output
khannatanmai: yeah, seems doable
TinoDidriksen: It's absolutely a lot of work, but it is doable.
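As a rough illustration of the colon-tag variant and the parsing-loop tweak described above: the loop that collects <> tags also checks for : and copies such tags into a separate buffer, which is then carried through unchanged when a token is rebuilt. This is a minimal sketch, assuming secondary tags are always trailing and ignoring escaping; `split_tags`, `matches` and `retag` are hypothetical names, not functions of lttoolbox, t*x or rtx.

```python
import re

TAG_RE = re.compile(r"<([^<>]*)>")


def split_tags(reading):
    """Split one reading into (lemma, primary tags, secondary tags).

    Assumption from the discussion: no existing tag contains ':',
    so any tag with ':' is secondary, and secondary tags are trailing.
    """
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], []
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            secondary.append(tag)  # copied into a buffer for output
        elif secondary:
            raise ValueError("primary tag after secondary tag: " + tag)
        else:
            primary.append(tag)
    return lemma, primary, secondary


def matches(reading, lemma, tags):
    """Match on lemma and primary tags only, ignoring secondary tags."""
    found_lemma, primary, _ = split_tags(reading)
    return found_lemma == lemma and primary == tags


def retag(reading, new_primary):
    """Stand-in for a transfer-like step: replace the primary tags
    and carry the secondary tags through untouched."""
    lemma, _, secondary = split_tags(reading)
    return lemma + "".join("<%s>" % t for t in new_primary + secondary)
```

With this split, an entry carrying extra information still compares equal to one without it, since only the primary tags take part in matching, and a tool that does not understand a given namespace can still pass it along as-is.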