Difference between revisions of "User talk:Khannatanmai/New Apertium stream format"
Jump to navigation
Jump to search
Khannatanmai (talk | contribs) |
Khannatanmai (talk | contribs) (Prefix format in Secondary Tags) |
||
Line 174: | Line 174: | ||
yeah that could work |
yeah that could work |
||
cool done then |
cool done then |
||
</pre> |
|||
Prefix format in Secondary Tags |
|||
<pre> |
|||
spectei |
|||
(prefix matching) |
|||
but for pattern matching they should be matchable |
|||
also, i think it would be better if they just had a single char prefix like syntax or semantic tags |
|||
i don't like embedding feature=value in apertium tags |
|||
TinoDidriksen |
|||
It's not feature=value |
|||
khannatanmai (@khannatanmai:matrix.org) |
|||
whats the difference between prefix matching and pattern matching? |
|||
spectei |
|||
so like <n><sg><gen><@subj><§agent><:human> |
|||
for the sf if people insist on it it could be with % |
|||
TinoDidriksen |
|||
No, because then we have to codify ALL possible secondary tags now, which we can't do. |
|||
So that's impossible. |
|||
spectei |
|||
so like <n><sg><gen><%отца><@subj><§agent><:human> |
|||
TinoDidriksen: no we wouldn't have to |
|||
: would just indicate secondary tag |
|||
and then it's up to the language (pair) person to decide how to use it |
|||
TinoDidriksen |
|||
: does indicate secondary tag... |
|||
spectei |
|||
without the prefix |
|||
khannatanmai (@khannatanmai:matrix.org) |
|||
how would you distinguish different types of secondary tags though? |
|||
spectei |
|||
khannatanmai: in general you woudn't need to |
|||
TinoDidriksen |
|||
Then you just need an extra infix delimiter. |
|||
spectei |
|||
for the surface form thing you can use a different symbol prefix |
|||
e.g. % |
|||
<n><sg><gen><%отца><@subj><§agent><:human><:kin> |
|||
TinoDidriksen |
|||
And for markup tags? <t:a> <t:span> etc is nice and short. |
|||
spectei |
|||
i wouldn't use this for markup tags, but if necessary you could just put them with some other symbol |
|||
TinoDidriksen |
|||
Well, <t:a:1> <t:span:5> since we'd want unique IDs. |
|||
spectei |
|||
like ! |
|||
<n><sg><gen><%отца><@subj><§agent><:human><:kin><!a><!span> |
|||
if unique ids: |
|||
<n><sg><gen><%отца><@subj><§agent><:human><:kin><!aef31><!fcd32> |
|||
khannatanmai (@khannatanmai:matrix.org) |
|||
overall isnt this also just using symbols as features? what's the benefit - given that we still have to make FSTs ignore them |
|||
TinoDidriksen |
|||
There's no benefit. It's exactly just using symbols as namespaces. |
|||
Except it's harder to read. |
|||
spectei |
|||
khannatanmai: you don't |
|||
you can just say that any tag that starts with [^a-z1-9] is a special tag |
|||
it's easier to read |
|||
TinoDidriksen |
|||
... |
|||
spectei |
|||
<foo:bar> is horrible |
|||
oops [^a-zA-Z0-9] |
|||
TinoDidriksen |
|||
What we do is say any tag that contains : is special. |
|||
khannatanmai (@khannatanmai:matrix.org) |
|||
foo:bar is horrible. |
|||
just so I'm on the same page. why though? |
|||
TinoDidriksen |
|||
I would say %bar !a is horrible. It's so much more noisy. |
|||
spectei |
|||
<n><sg><gen><sf:отца><syn:subj><role:agent><sem:human><sem:kin><t:a:1><t:span:2> |
|||
vs. |
|||
<n><sg><gen><%отца><@subj><§agent><:human><:kin><!aef31><!fcd32> |
|||
TinoDidriksen |
|||
Yes, : is much nicer to read. |
|||
spectei |
|||
i disagree |
|||
i can barely read it |
|||
khannatanmai: it's hard to read... it's hard to filter out the prefixes from the values |
|||
it makes the tag streams longer |
|||
it's "ugly" in an aesthetic sense |
|||
khannatanmai (@khannatanmai:matrix.org) |
|||
I understand what you mean, but for someone reading the other stream will have to know what each of the symbols mean |
|||
spectei |
|||
hardly |
|||
you can tell from the values |
|||
and memorising three symbols is easy |
|||
TinoDidriksen |
|||
It won't be three. |
|||
We'll run out of symbol prefixes. |
|||
spectei |
|||
i don't see why |
|||
if we run out of reasonable symbol prefixes, the whole thing needs to be redesigned |
|||
TinoDidriksen |
|||
But with : tags, it won't. It'll all just work, regardless of what people do with it. |
|||
Future proofing. |
|||
spectei |
|||
if it gets to that people |
|||
to that point |
|||
people can do :sf_ ... whatever |
|||
but i don't want to encourage ugliness |
|||
khannatanmai: you can always make the two systems optional and let people choose |
|||
TinoDidriksen |
|||
My whole point is to make a system NOW that will work for all future needs, and : will do that. |
|||
khannatanmai (@khannatanmai:matrix.org) |
|||
<n><sg><gen><%:отца><@:subj><§:agent> you can even do this if needed in the current solution |
|||
spectei |
|||
the difference essentially amounts to bikeshedding |
|||
TinoDidriksen |
|||
It's not bikeshedding. It's a fundamental implementation difference. |
|||
spectei |
|||
TinoDidriksen: the implementation is making the tags epsilons |
|||
khannatanmai (@khannatanmai:matrix.org) |
|||
<n><sg><gen><%:отца><@:subj><§:agent> you can even do this if needed in the current solution |
|||
this sort of thing would in a way make it optional? |
|||
spectei |
|||
the design part is choosing which tags to make |
|||
khannatanmai: i don't hate that as much |
|||
TinoDidriksen |
|||
With :, we can once and for all say : tags are handled in this case. Without :, we need to codify all prefixes. |
|||
spectei |
|||
<n><sg><gen><%:отца><@:subj><§:agent><:human><:kin><!:aef31><!:fcd32> |
|||
khannatanmai: so long as the prefix could be null |
|||
TinoDidriksen: wrong |
|||
khannatanmai (@khannatanmai:matrix.org) |
|||
of course, I hadn't planned to give the user an option as to what the secondary tags prefixes are going to be |
|||
TinoDidriksen |
|||
Sure, null prefix should be fine. My definition would just be any tag containing : is secondary. |
|||
</pre> |
</pre> |
Latest revision as of 17:44, 8 May 2020
Implementation of querying of secondary tags (1 April 2020)
[07:56:09] <TinoDidriksen> khannatanmai, there's also _multimap, but order will matter. At minimum, the input order must be preserved, and eventually people will want to query these things where order matters. [07:57:09] <TinoDidriksen> That was one of the fundamental design flaws I made in CG-3. I figured that surely tag order wouldn't matter in queries, but there are uses for it. [07:58:32] <khannatanmai> yeah true [08:00:49] <khannatanmai> apparently LinkedListMultimap preserves insertion order [08:02:01] <khannatanmai> and for the same key I think it already preserves insertion order [08:04:09] <TinoDidriksen> I haven't fully fixed CG-3's storage yet, but what I had in mind was unordered_multimap<Tag,size_t> so that an existence check can still be done as O(1) hash lookup, but input is preserved in the size_t. [08:05:59] <TinoDidriksen> Have to preserve input order even when groups are mixed. <t:a><y:b><t:c> [08:06:37] <khannatanmai> don't you think even a normal map would work for an existence check? unless ofc if existence check could also mean existence of multiple duplicate keys [08:07:14] <khannatanmai> oh wait you're talking about putting the whole tag in the map? [08:07:33] <TinoDidriksen> For this number of tags, sure. A flat map would probably be best performance. [08:07:54] <TinoDidriksen> As a rule, don't use linked lists for anything. They're just bad. [08:11:12] <khannatanmai> what exactly do you mean by "input is preserved in the size_t" ? [08:11:24] <khannatanmai> and in your map, Tag is the whole tag or just the prefix [08:11:43] <TinoDidriksen> Whole tag, mapped to input position. [08:12:01] <khannatanmai> ah input position, alright [08:12:30] <khannatanmai> how about an array to preserve prefix order and an unordered multimap to map prefix to values [08:12:35] <TinoDidriksen> So <t:a><y:b><t:c> would be stored as t:a -> 1, t:c ->3, y:b -> 2 [08:12:51] <TinoDidriksen> Not prefix order. Tag order. [08:13:46] <khannatanmai> if you preserve prefix order, and for duplicate prefixes you preserve the order of values, that amounts to preserving tag order right [08:13:51] <TinoDidriksen> Nope [08:14:48] <khannatanmai> So <t:a><y:b><t:c> would become [t,y,t], [t->a,c ; y->b] [08:15:05] <TinoDidriksen> How do you get <t:a><y:b><t:c> back out from that? [08:15:15] <TinoDidriksen> Oh, like that. [08:15:27] <TinoDidriksen> Sure, that could work. [08:16:27] <khannatanmai> the existence check would still come from the unordered_map, plus the usual benefit of getting value from key [08:17:24] <TinoDidriksen> Existence would be for the whole tag. [08:17:49] <khannatanmai> oh I was thinking it as checking for existence of a prefix [08:17:59] <khannatanmai> *of it [08:18:40] <khannatanmai> i.e. is the surface form in this LU [08:19:31] <TinoDidriksen> Sure, that's also useful. [08:20:04] *** Joins: Weizhe (~Weizhe@117.136.107.66) [08:20:19] <khannatanmai> where would one check for the existence of a whole tag? (im assuming we're only talking about the secondary tags) [08:21:01] <khannatanmai> so the check would be for a predix-value pair, yeah I can see that [08:21:15] <khannatanmai> *prefix [08:21:35] <TinoDidriksen> I'm thinking in CG terms. You usually query whole tags. [08:22:18] <khannatanmai> Do CG tags have a feature-value pair? [08:22:22] <TinoDidriksen> But for many tools it would be just passing along a string with all secondary information as-is, since they don't need to query it. [08:24:02] <khannatanmai> yup [08:24:03] <TinoDidriksen> CG is pretty flexible. A tag is anything you want it to be. Some patterns are further blessed with extra behavior. [08:24:52] <khannatanmai> so when we talk about querying here, we're only talking about secondary info right? cause don't want to mess with how the programs already use the primary tags [08:28:20] <TinoDidriksen> Yup. I don't see how that would change in either case. [08:28:35] <TinoDidriksen> If people ask for t:.* it's pretty clear what they want. [08:30:23] <khannatanmai> yeah I just meant it seems like everything can be done by querying prefixes and getting values, rather than querying whole tags. even querying whole tags could be divided into querying prefix and checking value [08:30:58] <khannatanmai> I mean I guess this is querying whole tags in a way [08:31:26] <TinoDidriksen> Certainly. I just don't see what you gain by splitting the storage. To me, that says someone would want to just look at values, which makes no sense in my head. [08:33:13] <khannatanmai> the values is where the information is right? the prefixes will be known beforehand. i.e. i enter a prefix and it gives me all the values associated to it, and then i use them for whatever purpose [08:33:57] <TinoDidriksen> Yeah, that could be something people want to do... [08:38:48] <TinoDidriksen> Well, now I'm thinking of efficient storage. flat_multimap<Tag,size_t>, where Tag is {string_view prefix, string_view value} - string_view into some larger shared storage to prevent multiple allocations. That would be amortized 1 allocation per reading, which is hard to beat. [08:48:13] <khannatanmai> and I'll be able to query this by saying something like my tag is {t, .*} ? [08:51:40] <TinoDidriksen> lower_bound() for {t, null} would find the first tag that has prefix t in O(log n) time, but given we assume there's few of these and flat storage, this will be a very low constant. [08:52:20] <TinoDidriksen> Then increment the iterator until prefix no longer matches. [08:54:12] <khannatanmai> the reason I was proposing separate storage, is that if I use an array to preserve position, and a multimap to map prefix->value, I don't need to use the array if I don't care about the position, and the lookup will be really fast [08:54:34] <khannatanmai> so if I just want the surface form, I'll just look for values for the 'sf' key in the multimap, get them and be done with it [08:55:03] <khannatanmai> this is of course assuming that order is something not everyone would want, or that it would be less needed than just a value query [08:55:35] <TinoDidriksen> Almost nobody will want the order, but it must be preserved and available. [08:56:15] <khannatanmai> yeah that's why the array is there, to preserve it and reconstruct tag order if needed [08:57:03] <khannatanmai> I just think if most of the use is going to be getting values given prefix, then that's what it should make the fastest [08:59:13] <TinoDidriksen> Yeah, but I'm sure sure this is where the real world wins over algorithmic perfection. Split storage would on paper be faster, but flat map would in reality be faster. CPU caches and indirection makes a big difference. But it should be pretty easy to work out, both on paper and with a quick test program. [09:00:26] <TinoDidriksen> "sure sure"? Pretty sure, is what I wanted to write. [09:02:47] <khannatanmai> Yeah fair enough. I don’t know enough about caches and indirection to know this but yeah running some tests might be worthwhile, just to show other people why we went one way
GENERATION ISSUES WITH SECONDARY TAGS (13 April 2020)
it's not able to generate the surface forms I was of the opinion that if the monodix contained: xyz -> xy<a><b><c>, then during generation, xy<a><b><c><z> would generate xyz am I wrong? or is this a different issue TinoDidriksen I thought that would work as well... khannatanmai (@khannatanmai:matrix.org) yeah I tested it a bit more this isn't a problem with the secondary tags apparently a generator needs an exact match TinoDidriksen Well, it's the generator that needs to consume most of these tags anyway, so fine. khannatanmai (@khannatanmai:matrix.org) interestingly the generator debug mode doesnt print the secondary tags. weird. but it's definitely a more fundamental problem. I think I need to go through the generator documentation a bit more I guess the bidix has a special functionality to let through underspecified analyses? which is why bidix just copies the secondary tags on to the tl side TinoDidriksen Which makes sense. Usually you don't need much to disambiguate, so why force people to write out all tags. But I don't see why the generator doesn't shortcut in the same way. Is there a cmdline flag to make it do that? Not that it matters. The generator absolutely needs modification in any case. khannatanmai (@khannatanmai:matrix.org) can't see a helpful flag for lt-proc. I'm just thinking about why it's like this TinoDidriksen I can see why it would reject underspecified forms, but not overspecified. It should just print when it reaches a terminal and discard the remainder. But again, I assume it's an artifact of how FSTs work...they're not very flexible. khannatanmai (@khannatanmai:matrix.org) if my bidix gives xyz<n><f> in the tl, and my monodix has xyzes -> xyz<n>, you'd still want xyzes right yeah underspecified would make sense ofc well it cant just reach a terminal and discard because if xyz<n> and xyz<n><f>, both exist in the monodix then it would never reach the latter TinoDidriksen A final terminal, then. khannatanmai (@khannatanmai:matrix.org) yeah if it's reached a terminal stage and then later the input ends and no new terminal stage comes then it should use the last terminal im just trying to think if someone decided it's better to have just the lemma instead of trying to just assign something to an overspecified form like I have dog->dog<n> in my monodix, and I get dog<n><pl> from the bidix lookup, and then since idk what the plural form is I'll just give a generation error sort of makes sense I guess? TinoDidriksen For development, sure. It should be possible to detect these issues. But for production, you probably want the most generation you can get, potentially with a mark to show it's incomplete. khannatanmai (@khannatanmai:matrix.org) incomplete often means super incorrect though. In a way they did do that by showing the lemma and a # to show we didnt find an exact match in the dictionary I'm just wondering if a better way is to have a way to ignore secondary tags when matching, rather than modifying the generator to match overspecified forms unless if the dix has secondary tags as well (for the future), then it shouldnt ignore them while matching TinoDidriksen The generator has to use the secondary tags. It is what needs to deal with passed through surface forms and markup tags. khannatanmai (@khannatanmai:matrix.org) yeah it should definitely use them, I'm just saying it shouldn't use them while matching dictionary forms anyway if we don't have secondary tags in the dictionary right now, the secondary tags aren't gonna play a part in the matching. in terms of trimming, the bidix wont have a tl output for the word so this wont be an issue, and for markup tags, the matching with surface form still happens independent of them firespeaker [19:18:00] <khannatanmai> I was of the opinion that if the monodix contained: xyz -> xy<a><b><c>, then during generation, xy<a><b><c><z> would generate xyz I don't understand this at all if you don't have xyz:xy<a><b><c><z> in the transducer, then xy<a><b><c><z> won't generate xyz this is just how transducers work khannatanmai yeah but the bidix is able to do it if you have xy<a>/uv<b> in a bidix and you give it xy<a><y> it will give you uv<b><y> and I understand it's just matching and copying the remainder tags firespeaker the bidix is matching substrings yeah exactly if you could do what you're trying to do then when you generate dog<n><pl>, then you would get dog and dogs (if your singular is dog:dog<n>) it would be terrible for agglutinating languages khannatanmai I mean if there's a match for the more specified form then ofcourse you would ignore the other matches sort of like LRLM firespeaker ат<n><pl><px1sg><dat> would generate dozens of forms hm, I mean, I could see something that works that way, but that's just no how FST processors work Flammie might have further thoughts khannatanmai you could get all the matches and just generate the one that generated with the longest string match But of course that would mean dog<n><pl> would generate "dog", if we only have dog:dog<n> in the monodix, instead of "#dog" which is why another solution is to not use secondary tags in the transducer so we have: make it an lrlm sort of thing and get all possible matches but generate only the longest match ignore secondary tags while doing the matching firespeaker I believe you said you were going to put secondary tags in the monodix? either that, or they're not something relevant to analysis and generation stages khannatanmai in the future we could. if we do that then we cant just ignore them yes maybe we could implement something like, if the current possible paths in the transducer contain a secondary tag, then don't ignore the secondary tag, otherwise ignore but im not sure how one would do that I wanna say for now we can say that we dont want to add secondary tags to the dixes, but a solution that deals with something like that might be nice or we can just all decide we dont wanna add secondary tags to the dixes firespeaker (16:27:55) khannatanmai: but im not sure how one would do that sounds like you'd need to change lt-proc and hfst-proc's algorithms khannatanmai: I think it's safe to ignore for now with the caveat that we may want to do something with it at some point (16:08:25) khannatanmai: how's lockdown with kids going ugh so much fun lol khannatanmai even if we ignore them we'll have to change the algorithms a bit though. But yeah not as much as the other thing. So I guess for now we'll assume we don't have secondary tags in the dictionaries TinoDidriksen? firespeaker haha enjoy lol TinoDidriksen Definitely assume there are no secondary tags in the generator. If they were needed there, they'd be primary. khannatanmai when I was thinking of ways in which secondary tags could be useful, I thought you could use them to distinguish close senses of a word by putting that in the dictionary but yeah I guess you could make primary tags for that TinoDidriksen They can be in the analyser monodix and be used throughout the pipe. khannatanmai the monodix is the same tho TinoDidriksen I know khannatanmai by restricting the direction? yeah that could work cool done then
Prefix format in Secondary Tags
spectei (prefix matching) but for pattern matching they should be matchable also, i think it would be better if they just had a single char prefix like syntax or semantic tags i don't like embedding feature=value in apertium tags TinoDidriksen It's not feature=value khannatanmai (@khannatanmai:matrix.org) whats the difference between prefix matching and pattern matching? spectei so like <n><sg><gen><@subj><§agent><:human> for the sf if people insist on it it could be with % TinoDidriksen No, because then we have to codify ALL possible secondary tags now, which we can't do. So that's impossible. spectei so like <n><sg><gen><%отца><@subj><§agent><:human> TinoDidriksen: no we wouldn't have to : would just indicate secondary tag and then it's up to the language (pair) person to decide how to use it TinoDidriksen : does indicate secondary tag... spectei without the prefix khannatanmai (@khannatanmai:matrix.org) how would you distinguish different types of secondary tags though? spectei khannatanmai: in general you woudn't need to TinoDidriksen Then you just need an extra infix delimiter. spectei for the surface form thing you can use a different symbol prefix e.g. % <n><sg><gen><%отца><@subj><§agent><:human><:kin> TinoDidriksen And for markup tags? <t:a> <t:span> etc is nice and short. spectei i wouldn't use this for markup tags, but if necessary you could just put them with some other symbol TinoDidriksen Well, <t:a:1> <t:span:5> since we'd want unique IDs. spectei like ! <n><sg><gen><%отца><@subj><§agent><:human><:kin><!a><!span> if unique ids: <n><sg><gen><%отца><@subj><§agent><:human><:kin><!aef31><!fcd32> khannatanmai (@khannatanmai:matrix.org) overall isnt this also just using symbols as features? what's the benefit - given that we still have to make FSTs ignore them TinoDidriksen There's no benefit. It's exactly just using symbols as namespaces. Except it's harder to read. spectei khannatanmai: you don't you can just say that any tag that starts with [^a-z1-9] is a special tag it's easier to read TinoDidriksen ... spectei <foo:bar> is horrible oops [^a-zA-Z0-9] TinoDidriksen What we do is say any tag that contains : is special. khannatanmai (@khannatanmai:matrix.org) foo:bar is horrible. just so I'm on the same page. why though? TinoDidriksen I would say %bar !a is horrible. It's so much more noisy. spectei <n><sg><gen><sf:отца><syn:subj><role:agent><sem:human><sem:kin><t:a:1><t:span:2> vs. <n><sg><gen><%отца><@subj><§agent><:human><:kin><!aef31><!fcd32> TinoDidriksen Yes, : is much nicer to read. spectei i disagree i can barely read it khannatanmai: it's hard to read... it's hard to filter out the prefixes from the values it makes the tag streams longer it's "ugly" in an aesthetic sense khannatanmai (@khannatanmai:matrix.org) I understand what you mean, but for someone reading the other stream will have to know what each of the symbols mean spectei hardly you can tell from the values and memorising three symbols is easy TinoDidriksen It won't be three. We'll run out of symbol prefixes. spectei i don't see why if we run out of reasonable symbol prefixes, the whole thing needs to be redesigned TinoDidriksen But with : tags, it won't. It'll all just work, regardless of what people do with it. Future proofing. spectei if it gets to that people to that point people can do :sf_ ... whatever but i don't want to encourage ugliness khannatanmai: you can always make the two systems optional and let people choose TinoDidriksen My whole point is to make a system NOW that will work for all future needs, and : will do that. khannatanmai (@khannatanmai:matrix.org) <n><sg><gen><%:отца><@:subj><§:agent> you can even do this if needed in the current solution spectei the difference essentially amounts to bikeshedding TinoDidriksen It's not bikeshedding. It's a fundamental implementation difference. spectei TinoDidriksen: the implementation is making the tags epsilons khannatanmai (@khannatanmai:matrix.org) <n><sg><gen><%:отца><@:subj><§:agent> you can even do this if needed in the current solution this sort of thing would in a way make it optional? spectei the design part is choosing which tags to make khannatanmai: i don't hate that as much TinoDidriksen With :, we can once and for all say : tags are handled in this case. Without :, we need to codify all prefixes. spectei <n><sg><gen><%:отца><@:subj><§:agent><:human><:kin><!:aef31><!:fcd32> khannatanmai: so long as the prefix could be null TinoDidriksen: wrong khannatanmai (@khannatanmai:matrix.org) of course, I hadn't planned to give the user an option as to what the secondary tags prefixes are going to be TinoDidriksen Sure, null prefix should be fine. My definition would just be any tag containing : is secondary.