Difference between revisions of "User talk:Khannatanmai/New Apertium stream format"

From Apertium
Jump to navigation Jump to search
(how to implement the querying of secondary tags)
(No difference)

Revision as of 09:31, 1 April 2020

Implementation of querying of secondary tags (1 April 2020)

[07:56:09] <TinoDidriksen> khannatanmai, there's also _multimap, but order will matter. At minimum, the input order must be preserved, and eventually people will want to query these things where order matters. [07:57:09] <TinoDidriksen> That was one of the fundamental design flaws I made in CG-3. I figured that surely tag order wouldn't matter in queries, but there are uses for it. [07:58:32] <khannatanmai> yeah true [08:00:49] <khannatanmai> apparently LinkedListMultimap preserves insertion order [08:02:01] <khannatanmai> and for the same key I think it already preserves insertion order [08:04:09] <TinoDidriksen> I haven't fully fixed CG-3's storage yet, but what I had in mind was unordered_multimap<Tag,size_t> so that an existence check can still be done as O(1) hash lookup, but input is preserved in the size_t. [08:05:59] <TinoDidriksen> Have to preserve input order even when groups are mixed. <t:a><y:b><t:c> [08:06:37] <khannatanmai> don't you think even a normal map would work for an existence check? unless ofc if existence check could also mean existence of multiple duplicate keys [08:07:14] <khannatanmai> oh wait you're talking about putting the whole tag in the map? [08:07:33] <TinoDidriksen> For this number of tags, sure. A flat map would probably be best performance. [08:07:54] <TinoDidriksen> As a rule, don't use linked lists for anything. They're just bad. [08:11:12] <khannatanmai> what exactly do you mean by "input is preserved in the size_t" ? [08:11:24] <khannatanmai> and in your map, Tag is the whole tag or just the prefix [08:11:43] <TinoDidriksen> Whole tag, mapped to input position. [08:12:01] <khannatanmai> ah input position, alright [08:12:30] <khannatanmai> how about an array to preserve prefix order and an unordered multimap to map prefix to values [08:12:35] <TinoDidriksen> So <t:a><y:b><t:c> would be stored as t:a -> 1, t:c ->3, y:b -> 2 [08:12:51] <TinoDidriksen> Not prefix order. Tag order. [08:13:46] <khannatanmai> if you preserve prefix order, and for duplicate prefixes you preserve the order of values, that amounts to preserving tag order right [08:13:51] <TinoDidriksen> Nope [08:14:48] <khannatanmai> So <t:a><y:b><t:c> would become [t,y,t], [t->a,c ; y->b] [08:15:05] <TinoDidriksen> How do you get <t:a><y:b><t:c> back out from that? [08:15:15] <TinoDidriksen> Oh, like that. [08:15:27] <TinoDidriksen> Sure, that could work. [08:16:27] <khannatanmai> the existence check would still come from the unordered_map, plus the usual benefit of getting value from key [08:17:24] <TinoDidriksen> Existence would be for the whole tag. [08:17:49] <khannatanmai> oh I was thinking it as checking for existence of a prefix [08:17:59] <khannatanmai> *of it [08:18:40] <khannatanmai> i.e. is the surface form in this LU [08:19:31] <TinoDidriksen> Sure, that's also useful. [08:20:04] *** Joins: Weizhe (~Weizhe@117.136.107.66) [08:20:19] <khannatanmai> where would one check for the existence of a whole tag? (im assuming we're only talking about the secondary tags) [08:21:01] <khannatanmai> so the check would be for a predix-value pair, yeah I can see that [08:21:15] <khannatanmai> *prefix [08:21:35] <TinoDidriksen> I'm thinking in CG terms. You usually query whole tags. [08:22:18] <khannatanmai> Do CG tags have a feature-value pair? [08:22:22] <TinoDidriksen> But for many tools it would be just passing along a string with all secondary information as-is, since they don't need to query it. [08:24:02] <khannatanmai> yup [08:24:03] <TinoDidriksen> CG is pretty flexible. A tag is anything you want it to be. Some patterns are further blessed with extra behavior. [08:24:52] <khannatanmai> so when we talk about querying here, we're only talking about secondary info right? cause don't want to mess with how the programs already use the primary tags [08:28:20] <TinoDidriksen> Yup. I don't see how that would change in either case. [08:28:35] <TinoDidriksen> If people ask for t:.* it's pretty clear what they want. [08:30:23] <khannatanmai> yeah I just meant it seems like everything can be done by querying prefixes and getting values, rather than querying whole tags. even querying whole tags could be divided into querying prefix and checking value [08:30:58] <khannatanmai> I mean I guess this is querying whole tags in a way [08:31:26] <TinoDidriksen> Certainly. I just don't see what you gain by splitting the storage. To me, that says someone would want to just look at values, which makes no sense in my head. [08:33:13] <khannatanmai> the values is where the information is right? the prefixes will be known beforehand. i.e. i enter a prefix and it gives me all the values associated to it, and then i use them for whatever purpose [08:33:57] <TinoDidriksen> Yeah, that could be something people want to do... [08:38:48] <TinoDidriksen> Well, now I'm thinking of efficient storage. flat_multimap<Tag,size_t>, where Tag is {string_view prefix, string_view value} - string_view into some larger shared storage to prevent multiple allocations. That would be amortized 1 allocation per reading, which is hard to beat. [08:48:13] <khannatanmai> and I'll be able to query this by saying something like my tag is {t, .*} ? [08:51:40] <TinoDidriksen> lower_bound() for {t, null} would find the first tag that has prefix t in O(log n) time, but given we assume there's few of these and flat storage, this will be a very low constant. [08:52:20] <TinoDidriksen> Then increment the iterator until prefix no longer matches. [08:54:12] <khannatanmai> the reason I was proposing separate storage, is that if I use an array to preserve position, and a multimap to map prefix->value, I don't need to use the array if I don't care about the position, and the lookup will be really fast [08:54:34] <khannatanmai> so if I just want the surface form, I'll just look for values for the 'sf' key in the multimap, get them and be done with it [08:55:03] <khannatanmai> this is of course assuming that order is something not everyone would want, or that it would be less needed than just a value query [08:55:35] <TinoDidriksen> Almost nobody will want the order, but it must be preserved and available. [08:56:15] <khannatanmai> yeah that's why the array is there, to preserve it and reconstruct tag order if needed [08:57:03] <khannatanmai> I just think if most of the use is going to be getting values given prefix, then that's what it should make the fastest [08:59:13] <TinoDidriksen> Yeah, but I'm sure sure this is where the real world wins over algorithmic perfection. Split storage would on paper be faster, but flat map would in reality be faster. CPU caches and indirection makes a big difference. But it should be pretty easy to work out, both on paper and with a quick test program. [09:00:26] <TinoDidriksen> "sure sure"? Pretty sure, is what I wanted to write. [09:02:47] <khannatanmai> Yeah fair enough. I don’t know enough about caches and indirection to know this but yeah running some tests might be worthwhile, just to show other people why we went one way