User talk:Khannatanmai/New Apertium stream format

From Apertium
Jump to navigation Jump to search

Implementation of querying of secondary tags (1 April 2020)

[07:56:09] <TinoDidriksen> khannatanmai, there's also _multimap, but order will matter. At minimum, the input order must be preserved, and eventually people will want to query these things where order matters.
[07:57:09] <TinoDidriksen> That was one of the fundamental design flaws I made in CG-3. I figured that surely tag order wouldn't matter in queries, but there are uses for it.
[07:58:32] <khannatanmai> yeah true
[08:00:49] <khannatanmai> apparently LinkedListMultimap preserves insertion order
[08:02:01] <khannatanmai> and for the same key I think it already preserves insertion order
[08:04:09] <TinoDidriksen> I haven't fully fixed CG-3's storage yet, but what I had in mind was unordered_multimap<Tag,size_t> so that an existence check can still be done as O(1) hash lookup, but input is preserved in the size_t.
[08:05:59] <TinoDidriksen> Have to preserve input order even when groups are mixed. <t:a><y:b><t:c>
[08:06:37] <khannatanmai> don't you think even a normal map would work for an existence check? unless ofc if existence check could also mean existence of multiple duplicate keys
[08:07:14] <khannatanmai> oh wait you're talking about putting the whole tag in the map?
[08:07:33] <TinoDidriksen> For this number of tags, sure. A flat map would probably be best performance.
[08:07:54] <TinoDidriksen> As a rule, don't use linked lists for anything. They're just bad.
[08:11:12] <khannatanmai> what exactly do you mean by "input is preserved in the size_t" ?
[08:11:24] <khannatanmai> and in your map, Tag is the whole tag or just the prefix
[08:11:43] <TinoDidriksen> Whole tag, mapped to input position.
[08:12:01] <khannatanmai> ah input position, alright
[08:12:30] <khannatanmai> how about an array to preserve prefix order and an unordered multimap to map prefix to values
[08:12:35] <TinoDidriksen> So <t:a><y:b><t:c> would be stored as t:a -> 1, t:c ->3, y:b -> 2
[08:12:51] <TinoDidriksen> Not prefix order. Tag order.
[08:13:46] <khannatanmai> if you preserve prefix order, and for duplicate prefixes you preserve the order of values, that amounts to preserving tag order right
[08:13:51] <TinoDidriksen> Nope
[08:14:48] <khannatanmai> So <t:a><y:b><t:c> would become [t,y,t], [t->a,c ; y->b]
[08:15:05] <TinoDidriksen> How do you get <t:a><y:b><t:c> back out from that?
[08:15:15] <TinoDidriksen> Oh, like that.
[08:15:27] <TinoDidriksen> Sure, that could work.
[08:16:27] <khannatanmai> the existence check would still come from the unordered_map, plus the usual benefit of getting value from key
[08:17:24] <TinoDidriksen> Existence would be for the whole tag.
[08:17:49] <khannatanmai> oh I was thinking it as checking for existence of a prefix
[08:17:59] <khannatanmai> *of it
[08:18:40] <khannatanmai> i.e. is the surface form in this LU
[08:19:31] <TinoDidriksen> Sure, that's also useful.
[08:20:04] *** Joins: Weizhe (~Weizhe@117.136.107.66)
[08:20:19] <khannatanmai> where would one check for the existence of a whole tag? (im assuming we're only talking about the secondary tags)
[08:21:01] <khannatanmai> so the check would be for a predix-value pair, yeah I can see that
[08:21:15] <khannatanmai> *prefix
[08:21:35] <TinoDidriksen> I'm thinking in CG terms. You usually query whole tags.
[08:22:18] <khannatanmai> Do CG tags have a feature-value pair?
[08:22:22] <TinoDidriksen> But for many tools it would be just passing along a string with all secondary information as-is, since they don't need to query it.
[08:24:02] <khannatanmai> yup
[08:24:03] <TinoDidriksen> CG is pretty flexible. A tag is anything you want it to be. Some patterns are further blessed with extra behavior.
[08:24:52] <khannatanmai> so when we talk about querying here, we're only talking about secondary info right? cause don't want to mess with how the programs already use the primary tags
[08:28:20] <TinoDidriksen> Yup. I don't see how that would change in either case.
[08:28:35] <TinoDidriksen> If people ask for t:.* it's pretty clear what they want.
[08:30:23] <khannatanmai> yeah I just meant it seems like everything can be done by querying prefixes and getting values, rather than querying whole tags. even querying whole tags could be divided into querying prefix and checking value
[08:30:58] <khannatanmai> I mean I guess this is querying whole tags in a way
[08:31:26] <TinoDidriksen> Certainly. I just don't see what you gain by splitting the storage. To me, that says someone would want to just look at values, which makes no sense in my head.
[08:33:13] <khannatanmai> the values is where the information is right? the prefixes will be known beforehand. i.e. i enter a prefix and it gives me all the values associated to it, and then i use them for whatever purpose
[08:33:57] <TinoDidriksen> Yeah, that could be something people want to do...
[08:38:48] <TinoDidriksen> Well, now I'm thinking of efficient storage. flat_multimap<Tag,size_t>, where Tag is {string_view prefix, string_view value} - string_view into some larger shared storage to prevent multiple allocations. That would be amortized 1 allocation per reading, which is hard to beat.
[08:48:13] <khannatanmai> and I'll be able to query this by saying something like my tag is {t, .*} ?
[08:51:40] <TinoDidriksen> lower_bound() for {t, null} would find the first tag that has prefix t in O(log n) time, but given we assume there's few of these and flat storage, this will be a very low constant.
[08:52:20] <TinoDidriksen> Then increment the iterator until prefix no longer matches.
[08:54:12] <khannatanmai> the reason I was proposing separate storage, is that if I use an array to preserve position, and a multimap to map prefix->value, I don't need to use the array if I don't care about the position, and the lookup will be really fast
[08:54:34] <khannatanmai> so if I just want the surface form, I'll just look for values for the 'sf' key in the multimap, get them and be done with it
[08:55:03] <khannatanmai> this is of course assuming that order is something not everyone would want, or that it would be less needed than just a value query
[08:55:35] <TinoDidriksen> Almost nobody will want the order, but it must be preserved and available.
[08:56:15] <khannatanmai> yeah that's why the array is there, to preserve it and reconstruct tag order if needed
[08:57:03] <khannatanmai> I just think if most of the use is going to be getting values given prefix, then that's what it should make the fastest
[08:59:13] <TinoDidriksen> Yeah, but I'm sure sure this is where the real world wins over algorithmic perfection. Split storage would on paper be faster, but flat map would in reality be faster. CPU caches and indirection makes a big difference. But it should be pretty easy to work out, both on paper and with a quick test program.
[09:00:26] <TinoDidriksen> "sure sure"? Pretty sure, is what I wanted to write.
[09:02:47] <khannatanmai> Yeah fair enough. I don’t know enough about caches and indirection to know this but yeah running some tests might be worthwhile, just to show other people why we went one way

GENERATION ISSUES WITH SECONDARY TAGS (13 April 2020)

it's not able to generate the surface forms
I was of the opinion that if the monodix contained: xyz -> xy<a><b><c>, then during generation, xy<a><b><c><z> would generate xyz
am I wrong? or is this a different issue
TinoDidriksen
I thought that would work as well...
khannatanmai (@khannatanmai:matrix.org)
yeah I tested it a bit more this isn't a problem with the secondary tags apparently a generator needs an exact match
TinoDidriksen
Well, it's the generator that needs to consume most of these tags anyway, so fine.
khannatanmai (@khannatanmai:matrix.org)
interestingly the generator debug mode doesnt print the secondary tags. weird. but it's definitely a more fundamental problem. I think I need to go through the generator documentation a bit more
I guess the bidix has a special functionality to let through underspecified analyses?
which is why bidix just copies the secondary tags on to the tl side
TinoDidriksen
Which makes sense. Usually you don't need much to disambiguate, so why force people to write out all tags. But I don't see why the generator doesn't shortcut in the same way. Is there a cmdline flag to make it do that?
Not that it matters. The generator absolutely needs modification in any case.
khannatanmai (@khannatanmai:matrix.org)
can't see a helpful flag for lt-proc. I'm just thinking about why it's like this
TinoDidriksen
I can see why it would reject underspecified forms, but not overspecified. It should just print when it reaches a terminal and discard the remainder. But again, I assume it's an artifact of how FSTs work...they're not very flexible.
khannatanmai (@khannatanmai:matrix.org)
if my bidix gives xyz<n><f> in the tl, and my monodix has xyzes -> xyz<n>, you'd still want xyzes right
yeah underspecified would make sense ofc
well it cant just reach a terminal and discard
because if xyz<n> and xyz<n><f>, both exist in the monodix then it would never reach the latter
TinoDidriksen
A final terminal, then.
khannatanmai (@khannatanmai:matrix.org)
yeah if it's reached a terminal stage and then later the input ends and no new terminal stage comes then it should use the last terminal
im just trying to think if someone decided it's better to have just the lemma instead of trying to just assign something to an overspecified form
like I have dog->dog<n> in my monodix, and I get dog<n><pl> from the bidix lookup, and then since idk what the plural form is I'll just give a generation error
sort of makes sense I guess?
TinoDidriksen
For development, sure. It should be possible to detect these issues. But for production, you probably want the most generation you can get, potentially with a mark to show it's incomplete.
khannatanmai (@khannatanmai:matrix.org)
incomplete often means super incorrect though. In a way they did do that by showing the lemma and a # to show we didnt find an exact match in the dictionary
I'm just wondering if a better way is to have a way to ignore secondary tags when matching, rather than modifying the generator to match overspecified forms
unless if the dix has secondary tags as well (for the future), then it shouldnt ignore them while matching
TinoDidriksen
The generator has to use the secondary tags. It is what needs to deal with passed through surface forms and markup tags.
khannatanmai (@khannatanmai:matrix.org)
yeah it should definitely use them, I'm just saying it shouldn't use them while matching dictionary forms
anyway if we don't have secondary tags in the dictionary right now, the secondary tags aren't gonna play a part in the matching. in terms of trimming, the bidix wont have a tl output for the word so this wont be an issue, and for markup tags, the matching with surface form still happens independent of them

firespeaker
[19:18:00] <khannatanmai> I was of the opinion that if the monodix contained: xyz -> xy<a><b><c>, then during generation, xy<a><b><c><z> would generate xyz
I don't understand this at all
if you don't have xyz:xy<a><b><c><z> in the transducer, then xy<a><b><c><z> won't generate xyz
this is just how transducers work
khannatanmai
yeah but the bidix is able to do it
if you have xy<a>/uv<b> in a bidix and you give it xy<a><y> it will give you uv<b><y> and I understand it's just matching and copying the remainder tags
firespeaker
the bidix is matching substrings
yeah exactly
if you could do what you're trying to do
then when you generate dog<n><pl>, then you would get dog and dogs
(if your singular is dog:dog<n>)
it would be terrible for agglutinating languages
khannatanmai
I mean if there's a match for the more specified form then ofcourse you would ignore the other matches
sort of like LRLM
firespeaker
ат<n><pl><px1sg><dat> would generate dozens of forms
hm, I mean, I could see something that works that way, but that's just no how FST processors work
Flammie might have further thoughts
khannatanmai
you could get all the matches and just generate the one that generated with the longest string match
But of course that would mean dog<n><pl> would generate "dog", if we only have dog:dog<n> in the monodix, instead of "#dog"
which is why another solution is to not use secondary tags in the transducer
so we have:

make it an lrlm sort of thing and get all possible matches but generate only the longest match
ignore secondary tags while doing the matching
firespeaker
I believe you said you were going to put secondary tags in the monodix?
either that, or they're not something relevant to analysis and generation stages
khannatanmai
in the future we could. if we do that then we cant just ignore them yes
maybe we could implement something like, if the current possible paths in the transducer contain a secondary tag, then don't ignore the secondary tag, otherwise ignore
but im not sure how one would do that
I wanna say for now we can say that we dont want to add secondary tags to the dixes, but a solution that deals with something like that might be nice
or we can just all decide we dont wanna add secondary tags to the dixes
firespeaker
(16:27:55) khannatanmai: but im not sure how one would do that
sounds like you'd need to change lt-proc and hfst-proc's algorithms
khannatanmai: I think it's safe to ignore for now
with the caveat that we may want to do something with it at some point
(16:08:25) khannatanmai: how's lockdown with kids going
ugh so much fun lol
khannatanmai
even if we ignore them we'll have to change the algorithms a bit though. But yeah not as much as the other thing. So I guess for now we'll assume we don't have secondary tags in the dictionaries
TinoDidriksen?
firespeaker haha enjoy lol
TinoDidriksen
Definitely assume there are no secondary tags in the generator. If they were needed there, they'd be primary.
khannatanmai
when I was thinking of ways in which secondary tags could be useful, I thought you could use them to distinguish close senses of a word by putting that in the dictionary
but yeah I guess you could make primary tags for that
TinoDidriksen
They can be in the analyser monodix and be used throughout the pipe.
khannatanmai
the monodix is the same tho
TinoDidriksen
I know
khannatanmai
by restricting the direction?
yeah that could work
cool done then