User talk:Khannatanmai/New Apertium stream format

Implementation of querying of secondary tags (1 April 2020)

[07:56:09] <TinoDidriksen> khannatanmai, there's also _multimap, but order will matter. At minimum, the input order must be preserved, and eventually people will want to query these things where order matters.
[07:57:09] <TinoDidriksen> That was one of the fundamental design flaws I made in CG-3. I figured that surely tag order wouldn't matter in queries, but there are uses for it.
[07:58:32] <khannatanmai> yeah true
[08:00:49] <khannatanmai> apparently LinkedListMultimap preserves insertion order
[08:02:01] <khannatanmai> and for the same key I think it already preserves insertion order
[08:04:09] <TinoDidriksen> I haven't fully fixed CG-3's storage yet, but what I had in mind was unordered_multimap<Tag,size_t> so that an existence check can still be done as O(1) hash lookup, but input is preserved in the size_t.
[08:05:59] <TinoDidriksen> Have to preserve input order even when groups are mixed. <t:a><y:b><t:c>
[08:06:37] <khannatanmai> don't you think even a normal map would work for an existence check? unless ofc if existence check could also mean existence of multiple duplicate keys
[08:07:14] <khannatanmai> oh wait you're talking about putting the whole tag in the map?
[08:07:33] <TinoDidriksen> For this number of tags, sure. A flat map would probably be best performance.
[08:07:54] <TinoDidriksen> As a rule, don't use linked lists for anything. They're just bad.
[08:11:12] <khannatanmai> what exactly do you mean by "input is preserved in the size_t" ?
[08:11:24] <khannatanmai> and in your map, Tag is the whole tag or just the prefix
[08:11:43] <TinoDidriksen> Whole tag, mapped to input position.
[08:12:01] <khannatanmai> ah input position, alright
[08:12:30] <khannatanmai> how about an array to preserve prefix order and an unordered multimap to map prefix to values
[08:12:35] <TinoDidriksen> So <t:a><y:b><t:c> would be stored as t:a -> 1, t:c -> 3, y:b -> 2
[08:12:51] <TinoDidriksen> Not prefix order. Tag order.
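Editor's sketch of the storage described above, where each whole tag maps to its input position. A Python dict of lists stands in for unordered_multimap<Tag,size_t>; the function names index_tags and in_input_order are hypothetical and not taken from CG-3 or Apertium.

```python
from collections import defaultdict

def index_tags(tags):
    """Map each whole tag to the 1-based positions where it occurred."""
    positions = defaultdict(list)  # stand-in for unordered_multimap<Tag, size_t>
    for pos, tag in enumerate(tags, start=1):
        positions[tag].append(pos)
    return positions

def in_input_order(positions):
    """Rebuild the original tag sequence from the stored positions."""
    pairs = [(pos, tag) for tag, plist in positions.items() for pos in plist]
    return [tag for _pos, tag in sorted(pairs)]

index = index_tags(["t:a", "y:b", "t:c"])
has_tag = "t:c" in index          # hash lookup for a whole tag
restored = in_input_order(index)  # input order survives in the positions
```

The existence check stays a hash lookup, and the mixed-group order of <t:a><y:b><t:c> can still be recovered by sorting on the stored positions.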
[08:13:46] <khannatanmai> if you preserve prefix order, and for duplicate prefixes you preserve the order of values, that amounts to preserving tag order right
[08:13:51] <TinoDidriksen> Nope
[08:14:48] <khannatanmai> So <t:a><y:b><t:c> would become [t,y,t], [t->a,c ; y->b]
[08:15:05] <TinoDidriksen> How do you get <t:a><y:b><t:c> back out from that?
[08:15:15] <TinoDidriksen> Oh, like that.
[08:15:27] <TinoDidriksen> Sure, that could work.
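Editor's sketch of the split storage proposed above: one array of prefixes in input order plus a prefix -> values map. It assumes a tag splits at its first ":"; split_store and rebuild are hypothetical names used only for illustration.

```python
from collections import defaultdict

def split_store(tags):
    """Store prefix order and per-prefix values separately."""
    order = []                  # e.g. ["t", "y", "t"]
    values = defaultdict(list)  # e.g. {"t": ["a", "c"], "y": ["b"]}
    for tag in tags:
        prefix, _, value = tag.partition(":")
        order.append(prefix)
        values[prefix].append(value)
    return order, values

def rebuild(order, values):
    """Recover the original tag order from the two structures."""
    used = defaultdict(int)  # how many values of each prefix were emitted so far
    out = []
    for prefix in order:
        out.append(f"{prefix}:{values[prefix][used[prefix]]}")
        used[prefix] += 1
    return out

order, values = split_store(["t:a", "y:b", "t:c"])
```

Because values of the same prefix keep their relative order, walking the prefix array and consuming values one at a time is enough to get <t:a><y:b><t:c> back out.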
[08:16:27] <khannatanmai> the existence check would still come from the unordered_map, plus the usual benefit of getting value from key
[08:17:24] <TinoDidriksen> Existence would be for the whole tag.
[08:17:49] <khannatanmai> oh I was thinking of it as checking for existence of a prefix
[08:18:40] <khannatanmai> i.e. is the surface form in this LU
[08:19:31] <TinoDidriksen> Sure, that's also useful.
[08:20:19] <khannatanmai> where would one check for the existence of a whole tag? (im assuming we're only talking about the secondary tags)
[08:21:01] <khannatanmai> so the check would be for a prefix-value pair, yeah I can see that
[08:21:35] <TinoDidriksen> I'm thinking in CG terms. You usually query whole tags.
[08:22:18] <khannatanmai> Do CG tags have a feature-value pair?
[08:22:22] <TinoDidriksen> But for many tools it would be just passing along a string with all secondary information as-is, since they don't need to query it.
[08:24:02] <khannatanmai> yup
[08:24:03] <TinoDidriksen> CG is pretty flexible. A tag is anything you want it to be. Some patterns are further blessed with extra behavior.
[08:24:52] <khannatanmai> so when we talk about querying here, we're only talking about secondary info right? cause don't want to mess with how the programs already use the primary tags
[08:28:20] <TinoDidriksen> Yup. I don't see how that would change in either case.
[08:28:35] <TinoDidriksen> If people ask for t:.* it's pretty clear what they want.
[08:30:23] <khannatanmai> yeah I just meant it seems like everything can be done by querying prefixes and getting values, rather than querying whole tags. even querying whole tags could be divided into querying prefix and checking value
[08:30:58] <khannatanmai> I mean I guess this is querying whole tags in a way
[08:31:26] <TinoDidriksen> Certainly. I just don't see what you gain by splitting the storage. To me, that says someone would want to just look at values, which makes no sense in my head.
[08:33:13] <khannatanmai> the values are where the information is, right? the prefixes will be known beforehand. i.e. i enter a prefix and it gives me all the values associated with it, and then i use them for whatever purpose
[08:33:57] <TinoDidriksen> Yeah, that could be something people want to do...
[08:38:48] <TinoDidriksen> Well, now I'm thinking of efficient storage. flat_multimap<Tag,size_t>, where Tag is {string_view prefix, string_view value} - string_view into some larger shared storage to prevent multiple allocations. That would be amortized 1 allocation per reading, which is hard to beat.
[08:48:13] <khannatanmai> and I'll be able to query this by saying something like my tag is {t, .*} ?
[08:51:40] <TinoDidriksen> lower_bound() for {t, null} would find the first tag that has prefix t in O(log n) time, but given we assume there's few of these and flat storage, this will be a very low constant.
[08:52:20] <TinoDidriksen> Then increment the iterator until prefix no longer matches.
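Editor's sketch of the flat, sorted storage and the lower_bound() query described above. A sorted Python list of (prefix, value, position) tuples stands in for flat_multimap<Tag,size_t>, and bisect.bisect_left plays the role of lower_bound(). The names build_flat and values_for are hypothetical, and splitting at the first ":" is an assumption.

```python
import bisect

def build_flat(tags):
    """Return one sorted list of (prefix, value, input_position) tuples."""
    flat = []
    for pos, tag in enumerate(tags, start=1):
        prefix, _, value = tag.partition(":")
        flat.append((prefix, value, pos))
    flat.sort()
    return flat

def values_for(flat, prefix):
    """Find the first entry with this prefix, then scan while it still matches."""
    # (prefix,) sorts before every (prefix, value, pos), like {t, null} in the chat.
    i = bisect.bisect_left(flat, (prefix,))
    found = []
    while i < len(flat) and flat[i][0] == prefix:
        found.append(flat[i][1:])  # (value, input_position)
        i += 1
    return found

flat = build_flat(["t:a", "y:b", "t:c", "sf:dogs"])
t_values = values_for(flat, "t")
input_order = [f"{p}:{v}" for p, v, _pos in sorted(flat, key=lambda e: e[2])]
```

Within one prefix the entries come back sorted by value rather than by input order; the stored position is what keeps the original order available when someone needs it.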
[08:54:12] <khannatanmai> the reason I was proposing separate storage, is that if I use an array to preserve position, and a multimap to map prefix->value, I don't need to use the array if I don't care about the position, and the lookup will be really fast
[08:54:34] <khannatanmai> so if I just want the surface form, I'll just look for values for the 'sf' key in the multimap, get them and be done with it
[08:55:03] <khannatanmai> this is of course assuming that order is something not everyone would want, or that it would be less needed than just a value query
[08:55:35] <TinoDidriksen> Almost nobody will want the order, but it must be preserved and available.
[08:56:15] <khannatanmai> yeah that's why the array is there, to preserve it and reconstruct tag order if needed
[08:57:03] <khannatanmai> I just think if most of the use is going to be getting values given prefix, then that's what it should make the fastest
[08:59:13] <TinoDidriksen> Yeah, but I'm pretty sure this is where the real world wins over algorithmic perfection. Split storage would on paper be faster, but flat map would in reality be faster. CPU caches and indirection make a big difference. But it should be pretty easy to work out, both on paper and with a quick test program.
[09:02:47] <khannatanmai> Yeah fair enough. I don’t know enough about caches and indirection to know this but yeah running some tests might be worthwhile, just to show other people why we went one way


it's not able to generate the surface forms
I was of the opinion that if the monodix contained: xyz -> xy<a><b><c>, then during generation, xy<a><b><c><z> would generate xyz
am I wrong? or is this a different issue
I thought that would work as well...
khannatanmai (
yeah I tested it a bit more. this isn't a problem with the secondary tags; apparently a generator needs an exact match
Well, it's the generator that needs to consume most of these tags anyway, so fine.
khannatanmai (
interestingly the generator debug mode doesn't print the secondary tags. weird. but it's definitely a more fundamental problem. I think I need to go through the generator documentation a bit more
I guess the bidix has a special functionality to let through underspecified analyses?
which is why bidix just copies the secondary tags on to the tl side
Which makes sense. Usually you don't need much to disambiguate, so why force people to write out all tags. But I don't see why the generator doesn't shortcut in the same way. Is there a cmdline flag to make it do that?
Not that it matters. The generator absolutely needs modification in any case.
khannatanmai (
can't see a helpful flag for lt-proc. I'm just thinking about why it's like this
I can see why it would reject underspecified forms, but not overspecified. It should just print when it reaches a terminal and discard the remainder. But again, I assume it's an artifact of how FSTs work...they're not very flexible.
khannatanmai (
if my bidix gives xyz<n><f> in the tl, and my monodix has xyzes -> xyz<n>, you'd still want xyzes right
yeah underspecified would make sense ofc
well it cant just reach a terminal and discard
because if xyz<n> and xyz<n><f>, both exist in the monodix then it would never reach the latter
A final terminal, then.
khannatanmai (
yeah if it's reached a terminal stage and then later the input ends and no new terminal stage comes then it should use the last terminal
im just trying to think if someone decided it's better to have just the lemma instead of trying to just assign something to an overspecified form
like I have dog->dog<n> in my monodix, and I get dog<n><pl> from the bidix lookup, and then since idk what the plural form is I'll just give a generation error
sort of makes sense I guess?
For development, sure. It should be possible to detect these issues. But for production, you probably want the most generation you can get, potentially with a mark to show it's incomplete.
khannatanmai (
incomplete often means super incorrect though. In a way they did do that by showing the lemma and a # to show we didnt find an exact match in the dictionary
I'm just wondering if a better way is to have a way to ignore secondary tags when matching, rather than modifying the generator to match overspecified forms
unless if the dix has secondary tags as well (for the future), then it shouldnt ignore them while matching
The generator has to use the secondary tags. It is what needs to deal with passed through surface forms and markup tags.
khannatanmai (
yeah it should definitely use them, I'm just saying it shouldn't use them while matching dictionary forms
anyway if we don't have secondary tags in the dictionary right now, the secondary tags aren't gonna play a part in the matching. in terms of trimming, the bidix wont have a tl output for the word so this wont be an issue, and for markup tags, the matching with surface form still happens independent of them
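Editor's sketch of "use the secondary tags, but not while matching dictionary forms". It assumes a secondary tag is any tag containing ":" and models the generator dictionary as a plain Python dict rather than a transducer; split_unit and generate are hypothetical names, and the "#lemma" fallback mirrors the mark mentioned in the discussion.

```python
import re

TAG = re.compile(r"<([^<>]+)>")

def split_unit(unit):
    """Split a lexical unit into lemma, primary tags and secondary tags."""
    lemma = unit.split("<", 1)[0]
    tags = TAG.findall(unit)
    primary = [t for t in tags if ":" not in t]
    secondary = [t for t in tags if ":" in t]  # assumption: ':' marks secondary
    return lemma, primary, secondary

def generate(unit, dictionary):
    """Match on lemma + primary tags only; hand the secondary tags back."""
    lemma, primary, secondary = split_unit(unit)
    key = lemma + "".join(f"<{t}>" for t in primary)
    surface = dictionary.get(key, "#" + lemma)  # '#' marks a missing exact match
    return surface, secondary

monodix = {"dog<n><pl>": "dogs"}
result = generate("dog<n><pl><sf:dogs>", monodix)
```

The secondary tags never take part in the lookup, but they are still returned so the caller can use them for surface forms or markup.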

[19:18:00] <khannatanmai> I was of the opinion that if the monodix contained: xyz -> xy<a><b><c>, then during generation, xy<a><b><c><z> would generate xyz
I don't understand this at all
if you don't have xyz:xy<a><b><c><z> in the transducer, then xy<a><b><c><z> won't generate xyz
this is just how transducers work
yeah but the bidix is able to do it
if you have xy<a>/uv<b> in a bidix and you give it xy<a><y> it will give you uv<b><y> and I understand it's just matching and copying the remainder tags
the bidix is matching substrings
yeah exactly
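Editor's sketch of the bidix behaviour described above: match the longest known prefix of the tag sequence and copy the remaining tags to the target side. The bidix is modelled as a dict, which is an assumption made for illustration; the real lookup runs on a transducer, and translate is a hypothetical name.

```python
import re

TAG = re.compile(r"<[^<>]+>")

def translate(unit, bidix):
    """Longest-prefix match on lemma + tags, then copy the unmatched tags."""
    lemma = unit.split("<", 1)[0]
    tags = TAG.findall(unit)
    for n in range(len(tags), -1, -1):  # try the longest tag prefix first
        source = lemma + "".join(tags[:n])
        if source in bidix:
            return bidix[source] + "".join(tags[n:])
    return None  # no entry matched at all

bidix = {"xy<a>": "uv<b>"}
target = translate("xy<a><y>", bidix)
```

With the entry xy<a>/uv<b>, the input xy<a><y> comes out as uv<b><y>, which is the match-and-copy behaviour the generator lacks.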
if you could do what you're trying to do
then when you generate dog<n><pl>, then you would get dog and dogs
(if your singular is dog:dog<n>)
it would be terrible for agglutinating languages
I mean if there's a match for the more specified form then of course you would ignore the other matches
sort of like LRLM
ат<n><pl><px1sg><dat> would generate dozens of forms
hm, I mean, I could see something that works that way, but that's just not how FST processors work
Flammie might have further thoughts
you could get all the matches and generate only the one with the longest string match
But of course that would mean dog<n><pl> would generate "dog", if we only have dog:dog<n> in the monodix, instead of "#dog"
which is why another solution is to not use secondary tags in the transducer
so we have:

1. make it an LRLM sort of thing: get all possible matches but generate only the longest match
2. ignore secondary tags while doing the matching
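Editor's sketch of the first option, a longest-match ("last final state") generator. A nested-dict trie stands in for the transducer, which is a simplification; build_trie and generate_longest are hypothetical names and this is not how lt-proc currently behaves.

```python
import re

SYMBOL = re.compile(r"<[^<>]+>|[^<>]")  # a whole tag, or one lemma character

def build_trie(entries):
    """Build a trie over analysis symbols; the None key marks a final state."""
    root = {}
    for analysis, surface in entries.items():
        node = root
        for sym in SYMBOL.findall(analysis):
            node = node.setdefault(sym, {})
        node[None] = surface
    return root

def generate_longest(analysis, root):
    """Walk as far as possible and return the surface of the last final state."""
    node, best = root, None
    for sym in SYMBOL.findall(analysis):
        node = node.get(sym)
        if node is None:
            break            # no transition: fall back to the last final state
        if None in node:
            best = node[None]
    return best

full = build_trie({"dog<n>": "dog", "dog<n><pl>": "dogs"})
singular_only = build_trie({"dog<n>": "dog"})
```

With both entries present, dog<n><pl> yields "dogs". With only dog:dog<n> in the dictionary it yields "dog" instead of "#dog", which is the drawback raised in the discussion.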
I believe you said you were going to put secondary tags in the monodix?
either that, or they're not something relevant to analysis and generation stages
in the future we could. if we do that then we cant just ignore them yes
maybe we could implement something like, if the current possible paths in the transducer contain a secondary tag, then don't ignore the secondary tag, otherwise ignore
but im not sure how one would do that
I wanna say for now we can say that we dont want to add secondary tags to the dixes, but a solution that deals with something like that might be nice
or we can just all decide we dont wanna add secondary tags to the dixes
(16:27:55) khannatanmai: but im not sure how one would do that
sounds like you'd need to change lt-proc and hfst-proc's algorithms
khannatanmai: I think it's safe to ignore for now
with the caveat that we may want to do something with it at some point
(16:08:25) khannatanmai: how's lockdown with kids going
ugh so much fun lol
even if we ignore them we'll have to change the algorithms a bit though. But yeah not as much as the other thing. So I guess for now we'll assume we don't have secondary tags in the dictionaries
firespeaker haha enjoy lol
Definitely assume there are no secondary tags in the generator. If they were needed there, they'd be primary.
when I was thinking of ways in which secondary tags could be useful, I thought you could use them to distinguish close senses of a word by putting that in the dictionary
but yeah I guess you could make primary tags for that
They can be in the analyser monodix and be used throughout the pipe.
the monodix is the same tho
I know
by restricting the direction?
yeah that could work
cool done then

Prefix format in Secondary Tags

(prefix matching)
but for pattern matching they should be matchable
also, i think it would be better if they just had a single char prefix like syntax or semantic tags
i don't like embedding feature=value in apertium tags
It's not feature=value
khannatanmai (
whats the difference between prefix matching and pattern matching?
so like <n><sg><gen><@subj><§agent><:human>
for the sf if people insist on it it could be with %
No, because then we have to codify ALL possible secondary tags now, which we can't do.
So that's impossible.
so like <n><sg><gen><%отца><@subj><§agent><:human>
TinoDidriksen: no we wouldn't have to
: would just indicate secondary tag
and then it's up to the language (pair) person to decide how to use it
: does indicate secondary tag...
without the prefix
khannatanmai (
how would you distinguish different types of secondary tags though?
khannatanmai: in general you wouldn't need to
Then you just need an extra infix delimiter.
for the surface form thing you can use a different symbol prefix
e.g. %
And for markup tags? <t:a> <t:span> etc is nice and short.
i wouldn't use this for markup tags, but if necessary you could just put them with some other symbol
Well, <t:a:1> <t:span:5> since we'd want unique IDs.
like !
if unique ids:
khannatanmai (
overall isnt this also just using symbols as features? what's the benefit - given that we still have to make FSTs ignore them
There's no benefit. It's exactly just using symbols as namespaces.
Except it's harder to read.
khannatanmai: you don't
you can just say that any tag that starts with [^a-z1-9] is a special tag
it's easier to read
<foo:bar> is horrible
oops [^a-zA-Z0-9]
What we do is say any tag that contains : is special.
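Editor's sketch comparing the two rules proposed above for spotting special tags: a first character matching [^a-zA-Z0-9] versus any tag containing ":". The function names are hypothetical and neither rule is presented here as the adopted one.

```python
import re

SYMBOL_PREFIX = re.compile(r"[^a-zA-Z0-9]")

def special_by_symbol(tag):
    """Special if the first character is not an ASCII letter or digit."""
    return bool(SYMBOL_PREFIX.match(tag))

def special_by_colon(tag):
    """Special if the tag contains a colon anywhere."""
    return ":" in tag

samples = ["n", "@subj", "%отца", "sf:отца", ":human"]
by_symbol = [t for t in samples if special_by_symbol(t)]
by_colon = [t for t in samples if special_by_colon(t)]
```

The symbol rule accepts @subj and %отца but not sf:отца, while the colon rule does the opposite; both accept :human. As written, the ASCII-only character class would also flag a plain tag that starts with a non-ASCII letter, so the symbol rule would need a wider class if such tags exist.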
khannatanmai (
foo:bar is horrible.
just so I'm on the same page. why though?
I would say %bar !a is horrible. It's so much more noisy.
Yes, : is much nicer to read.
i disagree
i can barely read it
khannatanmai: it's hard to read... it's hard to filter out the prefixes from the values
it makes the tag streams longer
it's "ugly" in an aesthetic sense
khannatanmai (
I understand what you mean, but someone reading the other stream will have to know what each of the symbols means
you can tell from the values
and memorising three symbols is easy
It won't be three.
We'll run out of symbol prefixes.
i don't see why
if we run out of reasonable symbol prefixes, the whole thing needs to be redesigned
But with : tags, it won't. It'll all just work, regardless of what people do with it.
Future proofing.
if it gets to that point, people can do :sf_ ... whatever
but i don't want to encourage ugliness
khannatanmai: you can always make the two systems optional and let people choose
My whole point is to make a system NOW that will work for all future needs, and : will do that.
khannatanmai (
<n><sg><gen><%:отца><@:subj><§:agent> you can even do this if needed in the current solution
the difference essentially amounts to bikeshedding
It's not bikeshedding. It's a fundamental implementation difference.
TinoDidriksen: the implementation is making the tags epsilons
khannatanmai (
<n><sg><gen><%:отца><@:subj><§:agent> you can even do this if needed in the current solution

this sort of thing would in a way make it optional?

the design part is choosing which tags to make
khannatanmai: i don't hate that as much
With :, we can once and for all say : tags are handled in this case. Without :, we need to codify all prefixes.
khannatanmai: so long as the prefix could be null
TinoDidriksen: wrong
khannatanmai (
of course, I hadn't planned to give the user an option as to what the secondary tags prefixes are going to be
Sure, null prefix should be fine. My definition would just be any tag containing : is secondary.
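Editor's sketch of the closing definition, "any tag containing : is secondary", with a null prefix allowed. Splitting at the first ":" is an assumption, chosen so that an ID such as the one in <t:a:1> stays inside the value; parse_tag is a hypothetical name.

```python
def parse_tag(tag):
    """Return (prefix, value) for a secondary tag, or None for a primary tag."""
    if ":" not in tag:
        return None
    prefix, _, value = tag.partition(":")  # split at the first ':' only
    return prefix, value

parsed = [parse_tag(t) for t in ["n", "sf:отца", ":human", "t:a:1", "%:отца"]]
```

Under this reading the prefix may be empty (as in <:human>) or a symbol (as in <%:отца>), so the symbol-style tags and the named prefixes fit the same rule without listing the allowed prefixes in advance.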