Difference between revisions of "User talk:Khannatanmai/New Apertium stream format"

From Apertium
Jump to navigation Jump to search
Line 61: Line 61:
 
[09:00:26] <TinoDidriksen> "sure sure"? Pretty sure, is what I wanted to write.
 
[09:00:26] <TinoDidriksen> "sure sure"? Pretty sure, is what I wanted to write.
 
[09:02:47] <khannatanmai> Yeah fair enough. I don’t know enough about caches and indirection to know this but yeah running some tests might be worthwhile, just to show other people why we went one way
 
[09:02:47] <khannatanmai> Yeah fair enough. I don’t know enough about caches and indirection to know this but yeah running some tests might be worthwhile, just to show other people why we went one way
  +
</pre>
  +
  +
GENERATION ISSUES WITH SECONDARY TAGS
  +
<pre>
  +
it's not able to generate the surface forms
  +
I was of the opinion that if the monodix contained: xyz -> xy<a><b><c>, then during generation, xy<a><b><c><z> would generate xyz
  +
am I wrong? or is this a different issue
  +
TinoDidriksen
  +
I thought that would work as well...
  +
khannatanmai (@khannatanmai:matrix.org)
  +
yeah I tested it a bit more this isn't a problem with the secondary tags apparently a generator needs an exact match
  +
TinoDidriksen
  +
Well, it's the generator that needs to consume most of these tags anyway, so fine.
  +
khannatanmai (@khannatanmai:matrix.org)
  +
interestingly the generator debug mode doesnt print the secondary tags. weird. but it's definitely a more fundamental problem. I think I need to go through the generator documentation a bit more
  +
I guess the bidix has a special functionality to let through underspecified analyses?
  +
which is why bidix just copies the secondary tags on to the tl side
  +
TinoDidriksen
  +
Which makes sense. Usually you don't need much to disambiguate, so why force people to write out all tags. But I don't see why the generator doesn't shortcut in the same way. Is there a cmdline flag to make it do that?
  +
Not that it matters. The generator absolutely needs modification in any case.
  +
khannatanmai (@khannatanmai:matrix.org)
  +
can't see a helpful flag for lt-proc. I'm just thinking about why it's like this
  +
TinoDidriksen
  +
I can see why it would reject underspecified forms, but not overspecified. It should just print when it reaches a terminal and discard the remainder. But again, I assume it's an artifact of how FSTs work...they're not very flexible.
  +
khannatanmai (@khannatanmai:matrix.org)
  +
if my bidix gives xyz<n><f> in the tl, and my monodix has xyzes -> xyz<n>, you'd still want xyzes right
  +
yeah underspecified would make sense ofc
  +
well it cant just reach a terminal and discard
  +
because if xyz<n> and xyz<n><f>, both exist in the monodix then it would never reach the latter
  +
TinoDidriksen
  +
A final terminal, then.
  +
khannatanmai (@khannatanmai:matrix.org)
  +
yeah if it's reached a terminal stage and then later the input ends and no new terminal stage comes then it should use the last terminal
  +
im just trying to think if someone decided it's better to have just the lemma instead of trying to just assign something to an overspecified form
  +
like I have dog->dog<n> in my monodix, and I get dog<n><pl> from the bidix lookup, and then since idk what the plural form is I'll just give a generation error
  +
sort of makes sense I guess?
  +
TinoDidriksen
  +
For development, sure. It should be possible to detect these issues. But for production, you probably want the most generation you can get, potentially with a mark to show it's incomplete.
  +
khannatanmai (@khannatanmai:matrix.org)
  +
incomplete often means super incorrect though. In a way they did do that by showing the lemma and a # to show we didnt find an exact match in the dictionary
  +
I'm just wondering if a better way is to have a way to ignore secondary tags when matching, rather than modifying the generator to match overspecified forms
  +
unless if the dix has secondary tags as well (for the future), then it shouldnt ignore them while matching
  +
TinoDidriksen
  +
The generator has to use the secondary tags. It is what needs to deal with passed through surface forms and markup tags.
  +
khannatanmai (@khannatanmai:matrix.org)
  +
yeah it should definitely use them, I'm just saying it shouldn't use them while matching dictionary forms
  +
anyway if we don't have secondary tags in the dictionary right now, the secondary tags aren't gonna play a part in the matching. in terms of trimming, the bidix wont have a tl output for the word so this wont be an issue, and for markup tags, the matching with surface form still happens independent of them
 
</pre>
 
</pre>

Revision as of 20:02, 12 April 2020

Implementation of querying of secondary tags (1 April 2020)

[07:56:09] <TinoDidriksen> khannatanmai, there's also _multimap, but order will matter. At minimum, the input order must be preserved, and eventually people will want to query these things where order matters.
[07:57:09] <TinoDidriksen> That was one of the fundamental design flaws I made in CG-3. I figured that surely tag order wouldn't matter in queries, but there are uses for it.
[07:58:32] <khannatanmai> yeah true
[08:00:49] <khannatanmai> apparently LinkedListMultimap preserves insertion order
[08:02:01] <khannatanmai> and for the same key I think it already preserves insertion order
[08:04:09] <TinoDidriksen> I haven't fully fixed CG-3's storage yet, but what I had in mind was unordered_multimap<Tag,size_t> so that an existence check can still be done as O(1) hash lookup, but input is preserved in the size_t.
[08:05:59] <TinoDidriksen> Have to preserve input order even when groups are mixed. <t:a><y:b><t:c>
[08:06:37] <khannatanmai> don't you think even a normal map would work for an existence check? unless ofc if existence check could also mean existence of multiple duplicate keys
[08:07:14] <khannatanmai> oh wait you're talking about putting the whole tag in the map?
[08:07:33] <TinoDidriksen> For this number of tags, sure. A flat map would probably be best performance.
[08:07:54] <TinoDidriksen> As a rule, don't use linked lists for anything. They're just bad.
[08:11:12] <khannatanmai> what exactly do you mean by "input is preserved in the size_t" ?
[08:11:24] <khannatanmai> and in your map, Tag is the whole tag or just the prefix
[08:11:43] <TinoDidriksen> Whole tag, mapped to input position.
[08:12:01] <khannatanmai> ah input position, alright
[08:12:30] <khannatanmai> how about an array to preserve prefix order and an unordered multimap to map prefix to values
[08:12:35] <TinoDidriksen> So <t:a><y:b><t:c> would be stored as t:a -> 1, t:c ->3, y:b -> 2
[08:12:51] <TinoDidriksen> Not prefix order. Tag order.
[08:13:46] <khannatanmai> if you preserve prefix order, and for duplicate prefixes you preserve the order of values, that amounts to preserving tag order right
[08:13:51] <TinoDidriksen> Nope
[08:14:48] <khannatanmai> So <t:a><y:b><t:c> would become [t,y,t], [t->a,c ; y->b]
[08:15:05] <TinoDidriksen> How do you get <t:a><y:b><t:c> back out from that?
[08:15:15] <TinoDidriksen> Oh, like that.
[08:15:27] <TinoDidriksen> Sure, that could work.
[08:16:27] <khannatanmai> the existence check would still come from the unordered_map, plus the usual benefit of getting value from key
[08:17:24] <TinoDidriksen> Existence would be for the whole tag.
[08:17:49] <khannatanmai> oh I was thinking it as checking for existence of a prefix
[08:17:59] <khannatanmai> *of it
[08:18:40] <khannatanmai> i.e. is the surface form in this LU
[08:19:31] <TinoDidriksen> Sure, that's also useful.
[08:20:04] *** Joins: Weizhe (~Weizhe@117.136.107.66)
[08:20:19] <khannatanmai> where would one check for the existence of a whole tag? (im assuming we're only talking about the secondary tags)
[08:21:01] <khannatanmai> so the check would be for a predix-value pair, yeah I can see that
[08:21:15] <khannatanmai> *prefix
[08:21:35] <TinoDidriksen> I'm thinking in CG terms. You usually query whole tags.
[08:22:18] <khannatanmai> Do CG tags have a feature-value pair?
[08:22:22] <TinoDidriksen> But for many tools it would be just passing along a string with all secondary information as-is, since they don't need to query it.
[08:24:02] <khannatanmai> yup
[08:24:03] <TinoDidriksen> CG is pretty flexible. A tag is anything you want it to be. Some patterns are further blessed with extra behavior.
[08:24:52] <khannatanmai> so when we talk about querying here, we're only talking about secondary info right? cause don't want to mess with how the programs already use the primary tags
[08:28:20] <TinoDidriksen> Yup. I don't see how that would change in either case.
[08:28:35] <TinoDidriksen> If people ask for t:.* it's pretty clear what they want.
[08:30:23] <khannatanmai> yeah I just meant it seems like everything can be done by querying prefixes and getting values, rather than querying whole tags. even querying whole tags could be divided into querying prefix and checking value
[08:30:58] <khannatanmai> I mean I guess this is querying whole tags in a way
[08:31:26] <TinoDidriksen> Certainly. I just don't see what you gain by splitting the storage. To me, that says someone would want to just look at values, which makes no sense in my head.
[08:33:13] <khannatanmai> the values is where the information is right? the prefixes will be known beforehand. i.e. i enter a prefix and it gives me all the values associated to it, and then i use them for whatever purpose
[08:33:57] <TinoDidriksen> Yeah, that could be something people want to do...
[08:38:48] <TinoDidriksen> Well, now I'm thinking of efficient storage. flat_multimap<Tag,size_t>, where Tag is {string_view prefix, string_view value} - string_view into some larger shared storage to prevent multiple allocations. That would be amortized 1 allocation per reading, which is hard to beat.
[08:48:13] <khannatanmai> and I'll be able to query this by saying something like my tag is {t, .*} ?
[08:51:40] <TinoDidriksen> lower_bound() for {t, null} would find the first tag that has prefix t in O(log n) time, but given we assume there's few of these and flat storage, this will be a very low constant.
[08:52:20] <TinoDidriksen> Then increment the iterator until prefix no longer matches.
[08:54:12] <khannatanmai> the reason I was proposing separate storage, is that if I use an array to preserve position, and a multimap to map prefix->value, I don't need to use the array if I don't care about the position, and the lookup will be really fast
[08:54:34] <khannatanmai> so if I just want the surface form, I'll just look for values for the 'sf' key in the multimap, get them and be done with it
[08:55:03] <khannatanmai> this is of course assuming that order is something not everyone would want, or that it would be less needed than just a value query
[08:55:35] <TinoDidriksen> Almost nobody will want the order, but it must be preserved and available.
[08:56:15] <khannatanmai> yeah that's why the array is there, to preserve it and reconstruct tag order if needed
[08:57:03] <khannatanmai> I just think if most of the use is going to be getting values given prefix, then that's what it should make the fastest
[08:59:13] <TinoDidriksen> Yeah, but I'm sure sure this is where the real world wins over algorithmic perfection. Split storage would on paper be faster, but flat map would in reality be faster. CPU caches and indirection makes a big difference. But it should be pretty easy to work out, both on paper and with a quick test program.
[09:00:26] <TinoDidriksen> "sure sure"? Pretty sure, is what I wanted to write.
[09:02:47] <khannatanmai> Yeah fair enough. I don’t know enough about caches and indirection to know this but yeah running some tests might be worthwhile, just to show other people why we went one way

GENERATION ISSUES WITH SECONDARY TAGS

it's not able to generate the surface forms
I was of the opinion that if the monodix contained: xyz -> xy<a><b><c>, then during generation, xy<a><b><c><z> would generate xyz
am I wrong? or is this a different issue
TinoDidriksen
I thought that would work as well...
khannatanmai (@khannatanmai:matrix.org)
yeah I tested it a bit more this isn't a problem with the secondary tags apparently a generator needs an exact match
TinoDidriksen
Well, it's the generator that needs to consume most of these tags anyway, so fine.
khannatanmai (@khannatanmai:matrix.org)
interestingly the generator debug mode doesnt print the secondary tags. weird. but it's definitely a more fundamental problem. I think I need to go through the generator documentation a bit more
I guess the bidix has a special functionality to let through underspecified analyses?
which is why bidix just copies the secondary tags on to the tl side
TinoDidriksen
Which makes sense. Usually you don't need much to disambiguate, so why force people to write out all tags. But I don't see why the generator doesn't shortcut in the same way. Is there a cmdline flag to make it do that?
Not that it matters. The generator absolutely needs modification in any case.
khannatanmai (@khannatanmai:matrix.org)
can't see a helpful flag for lt-proc. I'm just thinking about why it's like this
TinoDidriksen
I can see why it would reject underspecified forms, but not overspecified. It should just print when it reaches a terminal and discard the remainder. But again, I assume it's an artifact of how FSTs work...they're not very flexible.
khannatanmai (@khannatanmai:matrix.org)
if my bidix gives xyz<n><f> in the tl, and my monodix has xyzes -> xyz<n>, you'd still want xyzes right
yeah underspecified would make sense ofc
well it cant just reach a terminal and discard
because if xyz<n> and xyz<n><f>, both exist in the monodix then it would never reach the latter
TinoDidriksen
A final terminal, then.
khannatanmai (@khannatanmai:matrix.org)
yeah if it's reached a terminal stage and then later the input ends and no new terminal stage comes then it should use the last terminal
im just trying to think if someone decided it's better to have just the lemma instead of trying to just assign something to an overspecified form
like I have dog->dog<n> in my monodix, and I get dog<n><pl> from the bidix lookup, and then since idk what the plural form is I'll just give a generation error
sort of makes sense I guess?
TinoDidriksen
For development, sure. It should be possible to detect these issues. But for production, you probably want the most generation you can get, potentially with a mark to show it's incomplete.
khannatanmai (@khannatanmai:matrix.org)
incomplete often means super incorrect though. In a way they did do that by showing the lemma and a # to show we didnt find an exact match in the dictionary
I'm just wondering if a better way is to have a way to ignore secondary tags when matching, rather than modifying the generator to match overspecified forms
unless if the dix has secondary tags as well (for the future), then it shouldnt ignore them while matching
TinoDidriksen
The generator has to use the secondary tags. It is what needs to deal with passed through surface forms and markup tags.
khannatanmai (@khannatanmai:matrix.org)
yeah it should definitely use them, I'm just saying it shouldn't use them while matching dictionary forms
anyway if we don't have secondary tags in the dictionary right now, the secondary tags aren't gonna play a part in the matching. in terms of trimming, the bidix wont have a tl output for the word so this wont be an issue, and for markup tags, the matching with surface form still happens independent of them