Talk:Automatically trimming a monodix
HFST: possible to overcome compound overgeneration?
Assuming we don't use flags, we can compile an HFST transducer through ATT format to a working lttoolbox transducer.
The issue now is that compounds are plain transitions back into some lexicon, without the compound-only-L/compound-R tags. So even though lt-trim should trim the compounds correctly (by treating them like <j/> transitions), the resulting analyser will over-generate:
I.e. we will get non-lexicalised compound analyses along with the lexicalised ones.
One way to overcome this is to first compile the analyser from the ATT, then:
- go through it building a new version, but on seeing a +, we:
  - replace the transition with a single compound-only-L tag transitioning into the final state
  - let partial=copy_until_final(the_plus_transition), make a transition from start into partial, and connect the final state of partial with a single compound-R tag into the final state of our new FST
If the function copy_until_final sees a +, that transition is discarded, but a compound-only-L tag is added.
Note: this compounding method won't let you do stuff like "only allow adj adj or noun noun compounds" like you can in HFST.
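The steps above could be sketched roughly as follows on a toy transducer. This is only an illustration, not lttoolbox or HFST code: the dict-of-arcs representation, the names rewrite and strings, the fresh start/final states and the exact tag spellings are all assumptions made for the example, and labels are whole strings rather than single symbols.

```python
from collections import deque

JOIN = "+"                       # literal join symbol from the ATT file
ONLY_L = "<compound-only-L>"     # tag spellings are assumptions of this sketch
COMP_R = "<compound-R>"
EPS = ""                         # epsilon label
START, FINAL = "start", "final"  # fresh states of the new FST


def rewrite(arcs, start, finals):
    """arcs maps state -> [(label, target)]; returns the arcs of the new FST."""
    new = {}

    def add(src, label, dst):
        bucket = new.setdefault(src, [])
        if (label, dst) not in bucket:  # avoid duplicate arcs
            bucket.append((label, dst))

    def copy_until_final(root, kind, closing):
        """Copy what is reachable from root without crossing a JOIN arc."""
        seen, queue, join_targets = {root}, deque([root]), []
        while queue:
            state = queue.popleft()
            if state in finals:
                # Old final states lead into the single new final state.
                add((kind, state), closing, FINAL)
            for label, target in arcs.get(state, []):
                if label == JOIN:
                    # Discard the join arc, but mark the word so far as a left part.
                    add((kind, state), ONLY_L, FINAL)
                    join_targets.append(target)
                else:
                    add((kind, state), label, (kind, target))
                    if target not in seen:
                        seen.add(target)
                        queue.append(target)
        return join_targets

    # Plain (non-compound) analyses: the original FST minus its join arcs.
    add(START, EPS, ("main", start))
    pending = copy_until_final(start, "main", EPS)

    # One partial copy per distinct join target, closed with compound-R.
    copied = set()
    while pending:
        root = pending.pop()
        if root in copied:
            continue
        copied.add(root)
        kind = ("right", root)
        add(START, EPS, (kind, root))
        pending += copy_until_final(root, kind, COMP_R)
    return new


def strings(new, state=START):
    """Enumerate all label sequences of the (now acyclic) toy FST."""
    if state == FINAL:
        yield ""
    for label, target in new.get(state, []):
        for rest in strings(new, target):
            yield label + rest


# Toy analyser: two words, plus a join arc looping back to the root lexicon.
toy = {0: [("dog", 1), ("house", 1)], 1: [(JOIN, 0)]}
result = set(strings(rewrite(toy, 0, {1})))
```

In this toy case the loop through + is gone: each word comes out plain, with compound-only-L, or with compound-R, and combining left and right parts is left to the compound handling at analysis time.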
- Uh, what happens if you simply specify compound-R and compound-only-L tags in the lexc?
Say we have the FST "bidix", can we create "bidix [^+]* (+ bidix [^+]*)*" (where + is just the literal join symbol) and trim with that?
- seems to work!
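To make the pattern concrete, here is a small string-level sketch using Python's re module. It is not how lt-trim works internally (that operates on transducers); the toy bidix entries, the analysis strings and the helper name trim_pattern are made up for the illustration.

```python
import re


def trim_pattern(bidix_entries):
    """Build bidix [^+]* (+ bidix [^+]*)* as a regex over analysis strings."""
    bidix = "(?:" + "|".join(re.escape(e) for e in bidix_entries) + ")"
    return re.compile(bidix + r"[^+]*(?:\+" + bidix + r"[^+]*)*")


# Hypothetical bidix prefixes and analyser output, for illustration only.
pattern = trim_pattern(["dog<n>", "house<n>"])
analyses = [
    "dog<n><sg>",
    "dog<n><sg>+house<n><pl>",
    "cat<n><sg>",
    "dog<n><sg>+cat<n><sg>",
]

# Keep an analysis only if every +-separated part starts with a bidix entry.
kept = [a for a in analyses if pattern.fullmatch(a)]
```

Here kept retains the first two analyses and drops the two containing "cat", since each part between join symbols has to begin with something the bidix knows.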