Difference between revisions of "Talk:Automatically trimming a monodix"
(Created page with '==Implementing automatic trimming in lttoolbox== The simplest method seems to be to first create the analyser in the normal way, then loop through all its states (see transducer.…')
(56 intermediate revisions by 2 users not shown)
==Implementing automatic trimming in lttoolbox==

The simplest method seems to be to first create the analyser in the normal way, then loop through all its states (see transducer.cc:Transducer::closure for a loop example), trying to do the same steps in parallel with the compiled bidix:

<pre>
trim(current_a, current_b):
    for symbol, next_a in analyser.transitions[current_a]:
        found = false
        for s, next_b in bidix.transitions[current_b]:
            if s==symbol:
                trim(next_a, next_b)
                found = true
        if !found && !current_b.isFinal():
            delete symbol from analyser.transitions[current_a]
        // else: all transitions from this point on will just be carried over unchanged by bidix

trim(analyser.initial, bidix.initial)
</pre>

Trimming while reading the XML file might have lower memory usage, but seems like more work, since pardefs are read before we get to an "initial" state.
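A minimal Python sketch of the trim pseudocode above, in case it helps to see it run. Everything here is an assumption made for illustration, not lttoolbox code: transducers are plain dicts of the form state -> {symbol: next_state}, both are deterministic, final states are given as separate sets, and the names <code>trim</code>, <code>accepts</code> and <code>build_example</code> are hypothetical. A <code>seen</code> set is added so that cycles do not recurse forever. Because it deletes in place, it is only safe when each analyser state is reached with a single bidix state (e.g. a tree-shaped analyser).

```python
def trim(analyser, bidix, bidix_finals, a, b, seen=None):
    """Delete, in place, analyser transitions that the bidix cannot follow."""
    if seen is None:
        seen = set()
    if (a, b) in seen:  # this state pair was already handled (guards against cycles)
        return
    seen.add((a, b))
    # Iterate over a copy, since transitions may be deleted during the loop.
    for symbol, next_a in list(analyser.get(a, {}).items()):
        next_b = bidix.get(b, {}).get(symbol)
        if next_b is not None:
            trim(analyser, bidix, bidix_finals, next_a, next_b, seen)
        elif b not in bidix_finals:
            del analyser[a][symbol]
        # else: the bidix is in a final state, so everything from here on
        # is carried over unchanged.


def accepts(fst, finals, start, symbols):
    """Return True if the symbol sequence leads from start to a final state."""
    state = start
    for sym in symbols:
        state = fst.get(state, {}).get(sym)
        if state is None:
            return False
    return state in finals


def build_example():
    """Toy data: the analyser knows cat<n><sg> and dog<n><sg>, the bidix only cat<n>."""
    analyser = {
        0: {"c": 1, "d": 6},
        1: {"a": 2}, 2: {"t": 3}, 3: {"<n>": 4}, 4: {"<sg>": 5},
        6: {"o": 7}, 7: {"g": 8}, 8: {"<n>": 9}, 9: {"<sg>": 10},
    }
    analyser_finals = {5, 10}
    bidix = {0: {"c": 1}, 1: {"a": 2}, 2: {"t": 3}, 3: {"<n>": 4}}
    bidix_finals = {4}
    return analyser, analyser_finals, bidix, bidix_finals
```

Calling <code>trim(analyser, bidix, bidix_finals, 0, 0)</code> on the toy data removes the "d" transition from the initial state and keeps the whole "cat" path, including the <sg> tag that the bidix never sees. The states of the removed path are left behind as unreachable; a real implementation would want to clean those up.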
Latest revision as of 14:47, 20 June 2014
HFST: possible to overcome compound overgeneration?
Assuming we don't use flags, we can compile an HFST transducer through ATT format to a working lttoolbox transducer.
Now the issue is that compounds are plain transitions back into some lexicon, without the compound-only-L/compound-R tags, so even though lt-trim should trim the compounds correctly (by treating them like <j/> transitions), the resulting analyser will over-generate:
jīvitarēkha/jīvitarēkha<n><sg><nom>/jīvitaṁ<n><cmp>+rēkha<n><sg><nom>
I.e., we will get non-lexicalised compound analyses along with the lexicalised ones.
One way to overcome this is to first compile the analyser from the ATT, then:
- go through it building a new version, but on seeing a +, we:
  - replace the transition with a single compound-only-L tag transitioning into the final state
  - let partial=copy_until_final(the_plus_transition), make a transition from start into partial, and connect the final state of partial, via a single compound-R tag, to the final state of our new FST
If the function copy_until_final sees a +, that transition is discarded, but a compound-only-L tag is added.
Note: this compounding method won't let you do stuff like "only allow adj adj or noun noun compounds" like you can in HFST.
- You can do that in the morphotactics though (e.g. adjective paradigms only redirect to the adjectivestem lexicon.) - Francis Tyers (talk) 13:33, 22 May 2014 (CEST)
- Uh, what happens if you simply specify compound-R and compound-only-L tags in the lexc?
  - This is also an option, while we were at it I'd choose a prettier couple of symbols though ;) - Francis Tyers (talk) 13:33, 22 May 2014 (CEST)
Say we have the FST "bidix", can we create "bidix [^+]* (+ bidix [^+]*)*" (where + is just the literal join symbol) and trim with that?
- seems to work!
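For anyone wanting to see what the language "bidix [^+]* (+ bidix [^+]*)*" accepts, here is a rough string-level analogy in Python. It is only an illustration under stated assumptions: the real trimming works on transducers rather than on strings, the bidix is stood in for by a set of hypothetical entry strings, and the function names <code>trimming_pattern</code> and <code>trim_analyses</code> are made up for this sketch.

```python
import re


def trimming_pattern(bidix_entries, join="+"):
    """Compile a regex for: bidix [^+]* (+ bidix [^+]*)*"""
    # Any one bidix entry, escaped so that < > and + are taken literally.
    entry = "(?:" + "|".join(re.escape(e) for e in sorted(bidix_entries)) + ")"
    # Whatever follows the entry, up to (not including) the next join symbol.
    rest = "[^" + re.escape(join) + "]*"
    return re.compile(entry + rest + "(?:" + re.escape(join) + entry + rest + ")*")


def trim_analyses(analyses, bidix_entries):
    """Keep only analyses in which every compound part starts with a bidix entry."""
    pattern = trimming_pattern(bidix_entries)
    return [a for a in analyses if pattern.fullmatch(a)]
```

With a hypothetical bidix containing only jīvitaṁ<n> and rēkha<n>, the analysis jīvitaṁ<n><cmp>+rēkha<n><sg><nom> is kept, because each part separated by + begins with a bidix entry, while an analysis whose second part is not in the bidix is dropped.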