Difference between revisions of "Talk:Automatically trimming a monodix"

Revision as of 09:48, 22 May 2014

HFST: possible to overcome compound overgeneration?

Assuming we don't use flag diacritics, we can compile an HFST transducer via the ATT format into a working lttoolbox transducer.
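
To make the ATT step concrete, here is a minimal sketch of reading an ATT dump and picking out the transitions that emit a +. It assumes the usual tab-separated layout (source, target, input, output, optional weight, with final states on lines of their own) and that epsilon is written as @0@; the function name parse_att and the sample data are made up for illustration.

```python
def parse_att(text):
    """Parse tab-separated ATT text into (arcs, finals).

    Assumed layout: 'src<TAB>dst<TAB>in<TAB>out[<TAB>weight]' for transitions
    and 'state[<TAB>weight]' for final states. Weights are ignored.
    """
    arcs, finals = [], set()
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 4:
            src, dst, inp, out = fields[:4]
            arcs.append((int(src), int(dst), inp, out))
        elif fields[0].strip():
            finals.add(int(fields[0]))
    return arcs, finals


# Hypothetical three-arc transducer: a:a, then the <n> tag, then a compound
# "+" that loops back to the start of the lexicon.
sample = "0\t1\ta\ta\n1\t2\t@0@\t<n>\n2\t0\t@0@\t+\n2\n"
arcs, finals = parse_att(sample)

# The transitions that would need rewriting are those with "+" on the output side.
plus_arcs = [arc for arc in arcs if arc[3] == "+"]
print(plus_arcs, finals)
```

In a real analyser the + transitions are the ones leading back into a lexicon, so this list is the set of places where a rewrite would have to cut.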

The issue is that compounds are plain transitions back into some lexicon, without the compound-only-L/compound-R tags. So even though lt-trim should trim these compounds correctly (by treating them like <j/> transitions), the resulting analyser will over-generate:

jīvitarēkha/jīvitarēkha<n><sg><nom>/jīvitaṁ<n><cmp>+rēkha<n><sg><nom>

I.e. we will get non-lexicalised compound analyses alongside the lexicalised ones.
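
As a quick way to spot this in analyser output, the analyses of a unit can be split into lexicalised ones and dynamically built compounds. This is only a rough sketch: it assumes that a + inside an analysis always marks a compound boundary (in other language pairs it can also join multiword readings), it ignores escaped slashes, and the helper name split_analyses is made up.

```python
def split_analyses(unit):
    """Split 'surface/analysis1/analysis2' into (surface, [analyses]).

    Surrounding ^ and $ from the Apertium stream format are stripped if present.
    """
    surface, *analyses = unit.strip("^$").split("/")
    return surface, analyses


unit = "jīvitarēkha/jīvitarēkha<n><sg><nom>/jīvitaṁ<n><cmp>+rēkha<n><sg><nom>"
surface, analyses = split_analyses(unit)

# Assumption: "+" inside an analysis marks a non-lexicalised compound.
lexicalised = [a for a in analyses if "+" not in a]
dynamic = [a for a in analyses if "+" in a]

print("lexicalised:", lexicalised)
print("dynamic:", dynamic)
```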

One way to overcome this is to first compile the analyser from the ATT, then:

  • go through it building a new version, but on seeing a + transition, we:
    • replace that transition with a single compound-only-L tag transitioning into the final state
    • let partial = copy_until_final(the_plus_transition), add a transition from the start state into partial, and connect the final state of partial to the final state of our new FST with a single compound-R tag

If copy_until_final itself sees a +, it discards that transition and adds a compound-only-L tag in its place.
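
The procedure above could look roughly like the following toy sketch. Everything here is an assumption for illustration: the Fst class is a stand-in rather than the lttoolbox data structure, copy_until_final takes more arguments than the one-argument call in the description, whole words are used as single symbols, and targets of + transitions found inside a partial copy are also queued (each target is copied only once).

```python
from collections import defaultdict, deque

EPS, PLUS = "", "+"
COMPOUND_L, COMPOUND_R = "<compound-only-L>", "<compound-R>"


class Fst:
    """Toy transducer: arcs[state] = [(input, output, target)]; state 0 is the start."""

    def __init__(self):
        self.arcs = defaultdict(list)
        self.finals = set()
        self.n_states = 1

    def new_state(self):
        self.n_states += 1
        return self.n_states - 1

    def add_arc(self, src, inp, out, dst):
        self.arcs[src].append((inp, out, dst))


def copy_until_final(old, new, old_entry, new_entry, final, end_tag, pending):
    """Copy what is reachable from old_entry into new, rooted at new_entry.

    A "+" arc is discarded and replaced by a COMPOUND_L arc into `final`; its
    old target is queued in `pending`. Old final states get an arc carrying
    `end_tag` into `final`.
    """
    mapping = {old_entry: new_entry}
    queue = deque([old_entry])
    while queue:
        state = queue.popleft()
        if state in old.finals:
            new.add_arc(mapping[state], EPS, end_tag, final)
        for inp, out, dst in old.arcs[state]:
            if out == PLUS:
                new.add_arc(mapping[state], inp, COMPOUND_L, final)
                pending.append(dst)
                continue
            if dst not in mapping:
                mapping[dst] = new.new_state()
                queue.append(dst)
            new.add_arc(mapping[state], inp, out, mapping[dst])


def split_compounds(old):
    """Build a new FST without "+" arcs, tagged for compound analysis."""
    new = Fst()
    final = new.new_state()
    new.finals.add(final)
    pending, done = [], set()
    # Main copy: ordinary words end with no extra tag.
    copy_until_final(old, new, 0, 0, final, EPS, pending)
    # One partial copy per distinct "+" target, entered from the start state.
    while pending:
        target = pending.pop()
        if target in done:
            continue
        done.add(target)
        entry = new.new_state()
        new.add_arc(0, EPS, EPS, entry)
        copy_until_final(old, new, target, entry, final, COMPOUND_R, pending)
    return new


def paths(fst, state=0, inp="", out=""):
    """Yield (input, output) for every accepting path; only safe on acyclic FSTs."""
    if state in fst.finals:
        yield inp, out
    for i, o, dst in fst.arcs[state]:
        yield from paths(fst, dst, inp + i, out + o)


# Toy analyser: "jīvita" can compound back into the lexicon via "+".
old = Fst()
mid, end = old.new_state(), old.new_state()
old.finals.add(end)
old.add_arc(0, "jīvita", "jīvitaṁ<n><cmp>", mid)
old.add_arc(mid, EPS, PLUS, 0)
old.add_arc(0, "rēkha", "rēkha<n><sg><nom>", end)
old.add_arc(0, "jīvitarēkha", "jīvitarēkha<n><sg><nom>", end)

result = set(paths(split_compounds(old)))
for surface, analysis in sorted(result):
    print(surface, "->", analysis)
```

In the toy result no analysis contains a + any more: the left part carries compound-only-L, the right-hand copies carry compound-R, and the lexicalised entry is still there untagged. Joining the tagged parts back together would then be left to the compound analysis at runtime.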


Note: this compounding method won't let you express restrictions like "only allow adj adj or noun noun compounds", as you can in HFST.