Difference between revisions of "Talk:Automatically trimming a monodix"

From Apertium

Latest revision as of 14:47, 20 June 2014

HFST: possible to overcome compound overgeneration?

Assuming we don't use flags, we can compile an HFST transducer through ATT format to a working lttoolbox transducer.

Now the issue is that compounds are plain transitions back into some lexicon, without the compound-only-L/compound-R tags, so even though lt-trim should trim the compounds correctly (by treating them like <j/> transitions), the resulting analyser will over-generate:

jīvitarēkha/jīvitarēkha<n><sg><nom>/jīvitaṁ<n><cmp>+rēkha<n><sg><nom>

I.e. we will get non-lexicalised compound analyses along with the lexicalised ones.

One way to overcome this is to first compile the analyser from the ATT, then:

  • walk through it, building a new transducer; whenever we see a + transition, we:
    • replace that transition with a single compound-only-L tag that leads straight into the final state
    • let partial=copy_until_final(the_plus_transition), add a transition from the start state into partial, and connect the final state of partial to the final state of our new FST with a single compound-R tag

If the function copy_until_final itself sees a +, it discards that transition and adds a compound-only-L tag in its place.
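
The steps above could be sketched roughly as follows. This is only an illustration on a toy data structure, not lttoolbox code: the dict-of-arcs representation, the function names split_compounds and pairs, and the exact tag spellings are assumptions made for the example. It also shares one "right" copy of the analyser between all + transitions instead of building a separate partial per transition, which should accept the same strings but is a simplification of the description.

```python
from collections import defaultdict

EPS = ""                        # epsilon (empty symbol)
JOIN = "+"                      # join symbol on the analysis side
ONLY_L = "<compound-only-L>"    # assumed spelling of the left-part tag
COMPOUND_R = "<compound-R>"     # assumed spelling of the right-part tag


def split_compounds(start, finals, arcs):
    """Cut every + transition of an analyser.

    arcs maps a state to a list of (input, output, target) triples.
    Returns (start, finals, arcs) of the rebuilt transducer.
    """
    new_arcs = defaultdict(list)
    new_start, new_final = ("main", start), ("final", None)
    seen, entries = set(), set()
    todo = [new_start]
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        side, state = node
        if state in finals:
            # Finals of the "right" copy are closed off with compound-R.
            closing = EPS if side == "main" else COMPOUND_R
            new_arcs[node].append((EPS, closing, new_final))
        for inp, out, dst in arcs.get(state, []):
            if out == JOIN:
                # Left part ends here, marked compound-only-L.
                new_arcs[node].append((inp, ONLY_L, new_final))
                # What followed the + becomes reachable from the start.
                entry = ("right", dst)
                if entry not in entries:
                    entries.add(entry)
                    new_arcs[new_start].append((EPS, EPS, entry))
                todo.append(entry)
            else:
                new_arcs[node].append((inp, out, (side, dst)))
                todo.append((side, dst))
    return new_start, {new_final}, dict(new_arcs)


def pairs(start, finals, arcs):
    """Enumerate (surface, analysis) pairs; only safe for acyclic FSTs."""
    found, todo = set(), [(start, "", "")]
    while todo:
        state, surface, analysis = todo.pop()
        if state in finals:
            found.add((surface, analysis))
        for inp, out, dst in arcs.get(state, []):
            todo.append((dst, surface + inp, analysis + out))
    return found
```

On a toy analyser where a<n> may be followed by <cmp>+ looping back to the start, this gives the plain analysis a<n>, a left part a<n><cmp><compound-only-L> and a right part a<n><compound-R>; the compound loop itself is gone, so the result is acyclic as long as every cycle in the original went through a +.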


Note: this compounding method won't let you do stuff like "only allow adj adj or noun noun compounds" like you can in HFST.

    You can do that in the morphotactics though (e.g. adjective paradigms only redirect to the adjectivestem lexicon). - Francis Tyers (talk) 13:33, 22 May 2014 (CEST)
  Uh, what happens if you simply specify compound-R and compound-only-L tags in the lexc?
    This is also an option; while we're at it, I'd choose a prettier couple of symbols though ;) - Francis Tyers (talk) 13:33, 22 May 2014 (CEST)


Say we have the FST "bidix", can we create "bidix [^+]* (+ bidix [^+]*)*" (where + is just the literal join symbol) and trim with that?

seems to work!
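
For intuition only, here is a string-level analogy of that expression using Python's re module. It is not the FST construction itself: the bidix is stood in for by a plain list of left-side prefixes, and the names trimming_pattern and survives_trim are made up for the example.

```python
import re


def trimming_pattern(bidix_entries):
    """Build the regex analogue of: bidix [^+]* (+ bidix [^+]*)*"""
    # "bidix" is approximated by an alternation over its (escaped) entries.
    entry = "(?:" + "|".join(re.escape(e) for e in bidix_entries) + ")"
    part = entry + r"[^+]*"   # a bidix prefix, then anything up to the next +
    return re.compile(part + r"(?:\+" + part + r")*")


def survives_trim(analysis, bidix_entries):
    """True if every +-separated part of the analysis starts with a bidix entry."""
    return trimming_pattern(bidix_entries).fullmatch(analysis) is not None
```

With the entries jīvitaṁ<n> and rēkha<n>, the compound analysis jīvitaṁ<n><cmp>+rēkha<n><sg><nom> matches, while jīvitarēkha<n><sg><nom> does not, since no entry is a prefix of it. The list of entries is assumed to be non-empty.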