Difference between revisions of "Talk:Automatically trimming a monodix"

From Apertium
Jump to navigation Jump to search
 
(16 intermediate revisions by 2 users not shown)
Line 1: Line 1:
  +
== HFST: possible to overcome compound overgeneration? ==
==Implementing automatic trimming in lttoolbox==
 
The current method: compile the analyser in the normal way, then lt-trim loads it and loops through all its states, trying to do the same steps in parallel with the compiled bidix:
 
   
  +
Assuming we ''don't use flags'', we can compile an HFST transducer through [[ATT]] format to a working lttoolbox transducer.
<pre>
 
trim(current_a, current_b):
 
   
  +
Now the issue is that compounds are plain transitions back into some lexicon, without the compound-only-L/compound-R tags, so even though lt-trim should trim the compounds correctly (by treating them like &lt;j/&gt; transitions), the resulting analyser will over-generate:
for symbol, next_a in analyser.transitions[current_a]:
 
   
  +
<pre>jīvitarēkha/jīvitarēkha<n><sg><nom>/jīvitaṁ<n><cmp>+rēkha<n><sg><nom></pre>
found = false
 
   
  +
Ie. we will get non-lexicalised compound analyses along with the lexicalised ones.
for s, next_b in bidix.transitions[current_b]:
 
if s==symbol:
 
trim(next_a, next_b)
 
found = true
 
if seen tags:
 
found = true
 
   
  +
One way to overcome this is to first compile the analyser from the ATT, then:
if !found && !current_b.isFinal():
 
  +
* go through it building a new version, but on seeing a +, we:
delete symbol from analyser.transitions[current_a]
 
  +
** replace the transition with a single compound-only-L tag transitioning into final state
  +
** let partial=copy_until_final(the_plus_transition) and make a transition from start into partial, and we connect the final state of partial with a single tag compound-R into the final state of our new FST
   
  +
If the function copy_until_final sees a +, that transition is discarded, but a compound-only-L tag is added.
// else: all transitions from this point on will just be carried over unchanged by bidix
 
   
trim(analyser.initial, bidix.initial)
 
</pre>
 
   
  +
Note: this compounding method won't let you do stuff like "only allow adj adj or noun noun compounds" like you can in HFST.
[[Image:Product-automaton-intersection.png|thumb|400px|right|product automaton for intersection]]
 
   
  +
::You can do that in the morphotactics though (e.g. adjective paradigms only redirect to the adjectivestem lexicon.) - [[User:Francis Tyers|Francis Tyers]] ([[User talk:Francis Tyers|talk]]) 13:33, 22 May 2014 (CEST)
Trimming while reading the XML file might have lower memory usage, but seems like more work, since pardefs are read before we get to an "initial" state.
 
   
  +
: Uh, what happens if you simply specify compound-R and compound-only-L tags in the lexc?
A slightly different approach is to create the '''product automaton''' for intersection, marking as final only state-pairs where both parts of the state-pair are final in the original automata. Minimisation should remove unreachable state-pairs. However, this quickly blows up in memory usage since it creates ''all'' possible state pairs first (cartesian product), not just the useful ones.
 
   
  +
::This is also an option, while we were at it I'd choose a prettier couple of symbols though ;) - [[User:Francis Tyers|Francis Tyers]] ([[User talk:Francis Tyers|talk]]) 13:33, 22 May 2014 (CEST)
https://github.com/unhammer/lttoolbox/branches has some experiments, see e.g. branches product-intersection and df-intersection
 
   
<br clear="all" />
 
   
  +
Say we have the FST "bidix", can we create "<code>bidix [^+]* (+ bidix [^+]*)*</code>" (where + is just the literal join symbol) and trim with that?
===TODO===
 
  +
: seems to work!
* Would it make things faster if we first clear out the right-sides of the bidix and minimize that?
 
 
==#-type multiwords==
 
The implementation: first "preprocess" (in memory) the bidix FST so it has the same format as the analyser, ie. instead of "take# out&lt;vblex&gt;" it has "take&lt;vblex&gt;# out".
 
 
Say you have these entries in the bidix: <pre>
 
take# out<vblex>
 
take# out<n>
 
take# in<vblex>
 
take<n>
 
take<vblex>
 
</pre>
 
 
We do a depth-first traversal, and then after reading all of "take", we see t=transition(A, B, ε, #). We remove that transition, and instead add the result of copyWithTagsFirst(t), which returns a new transducer that represents these partial analyses:
 
<pre><vblex># out
 
<n># out
 
<vblex># in</pre>
 
 
===pseudocode===
 
<pre>
 
def moveLemqsLast()
 
new_t = Transducer()
 
seen = set()
 
todo = [initial]
 
while todo:
 
src = todo.pop()
 
for left, right, trg in transitions[src]:
 
if left == '#':
 
new_t.transitions[src][epsilon].insertTransducer( copyWithTagsFirst(trg) )
 
else:
 
new_t.linkStates(src, trg, left, right)
 
if trg not in seen:
 
todo.push(trg)
 
seen.insert(trg)
 
return new_t
 
 
def copyWithTagsFirst(start):
 
seen = set()
 
new_t = Transducer()
 
lemq = Transducer()
 
m = { start: new_t.initial }
 
finally = []
 
todo = [(start,start)]
 
while todo:
 
src,lemq_last = todo.pop()
 
for left, right, trg in transitions[src]:
 
if not isTag(left):
 
lemq.linkStates(src, trg, label)
 
if trg,trg not in seen:
 
todo.push((trg, trg))
 
else:
 
if src == lemq_last
 
new_t.linkState(m[start], m[trg], epsilon)
 
else:
 
new_t.linkState(m[src], m[trg], label)
 
if src in finals:
 
finally.insert((src, lemq_last))
 
if src,lemq_last not in seen:
 
todo.push((trg, lemq_last))
 
seen.insert((src, lemq_last))
 
for lasttag, lemq_last in finally:
 
newlemq = Transducer(lemq) # copy lemq
 
newlemq.finals = set(lemq_last) # let lemq_last be the only final state in newlemq
 
newlemq.minimize()
 
new_t.insertTransducer(m[lasttag], newlemq) # append newlemq after finaltag
 
return new_t
 
</pre>
 
 
== Compounds vs trimming in HFST ==
 
 
The sme.lexc can't be trimmed using the simple HFST trick, due to compounds.
 
 
Say you have '''cake n sg''', '''cake n pl''', '''beer n pl''' and '''beer n sg''' in monodix, while bidix has '''beer n''' and '''wine n'''. The HFST method without compounding is to intersect '''(cake|beer) n (sg|pl)''' with '''(beer|wine) n .*''' to get '''beer n (sg|pl)'''.
 
 
But HFST represents compounding as a transition from the end of the singular noun to the beginning of the (noun) transducer, so a compounding HFST actually looks like
 
: '''((cake|beer) n sg)*(cake|beer) n (sg|pl)'''
 
The intersection of this with
 
: '''(beer|wine) n .*'''
 
is
 
: '''(beer n sg)*(cake|beer) n (sg|pl) | beer n pl'''
 
when it should have been
 
: '''(beer n sg)*(beer n (sg|pl)'''
 
 
 
Lttoolbox doesn't represent compounding by extra circular transitions, but instead by a special restart symbol interpreted while analysing.
 
lt-trim is able to understand compounds by simply skipping the compund tags
 

Latest revision as of 14:47, 20 June 2014

HFST: possible to overcome compound overgeneration?[edit]

Assuming we don't use flags, we can compile an HFST transducer through ATT format to a working lttoolbox transducer.

Now the issue is that compounds are plain transitions back into some lexicon, without the compound-only-L/compound-R tags, so even though lt-trim should trim the compounds correctly (by treating them like <j/> transitions), the resulting analyser will over-generate:

jīvitarēkha/jīvitarēkha<n><sg><nom>/jīvitaṁ<n><cmp>+rēkha<n><sg><nom>

Ie. we will get non-lexicalised compound analyses along with the lexicalised ones.

One way to overcome this is to first compile the analyser from the ATT, then:

  • go through it building a new version, but on seeing a +, we:
    • replace the transition with a single compound-only-L tag transitioning into final state
    • let partial=copy_until_final(the_plus_transition) and make a transition from start into partial, and we connect the final state of partial with a single tag compound-R into the final state of our new FST

If the function copy_until_final sees a +, that transition is discarded, but a compound-only-L tag is added.


Note: this compounding method won't let you do stuff like "only allow adj adj or noun noun compounds" like you can in HFST.

You can do that in the morphotactics though (e.g. adjective paradigms only redirect to the adjectivestem lexicon.) - Francis Tyers (talk) 13:33, 22 May 2014 (CEST)
Uh, what happens if you simply specify compound-R and compound-only-L tags in the lexc?
This is also an option, while we were at it I'd choose a prettier couple of symbols though ;) - Francis Tyers (talk) 13:33, 22 May 2014 (CEST)


Say we have the FST "bidix", can we create "bidix [^+]* (+ bidix [^+]*)*" (where + is just the literal join symbol) and trim with that?

seems to work!