Difference between revisions of "Talk:Automatically trimming a monodix"
(→TODO) |
|||
(15 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
== HFST: possible to overcome compound overgeneration? == |
|||
==Implementing automatic trimming in lttoolbox== |
|||
The current method: compile the analyser in the normal way, then lt-trim loads it and loops through all its states, trying to do the same steps in parallel with the compiled bidix: |
|||
Assuming we ''don't use flags'', we can compile an HFST transducer through [[ATT]] format to a working lttoolbox transducer. |
|||
<pre> |
|||
trim(current_a, current_b): |
|||
Now the issue is that compounds are plain transitions back into some lexicon, without the compound-only-L/compound-R tags, so even though lt-trim should trim the compounds correctly (by treating them like <j/> transitions), the resulting analyser will over-generate: |
|||
for symbol, next_a in analyser.transitions[current_a]: |
|||
<pre>jīvitarēkha/jīvitarēkha<n><sg><nom>/jīvitaṁ<n><cmp>+rēkha<n><sg><nom></pre> |
|||
found = false |
|||
Ie. we will get non-lexicalised compound analyses along with the lexicalised ones. |
|||
for s, next_b in bidix.transitions[current_b]: |
|||
if s==symbol: |
|||
trim(next_a, next_b) |
|||
found = true |
|||
if seen tags: |
|||
found = true |
|||
One way to overcome this is to first compile the analyser from the ATT, then: |
|||
if !found && !current_b.isFinal(): |
|||
* go through it building a new version, but on seeing a +, we: |
|||
delete symbol from analyser.transitions[current_a] |
|||
** replace the transition with a single compound-only-L tag transitioning into final state |
|||
** let partial=copy_until_final(the_plus_transition) and make a transition from start into partial, and we connect the final state of partial with a single tag compound-R into the final state of our new FST |
|||
If the function copy_until_final sees a +, that transition is discarded, but a compound-only-L tag is added. |
|||
// else: all transitions from this point on will just be carried over unchanged by bidix |
|||
trim(analyser.initial, bidix.initial) |
|||
</pre> |
|||
Note: this compounding method won't let you do stuff like "only allow adj adj or noun noun compounds" like you can in HFST. |
|||
[[Image:Product-automaton-intersection.png|thumb|400px|right|product automaton for intersection]] |
|||
::You can do that in the morphotactics though (e.g. adjective paradigms only redirect to the adjectivestem lexicon.) - [[User:Francis Tyers|Francis Tyers]] ([[User talk:Francis Tyers|talk]]) 13:33, 22 May 2014 (CEST) |
|||
Trimming while reading the XML file might have lower memory usage, but seems like more work, since pardefs are read before we get to an "initial" state. |
|||
: Uh, what happens if you simply specify compound-R and compound-only-L tags in the lexc? |
|||
A slightly different approach is to create the '''product automaton''' for intersection, marking as final only state-pairs where both parts of the state-pair are final in the original automata. Minimisation should remove unreachable state-pairs. However, this quickly blows up in memory usage since it creates ''all'' possible state pairs first (cartesian product), not just the useful ones. |
|||
::This is also an option, while we were at it I'd choose a prettier couple of symbols though ;) - [[User:Francis Tyers|Francis Tyers]] ([[User talk:Francis Tyers|talk]]) 13:33, 22 May 2014 (CEST) |
|||
https://github.com/unhammer/lttoolbox/branches has some experiments, see e.g. branches product-intersection and df-intersection |
|||
<br clear="all" /> |
|||
Say we have the FST "bidix", can we create "<code>bidix [^+]* (+ bidix [^+]*)*</code>" (where + is just the literal join symbol) and trim with that? |
|||
===TODO=== |
|||
: seems to work! |
|||
* Would it make things faster if we first clear out the right-sides of the bidix and minimize that? |
|||
* remove all those ifdef DEBUG's and wcerr's |
|||
* more tests? |
|||
* if no objections: git svn dcommit |
|||
==#-type multiwords== |
|||
The implementation: first "preprocess" (in memory) the bidix FST so it has the same format as the analyser, ie. instead of "take# out<vblex>" it has "take<vblex># out". |
|||
Say you have these entries in the bidix: <pre> |
|||
take# out<vblex> |
|||
take# out<n> |
|||
take# in<vblex> |
|||
take<n> |
|||
take<vblex> |
|||
</pre> |
|||
We do a depth-first traversal, and then after reading all of "take", we see t=transition(A, B, ε, #). We remove that transition, and instead add the result of copyWithTagsFirst(t), which returns a new transducer that represents these partial analyses: |
|||
<pre><vblex># out |
|||
<n># out |
|||
<vblex># in</pre> |
|||
===pseudocode=== |
|||
<pre> |
|||
def moveLemqsLast() |
|||
new_t = Transducer() |
|||
seen = set() |
|||
todo = [initial] |
|||
while todo: |
|||
src = todo.pop() |
|||
for left, right, trg in transitions[src]: |
|||
if left == '#': |
|||
new_t.transitions[src][epsilon].insertTransducer( copyWithTagsFirst(trg) ) |
|||
else: |
|||
new_t.linkStates(src, trg, left, right) |
|||
if trg not in seen: |
|||
todo.push(trg) |
|||
seen.insert(trg) |
|||
return new_t |
|||
def copyWithTagsFirst(start): |
|||
seen = set() |
|||
new_t = Transducer() |
|||
lemq = Transducer() |
|||
m = { start: new_t.initial } |
|||
finally = [] |
|||
todo = [(start,start)] |
|||
while todo: |
|||
src,lemq_last = todo.pop() |
|||
for left, right, trg in transitions[src]: |
|||
if not isTag(left): |
|||
lemq.linkStates(src, trg, label) |
|||
if trg,trg not in seen: |
|||
todo.push((trg, trg)) |
|||
else: |
|||
if src == lemq_last |
|||
new_t.linkState(m[start], m[trg], epsilon) |
|||
else: |
|||
new_t.linkState(m[src], m[trg], label) |
|||
if src in finals: |
|||
finally.insert((src, lemq_last)) |
|||
if src,lemq_last not in seen: |
|||
todo.push((trg, lemq_last)) |
|||
seen.insert((src, lemq_last)) |
|||
for lasttag, lemq_last in finally: |
|||
newlemq = Transducer(lemq) # copy lemq |
|||
newlemq.finals = set(lemq_last) # let lemq_last be the only final state in newlemq |
|||
newlemq.minimize() |
|||
new_t.insertTransducer(m[lasttag], newlemq) # append newlemq after finaltag |
|||
return new_t |
|||
</pre> |
|||
== Compounds vs trimming in HFST == |
|||
The sme.lexc can't be trimmed using the simple HFST trick, due to compounds. |
|||
Say you have '''cake n sg''', '''cake n pl''', '''beer n pl''' and '''beer n sg''' in monodix, while bidix has '''beer n''' and '''wine n'''. The HFST method without compounding is to intersect '''(cake|beer) n (sg|pl)''' with '''(beer|wine) n .*''' to get '''beer n (sg|pl)'''. |
|||
But HFST represents compounding as a transition from the end of the singular noun to the beginning of the (noun) transducer, so a compounding HFST actually looks like |
|||
: '''((cake|beer) n sg)*(cake|beer) n (sg|pl)''' |
|||
The intersection of this with |
|||
: '''(beer|wine) n .*''' |
|||
is |
|||
: '''(beer n sg)*(cake|beer) n (sg|pl) | beer n pl''' |
|||
when it should have been |
|||
: '''(beer n sg)*(beer n (sg|pl)''' |
|||
Lttoolbox doesn't represent compounding by extra circular transitions, but instead by a special restart symbol interpreted while analysing. |
|||
lt-trim is able to understand compounds by simply skipping the compund tags |
Latest revision as of 14:47, 20 June 2014
HFST: possible to overcome compound overgeneration?[edit]
Assuming we don't use flags, we can compile an HFST transducer through ATT format to a working lttoolbox transducer.
Now the issue is that compounds are plain transitions back into some lexicon, without the compound-only-L/compound-R tags, so even though lt-trim should trim the compounds correctly (by treating them like <j/> transitions), the resulting analyser will over-generate:
jīvitarēkha/jīvitarēkha<n><sg><nom>/jīvitaṁ<n><cmp>+rēkha<n><sg><nom>
Ie. we will get non-lexicalised compound analyses along with the lexicalised ones.
One way to overcome this is to first compile the analyser from the ATT, then:
- go through it building a new version, but on seeing a +, we:
- replace the transition with a single compound-only-L tag transitioning into final state
- let partial=copy_until_final(the_plus_transition) and make a transition from start into partial, and we connect the final state of partial with a single tag compound-R into the final state of our new FST
If the function copy_until_final sees a +, that transition is discarded, but a compound-only-L tag is added.
Note: this compounding method won't let you do stuff like "only allow adj adj or noun noun compounds" like you can in HFST.
- You can do that in the morphotactics though (e.g. adjective paradigms only redirect to the adjectivestem lexicon.) - Francis Tyers (talk) 13:33, 22 May 2014 (CEST)
- Uh, what happens if you simply specify compound-R and compound-only-L tags in the lexc?
- This is also an option, while we were at it I'd choose a prettier couple of symbols though ;) - Francis Tyers (talk) 13:33, 22 May 2014 (CEST)
Say we have the FST "bidix", can we create "bidix [^+]* (+ bidix [^+]*)*
" (where + is just the literal join symbol) and trim with that?
- seems to work!