Difference between revisions of "Talk:Automatically trimming a monodix"

From Apertium

Revision as of 17:08, 16 October 2013

Implementing automatic trimming in lttoolbox

The simplest method seems to be to first create the analyser in the normal way, then loop through all its states (see transducer.cc:Transducer::closure for a loop example), trying to do the same steps in parallel with the compiled bidix:

trim(current_a, current_b):
  // guard against loops: handle each pair of states only once
  if (current_a, current_b) already visited:
    return
  mark (current_a, current_b) as visited

  for symbol, next_a in analyser.transitions[current_a]:
    found = false
    for s, next_b in bidix.transitions[current_b]:
      if s == symbol:
        trim(next_a, next_b)
        found = true
    if seen tags:
      found = true
    if !found && !current_b.isFinal():
      delete symbol from analyser.transitions[current_a]
    // else: all transitions from this point on will just be carried over unchanged by bidix

trim(analyser.initial, bidix.initial)
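
The pseudocode above can be tried out on toy data. The following is only a rough Python sketch, not lttoolbox code: it assumes each dictionary is a plain mapping from state to a list of (symbol, next_state) pairs, that tags are symbols written like <n>, and that "seen tags" becomes true once a tag has been matched on the current path. The helper names (build_trie, is_tag, accepted) are made up for the illustration. Real analyser transitions carry input:output pairs, which this sketch ignores.

```python
def is_tag(symbol):
    """Assumption: tags are symbols written like <n> or <pl>."""
    return symbol.startswith("<") and symbol.endswith(">")


def build_trie(paths):
    """Build a toy trie-shaped dictionary: state -> [(symbol, next_state)]."""
    transitions, finals = {0: []}, set()
    for path in paths:
        state = 0
        for symbol in path:
            for s, nxt in transitions[state]:
                if s == symbol:
                    state = nxt
                    break
            else:
                # No existing transition for this symbol: add a new state.
                new_state = len(transitions)
                transitions[new_state] = []
                transitions[state].append((symbol, new_state))
                state = new_state
        finals.add(state)
    return transitions, finals


def trim(analyser, bidix, bidix_finals, current_a=0, current_b=0,
         seen_tags=False, visited=None):
    """Delete analyser transitions that the bidix cannot follow (in place)."""
    visited = set() if visited is None else visited
    key = (current_a, current_b, seen_tags)
    if key in visited:  # guard against loops
        return
    visited.add(key)
    doomed = []
    for symbol, next_a in list(analyser[current_a]):
        found = False
        for s, next_b in bidix[current_b]:
            if s == symbol:
                trim(analyser, bidix, bidix_finals, next_a, next_b,
                     seen_tags or is_tag(symbol), visited)
                found = True
        if seen_tags:
            found = True
        if not found and current_b not in bidix_finals:
            doomed.append((symbol, next_a))
    analyser[current_a] = [t for t in analyser[current_a] if t not in doomed]


def accepted(transitions, finals, state=0, prefix=""):
    """List every string accepted from `state` (acyclic toy data only)."""
    out = [prefix] if state in finals else []
    for symbol, nxt in transitions[state]:
        out += accepted(transitions, finals, nxt, prefix + symbol)
    return sorted(out)


analyser, analyser_finals = build_trie([
    ["c", "a", "t", "<n>"],
    ["c", "a", "t", "<n>", "<pl>"],
    ["d", "o", "g", "<n>"],
])
bidix, bidix_finals = build_trie([["c", "a", "t", "<n>"]])

trim(analyser, bidix, bidix_finals)
result = accepted(analyser, analyser_finals)
```

Here the bidix only knows cat<n>, so the dog<n> entry is dropped, while cat<n><pl> survives because its extra tag comes after the bidix has already matched. One caveat of deleting in place: if a minimised analyser shares a state between several paths, a transition deleted under one pairing may still be needed under another, so this sketch is only safe on trie-shaped input.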


Trimming while reading the XML file might have lower memory usage, but seems like more work, since pardefs are read before we get to an "initial" state.


Some experiments are in the branches at https://github.com/unhammer/lttoolbox/branches


Unbalanced loops

Say the analyser is j+<n> while the bidix is j+jjj<n>. Ideally we could "trim" the analyser to j+jjj<n>, but is it possible to do that in a tractable manner?
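
One way to look at this question is the standard product construction for intersecting finite automata: instead of deleting transitions from the analyser in place, build a new automaton whose states are pairs (analyser state, bidix state). The number of pairs is bounded by the product of the two state counts, so the loop case stays finite. The Python sketch below is only an illustration on the example above, with hand-written toy automata and hypothetical function names (intersect, accepts). It treats both sides as plain acceptors and is not a claim about how lttoolbox does or should implement trimming.

```python
from collections import deque


def intersect(trans_a, finals_a, trans_b, finals_b, start_a=0, start_b=0):
    """Product construction: follow both automata in parallel."""
    start = (start_a, start_b)
    transitions = {start: []}
    queue = deque([start])
    while queue:
        pair = queue.popleft()
        a, b = pair
        for sym_a, next_a in trans_a.get(a, []):
            for sym_b, next_b in trans_b.get(b, []):
                if sym_a != sym_b:
                    continue
                target = (next_a, next_b)
                if target not in transitions:
                    # First time this pair of states is reached.
                    transitions[target] = []
                    queue.append(target)
                transitions[pair].append((sym_a, target))
    finals = {(a, b) for (a, b) in transitions
              if a in finals_a and b in finals_b}
    return start, transitions, finals


def accepts(start, transitions, finals, symbols):
    """Nondeterministic simulation over a list of symbols."""
    current = {start}
    for symbol in symbols:
        current = {nxt for state in current
                   for s, nxt in transitions.get(state, []) if s == symbol}
    return bool(current & finals)


# Analyser j+<n>: state 1 loops on j.
analyser = {0: [("j", 1)], 1: [("j", 1), ("<n>", 2)]}
analyser_finals = {2}

# Bidix j+jjj<n>: a loop on j, then exactly three more j before <n>.
bidix = {0: [("j", 1)], 1: [("j", 1), ("j", 2)], 2: [("j", 3)],
         3: [("j", 4)], 4: [("<n>", 5)]}
bidix_finals = {5}

start, trimmed, trimmed_finals = intersect(
    analyser, analyser_finals, bidix, bidix_finals)
```

On this toy input the result has six pair states and accepts four or more j followed by <n>, which is the j+jjj<n> language. The single looping analyser state ends up paired with four different bidix states, which is why in-place deletion on the analyser alone cannot express the result.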

Worse is better

If it's difficult to get a complete lt-trim solution that handles <g> and <j> perfectly, a stop-gap might be to distribute a dictionary of the multiwords with the language pair and leave only "simple" words in the monolingual package dictionary. The multiwords would be manually trimmed to the pair, the simple words trimmed with lt-trim, and a new lt-merge command would merge the two compiled dictionaries (as separate sections).

(An lt-merge command might also be helpful when compiling att transducers, since that compilation makes anything that looks like punctuation inconditional, and anything else standard.)