Difference between revisions of "Lt-trim"
(GitHub migration) |
|||
(24 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
{{TOCD}} |
|||
'''lt-trim''' is the [[lttoolbox]] application responsible for trimming compiled dictionaries. The |
'''lt-trim''' is the [[lttoolbox]] application responsible for trimming compiled dictionaries. The |
||
analyses (right-side when compiling lr) of analyser_binary are trimmed |
analyses (right-side when compiling lr) of analyser_binary are trimmed |
||
Line 15: | Line 16: | ||
trimming. Make sure your lemmas have the same capitalisation in the |
trimming. Make sure your lemmas have the same capitalisation in the |
||
different dictionaries. |
different dictionaries. |
||
Also, you should not have literal + or # in your lemmas. Since lt-comp |
|||
doesn't escape these, lt-trim cannot handle them either, and you may |
|||
get @-marked output this way. You can analyse + or # by having + or # |
|||
in the <l> and some other string (e.g. "plus") in the <r>. |
|||
You should not trim a generator unless you have a '''very''' simple |
You should not trim a generator unless you have a '''very''' simple |
||
translator pipeline, since the output of bidix seldom goes unchanged |
translator pipeline, since the output of bidix seldom goes unchanged |
||
through transfer. |
through transfer. |
||
Trimming is useful when monolingual data has been split out into its own package. We keep monolingual packages in the SVN module [[Languages]]. |
|||
==Installation== |
|||
This is part of the GitHub version of [[lttoolbox]], see [[Installation]]. |
|||
==Usage== |
==Usage== |
||
Line 25: | Line 37: | ||
E.g. to trim ca-en.automorf.bin using ca-en.autobil.bin: |
E.g. to trim ca-en.automorf.bin using ca-en.autobil.bin: |
||
<pre>$ lt-trim ca-en.automorf.bin ca-en.autobil.bin ca-en.automorf-trimmed.bin</pre> |
<pre>$ lt-trim ca-en.automorf.bin ca-en.autobil.bin ca-en.automorf-trimmed.bin</pre> |
||
You should now be able to type |
|||
<pre>$ echo "bon" | lt-proc ca-en.automorf-trimmed.bin</pre> |
|||
and if the form "bon" has an analysis in ca-en.automorf.bin which would also pass through ca-en.autobil.bin, you'll get the same analysis out. If not, you'll get unknown word. |
|||
==Implementation== |
==Implementation== |
||
Line 31: | Line 49: | ||
It loads analyser and bidix (named <code>trimmer</code> in the code), and loops through all the analyser states, trying to do the same steps in parallel with the bidix. Transitions which are possible to do in both, are added to the <code>trimmed</code> analyser. |
It loads analyser and bidix (named <code>trimmer</code> in the code), and loops through all the analyser states, trying to do the same steps in parallel with the bidix. Transitions which are possible to do in both, are added to the <code>trimmed</code> analyser. |
||
First, the bidix binary is preprocessed a bit, in memory: |
|||
==#-type multiwords== |
|||
* all sections are unioned into one big section |
|||
** see <code>Transducer::unionWith</code> |
|||
* the final state gets a "loopback" transition to itself for every tag in the monodix |
|||
** see <code>Transducer::appendDotStar</code> and <code>Alphabet::createLoopbackSymbols</code> |
|||
** This is because <code>foo<n><ind></code> should pass through <code>lt-proc -b bidix.bin</code> even if only <code>foo<n></code> is actually in bidix |
|||
* all lemq's (the # or <code><g></code> group elements) are moved so they are ''after'' the tags |
|||
** monodix always has the # part after tags, while bidix has them on the lemma |
|||
** see <code>Transducer::moveLemqsLast</code>, and the below sections on multiwords |
|||
Then, in <code>Transducer::intersect</code> we do a depth first traversal of the states of the monodix, building up a new trimmed monodix. |
|||
===Pseudocode for intersect=== |
|||
<pre> |
|||
Transducer trimmed; |
|||
m = { initial: trimmed.initial } |
|||
SearchState current = (initial, trimmer.initial, trimmer.initial) |
|||
todo = [ current ] |
|||
seen = {} |
|||
while(todo): |
|||
current = todo.pop() |
|||
seen.insert(current) |
|||
(this_src, trimmer_src, trimmer_preplus) = current |
|||
for this_trg, this_left, this_right in transitions[this_src]: |
|||
if this_right == '+': |
|||
SearchState next = (this_trg, trimmer.initial, trimmer_src) |
|||
if not next in seen: todo.push(next) |
|||
if not this_trg in m: m[this_trg]=trimmed.newState() |
|||
trimmed.linkStates(m[this_src], m[this_trg], this_right, this_left) |
|||
else: |
|||
if this_right == '#' and trimmer_preplus != trimmer_src: |
|||
# We've seen a '#' after a '+', go back in trimmer: |
|||
trimmer_src = trimmer_preplus |
|||
for trimmer_trg, trimmer_left, trimmer_right in trimmer.transitions[trimmer_src]: |
|||
if trimmer_preplus == trimmer_src: # we've not yet seen a '#' |
|||
trimmer_preplus_next = trimmer_trg |
|||
else: # we've seen a '#' |
|||
trimmer_preplus_next = trimmer_preplus |
|||
if trimmer_left==ε and this_right!=ε: |
|||
# Just skip the epsilon in trimmer: |
|||
SearchState next = (this_src, trimmer_trg, trimmer_preplus_next) |
|||
if not next in seen: todo.push(next) |
|||
elif this_right == trimmer_left or isCompoundSymbol(this_right) or this_right==ε: |
|||
if isCompoundSymbol(this_right) or this_right==ε: |
|||
# Just skip the epsilon/cmp-sym in this: |
|||
trimmer_trg = trimmer_src |
|||
SearchState next = (this_trg, trimmer_trg, trimmer_preplus_next) |
|||
if not next in seen: todo.push(next) |
|||
if not this_trg in m: m[this_trg]=trimmed.newState() |
|||
trimmed.linkStates(m[this_src], m[this_trg], this_right, this_left) |
|||
trimmed.finals = m[this.finals] |
|||
</pre> |
|||
===#-type multiwords=== |
|||
The implementation: first "preprocess" (in memory) the bidix FST so it has the same format as the analyser, ie. instead of "take# out<vblex>" it has "take<vblex># out". |
The implementation: first "preprocess" (in memory) the bidix FST so it has the same format as the analyser, ie. instead of "take# out<vblex>" it has "take<vblex># out". |
||
Line 47: | Line 130: | ||
<vblex># in</pre> |
<vblex># in</pre> |
||
====Pseudocode for #-type multiwords==== |
|||
===pseudocode=== |
|||
<pre> |
<pre> |
||
def moveLemqsLast() |
def moveLemqsLast() |
||
Line 104: | Line 187: | ||
</pre> |
</pre> |
||
== + type multiwords == |
=== + type multiwords === |
||
join (<j>) is handled by setting the next trimmer (bidix) target state to be trimmer.initial, so we simply skip to the start. |
join (<j>) is handled by setting the next trimmer (bidix) target state to be trimmer.initial, so we simply skip to the start in the bidix. ''However'', + and # together can get a bit complicated, see below. |
||
== + and # together == |
|||
=== + and # together === |
|||
''However'', if we have a + followed by a # in monodix, we need to get back to where we were after reading the + part. An example: |
|||
If we have a + followed by a # in monodix, we need to get back to where we were after reading the + part. An example: |
|||
<pre>monodix, one entry: |
<pre>monodix, one entry: |
||
foo<n><ind>+bar<pr># fie |
foo<n><ind>+bar<pr># fie |
||
Line 141: | Line 225: | ||
===TODO=== |
===TODO=== |
||
* Would it make things faster if we first clear out the right-sides of the bidix and minimize that? |
* Would it make things faster if we first clear out the right-sides of the bidix and minimize that? |
||
* remove all those ifdef DEBUG's and wcerr's |
|||
* more tests? |
* more tests? |
||
* Print an error if monodix/bidix binaries don't exist |
|||
* if no objections: git svn dcommit |
|||
* The other tools actually allow for several # in a row (though it doesn't quite make sense), should we make sure lt-trim allows that? |
|||
<pre>$ echo '^foo<v>#bar<n>#fie$' |apertium-pretransfer |
|||
^foo#bar#fie<v><n>$ |
|||
</pre> |
|||
==See also== |
==See also== |
||
* [[Languages]] |
|||
* [[Automatically trimming a monodix#lttoolbox]] |
* [[Automatically trimming a monodix#lttoolbox]] |
||
[[Category:Tools]] |
Latest revision as of 02:25, 10 March 2018
lt-trim is the lttoolbox application responsible for trimming compiled dictionaries. The analyses (right-side when compiling lr) of analyser_binary are trimmed to the input side of bidix_binary (left-side when compiling lr, right-side when compiling rl), such that only analyses which would pass through `lt-proc -b bidix_binary' are kept.
Both compund tags (`<compound-only-L>', `<compound-R>') and join elements (`<j/>' in XML, `+' in the stream) and the group element (`<g/>' in XML, `#' in the stream) should be handled correctly, even combinations of + followed by # in monodix are handled.
One minor caveat: If you have the capitalised lemma "Foo" in the monodix, but "foo" in the bidix, an analysis "^Foo<tag>$" would pass through bidix when doing lt-proc -b, but will not make it through trimming. Make sure your lemmas have the same capitalisation in the different dictionaries.
Also, you should not have literal + or # in your lemmas. Since lt-comp doesn't escape these, lt-trim cannot handle them either, and you may get @-marked output this way. You can analyse + or # by having + or # in the <l> and some other string (e.g. "plus") in the <r>.
You should not trim a generator unless you have a very simple translator pipeline, since the output of bidix seldom goes unchanged through transfer.
Trimming is useful when monolingual data has been split out into its own package. We keep monolingual packages in the SVN module Languages.
Installation[edit]
This is part of the GitHub version of lttoolbox, see Installation.
Usage[edit]
$ lt-trim analyser_binary bidix_binary trimmed_analyser_binary
E.g. to trim ca-en.automorf.bin using ca-en.autobil.bin:
$ lt-trim ca-en.automorf.bin ca-en.autobil.bin ca-en.automorf-trimmed.bin
You should now be able to type
$ echo "bon" | lt-proc ca-en.automorf-trimmed.bin
and if the form "bon" has an analysis in ca-en.automorf.bin which would also pass through ca-en.autobil.bin, you'll get the same analysis out. If not, you'll get unknown word.
Implementation[edit]
lt-trim expects regular compiled analyser and bidix binaries as input, and produces a new analyser binary as output.
It loads analyser and bidix (named trimmer
in the code), and loops through all the analyser states, trying to do the same steps in parallel with the bidix. Transitions which are possible to do in both, are added to the trimmed
analyser.
First, the bidix binary is preprocessed a bit, in memory:
- all sections are unioned into one big section
- see
Transducer::unionWith
- see
- the final state gets a "loopback" transition to itself for every tag in the monodix
- see
Transducer::appendDotStar
andAlphabet::createLoopbackSymbols
- This is because
foo<n><ind>
should pass throughlt-proc -b bidix.bin
even if onlyfoo<n>
is actually in bidix
- see
- all lemq's (the # or
<g>
group elements) are moved so they are after the tags- monodix always has the # part after tags, while bidix has them on the lemma
- see
Transducer::moveLemqsLast
, and the below sections on multiwords
Then, in Transducer::intersect
we do a depth first traversal of the states of the monodix, building up a new trimmed monodix.
Pseudocode for intersect[edit]
Transducer trimmed; m = { initial: trimmed.initial } SearchState current = (initial, trimmer.initial, trimmer.initial) todo = [ current ] seen = {} while(todo): current = todo.pop() seen.insert(current) (this_src, trimmer_src, trimmer_preplus) = current for this_trg, this_left, this_right in transitions[this_src]: if this_right == '+': SearchState next = (this_trg, trimmer.initial, trimmer_src) if not next in seen: todo.push(next) if not this_trg in m: m[this_trg]=trimmed.newState() trimmed.linkStates(m[this_src], m[this_trg], this_right, this_left) else: if this_right == '#' and trimmer_preplus != trimmer_src: # We've seen a '#' after a '+', go back in trimmer: trimmer_src = trimmer_preplus for trimmer_trg, trimmer_left, trimmer_right in trimmer.transitions[trimmer_src]: if trimmer_preplus == trimmer_src: # we've not yet seen a '#' trimmer_preplus_next = trimmer_trg else: # we've seen a '#' trimmer_preplus_next = trimmer_preplus if trimmer_left==ε and this_right!=ε: # Just skip the epsilon in trimmer: SearchState next = (this_src, trimmer_trg, trimmer_preplus_next) if not next in seen: todo.push(next) elif this_right == trimmer_left or isCompoundSymbol(this_right) or this_right==ε: if isCompoundSymbol(this_right) or this_right==ε: # Just skip the epsilon/cmp-sym in this: trimmer_trg = trimmer_src SearchState next = (this_trg, trimmer_trg, trimmer_preplus_next) if not next in seen: todo.push(next) if not this_trg in m: m[this_trg]=trimmed.newState() trimmed.linkStates(m[this_src], m[this_trg], this_right, this_left) trimmed.finals = m[this.finals]
#-type multiwords[edit]
The implementation: first "preprocess" (in memory) the bidix FST so it has the same format as the analyser, ie. instead of "take# out<vblex>" it has "take<vblex># out".
Say you have these entries in the bidix:
take# out<vblex> take# out<n> take# in<vblex> take<n> take<vblex>
We do a depth-first traversal, and then after reading all of "take", we see t=transition(A, B, ε, #). We remove that transition, and instead add the result of copyWithTagsFirst(t), which returns a new transducer that represents these partial analyses:
<vblex># out <n># out <vblex># in
Pseudocode for #-type multiwords[edit]
def moveLemqsLast() new_t = Transducer() seen = set() todo = [initial] while todo: src = todo.pop() for left, right, trg in transitions[src]: if left == '#': new_t.transitions[src][epsilon].insertTransducer( copyWithTagsFirst(trg) ) else: new_t.linkStates(src, trg, left, right) if trg not in seen: todo.push(trg) seen.insert(trg) return new_t def copyWithTagsFirst(start): seen = set() new_t = Transducer() lemq = Transducer() # not shown below since it's quite trivial: states in new_t/lemq will get # different indices from corresponding states in this, so we need to record # those correspondences in a table in order to use the right one: states_this_new = { start: new_t.initial } states_this_lemq = { start: lemq.initial } # In the first loop, "finally" is filled with states from lemq and # new_t that need to be reconnected: finally = [] todo = [(start,start)] while todo: src,lemq_last = todo.pop() for left, right, trg in transitions[src]: if not isTag(left): lemq.linkStates(src, trg, label) if trg,trg not in seen: todo.push((trg, trg)) else: if src == lemq_last new_t.linkState(start, trg, epsilon) else: new_t.linkState(src, trg, label) if src in finals: finally.insert((src, lemq_last)) if src,lemq_last not in seen: todo.push((trg, lemq_last)) seen.insert((src, lemq_last)) # lemq_last was the last non-tag connected to the tag sequence that lead to lasttag: for lasttag, lemq_last in finally: newlemq = Transducer(lemq) # copy lemq newlemq.finals = set(lemq_last) # let lemq_last be the only final state in newlemq newlemq.minimize() new_t.insertTransducer(lasttag, newlemq) # append newlemq after finaltag return new_t
+ type multiwords[edit]
join (<j>) is handled by setting the next trimmer (bidix) target state to be trimmer.initial, so we simply skip to the start in the bidix. However, + and # together can get a bit complicated, see below.
+ and # together[edit]
If we have a + followed by a # in monodix, we need to get back to where we were after reading the + part. An example:
monodix, one entry: foo<n><ind>+bar<pr># fie bidix, two entries: foo# fie<n> bar<pr> preprocessed (prefixed+lemq-moved) bidix, two entries: foo<n>[<n>|<ind>|<pr>]*# fie bar<pr>[<n>|<ind>|<pr>]*
We start by matching on these paths:
monodix string foo<n><ind> + bar<pr> # fie monodix states g i j k bidix string foo<n>[<n>|<ind>|<pr>]* # fie bidix states l m
and get as far as state pair (i,m), having consumed
foo<n><ind>
before seeing a +.
Now, the next SearchState records both the next monodix state (j, the state after +), the next bidix state (l, the bidix initial state), and the bidix state we were in when we saw the plus (m, at the end of the bidix tag list). So in the code SearchState next = (this_trg, trimmer.initial, trimmer_src).
At (i,l,m) we've consumed
foo<n><ind>
and have
bar<pr># fie
left, and can read from bidix initial l:
monodix string bar<pr> # fie monodix states j k __bidix string bar<n>[<n>|<ind>|<pr>]* __bidix states l n
We match all of
bar<pr>
and then see a # in monodix, giving SearchState (k,n,m). Since both the third part of SearchState is set to state m (where m!=n), and we've seen a #, we jump back to that state m, from where we can match # fie
, and we're done.
TODO[edit]
- Would it make things faster if we first clear out the right-sides of the bidix and minimize that?
- more tests?
- Print an error if monodix/bidix binaries don't exist
- The other tools actually allow for several # in a row (though it doesn't quite make sense), should we make sure lt-trim allows that?
$ echo '^foo<v>#bar<n>#fie$' |apertium-pretransfer ^foo#bar#fie<v><n>$