Difference between revisions of "Lt-trim"
Revision as of 08:27, 11 February 2014
lt-trim is the lttoolbox application responsible for trimming compiled dictionaries. The analyses (right-side when compiling lr) of analyser_binary are trimmed to the input side of bidix_binary (left-side when compiling lr, right-side when compiling rl), such that only analyses which would pass through `lt-proc -b bidix_binary' are kept.
Compound tags (`<compound-only-L>', `<compound-R>'), join elements (`<j/>' in XML, `+' in the stream) and the group element (`<g/>' in XML, `#' in the stream) should all be handled correctly; even combinations of `+' followed by `#' in the monodix are handled.
One minor caveat: if you have the capitalised lemma "Foo" in the monodix but "foo" in the bidix, an analysis "^Foo<tag>$" would pass through the bidix when running lt-proc -b, but will not make it through trimming. Make sure your lemmas have the same capitalisation across the dictionaries.
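A rough way to spot such capitalisation mismatches before trimming is sketched below. It is not part of lttoolbox: the function name case_mismatches is hypothetical, and it assumes you have already extracted the lemmas of each dictionary into plain text files, one lemma per line. It reports monodix lemmas that have no exact match in the bidix list but do match once case is ignored.

```shell
# Hypothetical helper, not an lttoolbox tool.
# $1: file with monodix lemmas, one per line (assumed to exist already)
# $2: file with bidix lemmas, one per line (assumed to exist already)
case_mismatches() {
    awk '
        # First file read (the bidix list): remember exact and lowercased forms.
        NR == FNR { exact[$0] = 1; lower[tolower($0)] = $0; next }
        # Second file (the monodix list): report lemmas that differ only in case.
        !($0 in exact) && (tolower($0) in lower) {
            printf "%s (monodix) vs %s (bidix)\n", $0, lower[tolower($0)]
        }
    ' "$2" "$1"
}
```

For example, `case_mismatches monodix-lemmas.txt bidix-lemmas.txt` (both file names are placeholders). Note that tolower in some awk implementations only folds ASCII letters, so lemmas with accented capitals may need a different tool.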
You should not trim a generator unless you have a very simple translator pipeline, since the output of bidix seldom goes unchanged through transfer.
Usage
$ lt-trim analyser_binary bidix_binary trimmed_analyser_binary
E.g. to trim ca-en.automorf.bin using ca-en.autobil.bin:
$ lt-trim ca-en.automorf.bin ca-en.autobil.bin ca-en.automorf-trimmed.bin
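If you call lt-trim from a build script, a small wrapper can guard against the most obvious mistakes. The following is only a sketch: the function name trim_analyser is hypothetical, and it assumes nothing about lt-trim beyond the three-argument invocation shown above. It checks that both inputs are readable and refuses to write the trimmed analyser over the untrimmed one.

```shell
# Hypothetical wrapper around the three-argument lt-trim call shown above.
trim_analyser() {
    analyser=$1 bidix=$2 out=$3
    # Both compiled dictionaries must be readable.
    for f in "$analyser" "$bidix"; do
        [ -r "$f" ] || { echo "cannot read $f" >&2; return 1; }
    done
    # Never overwrite the untrimmed analyser in place.
    [ "$analyser" != "$out" ] || { echo "output would overwrite input" >&2; return 1; }
    lt-trim "$analyser" "$bidix" "$out"
}
```

It is used like lt-trim itself, e.g. `trim_analyser ca-en.automorf.bin ca-en.autobil.bin ca-en.automorf-trimmed.bin`.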