Difference between revisions of "Lttoolbox"
Revision as of 09:24, 23 November 2009
lttoolbox is a toolbox for lexical processing, morphological analysis and generation of words. Analysis is the process of splitting a word (e.g. cats) into its lemma (cat) and its grammatical information (<n><pl>). Generation is the opposite process.
The package is split into three programs: lt-comp, the compiler; lt-proc, the processor; and lt-expand, which generates all possible mappings between surface forms and lexical forms in the dictionary.
Creation
- Main article: Monodix basics
Morphological analyser specification files, or morphological dictionaries, may be found in all of our language pair packages and in the incubator, or you may elect to create your own (more instructions at the page Monodix basics). You can also check out our list of dictionaries, which has statistics on the names, locations and number of entries of each of the dictionaries.
Usage
Compilation
- See also: Compiling dictionaries
Compilation into the binary format is achieved by means of the lt-comp program. You can compile a given .dix from left-to-right (LR), or from right-to-left (RL). Compiling LR usually creates an analyser; compiling RL usually creates a generator.[1]
- Example
Compile the apertium-es-ca.ca.dix dictionary in a left-to-right manner into the binary ca.bin.
$ lt-comp lr apertium-es-ca.ca.dix ca.bin
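Since an analyser and a generator are usually built from the same dictionary, the two compilations can be wrapped in one small shell function. This is only a minimal sketch: the function name compile_both and the LT_COMP override variable are illustrative assumptions, not part of lttoolbox, and the output names simply follow the ca.morf.bin / ca.gen.bin pattern used on this page.

```shell
# Hypothetical helper: compile one .dix in both directions.
#   $1 = dictionary file, $2 = prefix for the binaries
# LT_COMP is an assumed override (e.g. LT_COMP=echo for a dry run);
# by default the real lt-comp is called.
compile_both() {
    "${LT_COMP:-lt-comp}" lr "$1" "$2.morf.bin" &&   # left-to-right: analyser
    "${LT_COMP:-lt-comp}" rl "$1" "$2.gen.bin"       # right-to-left: generator
}

# Usage:
#   compile_both apertium-es-ca.ca.dix ca
```

Chaining with && means the generator is only built if the analyser compiled without error.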
Processing
There are two main modes of use for the processor (lt-proc): analysis (which is the default mode) and generation. Analysis converts surface forms into the set of possible lexical forms, while generation converts a lexical form into the corresponding surface form.
Analysis
After compiling the apertium-es-ca.ca.dix file left-to-right into ca.morf.bin, we can analyse Catalan:
- Example
$ echo "prova" | lt-proc ca.morf.bin
^prova/prova<n><f><sg>/provar<vblex><pri><p3><sg>/provar<vblex><imp><p2><sg>$
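The analyser packs the surface form and every candidate lexical form into a single unit, delimited by ^ and $ and separated by /. To inspect the ambiguity more comfortably, the unit can be split into one reading per line. The sketch below assumes one unit per input line and no escaped / characters; split_readings is a hypothetical name, not an lttoolbox command.

```shell
# Hypothetical helper: print the surface form, then each lexical form,
# on separate lines. Assumes one ^...$ unit per line of input.
split_readings() {
    sed -e 's/^\^//' -e 's/\$$//' |   # drop the leading ^ and trailing $
    tr '/' '\n'                       # one field per line
}

# Usage:
#   echo "prova" | lt-proc ca.morf.bin | split_readings
```

The first line of the result is the surface form; each following line is one possible analysis.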
Generation
And compiling it right-to-left, we can generate:
- Example
$ echo "^prova<n><f><pl>$" | lt-proc -g ca.gen.bin
proves
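The generator expects its input as a lexical form wrapped in ^ and $, with each tag in angle brackets. When trying several tag combinations by hand, a tiny helper that assembles this string avoids quoting mistakes. This is a sketch; make_lu is a hypothetical name introduced here for illustration.

```shell
# Hypothetical helper: build a lexical unit from a lemma and a list of tags.
#   make_lu prova n f pl   prints   ^prova<n><f><pl>$
make_lu() {
    lemma="$1"
    shift
    tags=""
    for t in "$@"; do
        tags="$tags<$t>"          # wrap every tag in angle brackets
    done
    printf '^%s%s$\n' "$lemma" "$tags"
}

# Usage:
#   make_lu prova n f pl | lt-proc -g ca.gen.bin
```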
Expansion
Sometimes you want to see the complete output of the dictionary, that is, all of the mappings between lexical and surface forms. For this you can use the lt-expand tool. This output is often useful for finding bugs in the assignment of paradigms, etc.
- Example
The command to expand the Catalan dictionary in the apertium-es-ca pair, along with the first 10 lines of output. At the last count, the total length of the output was over 2.3 million lines.
$ lt-expand apertium-es-ca.ca.dix
abdominals:abdominal<adj><mf><pl>
abdominal:abdominal<adj><mf><sg>
absents:absent<adj><mf><pl>
absent:absent<adj><mf><sg>
absolutes:absolut<adj><f><pl>
absoluta:absolut<adj><f><sg>
absoluts:absolut<adj><m><pl>
absolut:absolut<adj><m><sg>
abstractes:abstracte<adj><mf><pl>
abstracta:abstracte<adj><f><sg>
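With millions of lines, the expanded dictionary is easier to check in summary form. One possible approach, sketched here, is to count entries by the first tag after the lemma (usually the part of speech). The function name count_by_first_tag is hypothetical, and the sketch assumes each line has the surface:lemma<tags> shape shown above.

```shell
# Hypothetical helper: count expanded entries per first tag.
# Reads lt-expand output on stdin, prints "tag count" sorted by tag.
count_by_first_tag() {
    awk 'match($0, /<[^<>]+>/) {
             # strip the surrounding < and > from the matched tag
             count[substr($0, RSTART + 1, RLENGTH - 2)]++
         }
         END { for (tag in count) print tag, count[tag] }' | sort
}

# Usage:
#   lt-expand apertium-es-ca.ca.dix | count_by_first_tag
```

A tag with a surprisingly high or low count is often a hint that a paradigm has been assigned wrongly.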
- Note
You cannot run lt-expand directly on a .dix.xml file. The .dix files in (for example) the apertium-cy-en pair have their symbols in a separate file. You need to first run xmllint:
$ xmllint --xinclude apertium-cy-en.cy.dix.xml > apertium-cy-en.cy.dix
Then run lt-expand on the apertium-cy-en.cy.dix file.
Troubleshooting
- Empty left side
If you get a message like:
Error: Invalid dictionary (hint: the left side of an entry is empty)
Try searching for empty left sides in your dictionary by using lt-expand and grep. For example, in the Icelandic dictionary:
$ lt-expand apertium-fo-is.is.dix | grep ^:
:kunna<vblex><imp><p2><sg>
:kunna<vblex><imp><p1><pl>
:kunna<vblex><imp><p2><pl>
The empty left side will look something like:
<e>
  <p>
    <l></l>
    <r>kunna<s n="vblex"/><s n="imp"/><s n="p2"/><s n="pl"/></r>
  </p>
</e>
It is not possible to have an empty left side in a paradigm if you have no invariant (<i>) section in the main section entry, e.g.
<e lm="kunna"><i></i><par n="/kunna__vblex"/></e>
This means you should go and look for the "kunna" verb and see where the empty left side is, and put something there, or add something to the invariant section.
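When many entries are affected, it helps to reduce the grep output to a list of lemmas to look up. The following is a sketch under the assumption that the offending lines have the :lemma<tags> shape shown above; empty_left_lemmas is a hypothetical name.

```shell
# Hypothetical helper: list each lemma that has an empty left (surface) side.
# Reads lt-expand output on stdin.
empty_left_lemmas() {
    grep '^:' |                         # keep lines whose surface form is empty
    sed -e 's/^://' -e 's/<.*//' |      # drop the leading colon and all tags
    sort -u                             # report each lemma once
}

# Usage:
#   lt-expand apertium-fo-is.is.dix | empty_left_lemmas
```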
Speed
$ yes word | head -1000000 > /tmp/foo
$ head /tmp/foo
word
word
word
...
$ wc -l /tmp/foo
1000000 /tmp/foo
$ time cat /tmp/foo | lt-proc en-ca.automorf.bin >/dev/null
real 0m17.606s
user 0m17.281s
sys 0m0.036s

Roughly 58,823 words / second (1,000,000 words in about 17 seconds).
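The throughput figure is simply the number of input words divided by the elapsed time. A minimal sketch of that arithmetic follows; words_per_second is a hypothetical helper name. With the elapsed time rounded down to 17 whole seconds it reproduces the 58,823 figure, while the full real time of 17.606 s gives a somewhat lower rate.

```shell
# Hypothetical helper: integer words-per-second rate.
#   $1 = number of words, $2 = elapsed seconds (may be fractional)
words_per_second() {
    awk -v words="$1" -v secs="$2" 'BEGIN { printf "%d\n", words / secs }'
}

# Usage:
#   words_per_second 1000000 17
```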
See also
Notes
- [1] In all current linguistic packages, left to right is analysis, and right to left is generation. This is not, however, a software restriction.