lttoolbox is a toolbox for lexical processing, morphological analysis and generation of words. Analysis is the process of splitting a word (e.g. cats) into its lemma 'cat' and the grammatical information
<n><pl>. Generation is the opposite process.
The package is split into three programs,
lt-comp, the compiler,
lt-proc, the processor, and
lt-expand, which generates all possible mappings between surface forms and lexical forms in the dictionary.
- Main article: Monodix basics
Morphological analyser specification files, or morphological dictionaries, may be found in all of our language pair packages and in the incubator, or you may elect to create your own (more instructions at the page Monodix basics). You can also check out our list of dictionaries, which has statistics on the names, locations and number of entries of each of the dictionaries.
- See also: Compiling dictionaries
Compilation into the binary format is achieved by means of the lt-comp program. You can compile a given .dix from left to right (LR), or from right to left (RL). Compiling LR usually creates an analyser; compiling RL usually creates a generator. For example, to compile the apertium-es-ca.ca.dix dictionary in a left-to-right manner into the binary file ca.bin:
$ lt-comp lr apertium-es-ca.ca.dix ca.bin
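The two compilation directions can be sketched as a small Python helper that builds the lt-comp command lines. This is only an illustration: the helper name lt_comp_command is hypothetical, and the output names ca.morf.bin and ca.gen.bin are the ones used in the examples further down this page. The sketch only prints the commands; it does not run them.

```python
def lt_comp_command(direction, dix_path, bin_path):
    """Build the argument list for one lt-comp call (hypothetical helper)."""
    if direction not in ("lr", "rl"):
        raise ValueError("direction must be 'lr' or 'rl'")
    return ["lt-comp", direction, dix_path, bin_path]


# lr usually gives an analyser, rl usually gives a generator.
commands = [
    lt_comp_command("lr", "apertium-es-ca.ca.dix", "ca.morf.bin"),
    lt_comp_command("rl", "apertium-es-ca.ca.dix", "ca.gen.bin"),
]

for command in commands:
    print(" ".join(command))
```

To actually compile, each list could be passed to subprocess.run(command, check=True) on a machine where lt-comp and the dictionary file are available.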
There are two main modes of use for the processor (
lt-proc), analysis (which is the default mode) and generation. Analysis converts surface forms into the set of possible lexical forms, while generation converts a lexical form into the corresponding surface form.
After compiling the
apertium-es-ca.ca.dix file left-to-right into
ca.morf.bin, we can analyse Catalan:
$ echo "prova" | lt-proc ca.morf.bin
^prova/prova<n><f><sg>/provar<vblex><pri><p3><sg>/provar<vblex><imp><p2><sg>$
And compiling it right-to-left, we can generate:
$ echo "^prova<n><f><pl>$" | lt-proc -g ca.gen.bin
proves
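The analysis output shown earlier has the shape ^surface/analysis/analysis$, where each analysis is a lemma followed by its tags. As a rough sketch of how such a unit could be taken apart in Python (the function name parse_analysis is hypothetical, and escaped characters or unknown-word markers are not treated specially here):

```python
import re


def parse_analysis(unit):
    """Split one analysis unit into (surface, [(lemma, [tags]), ...])."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit of the form ^surface/analysis/...$")
    # The first field is the surface form, the rest are the lexical forms.
    surface, *analyses = unit[1:-1].split("/")
    parsed = []
    for analysis in analyses:
        lemma = analysis.split("<", 1)[0]
        tags = re.findall(r"<([^<>]+)>", analysis)
        parsed.append((lemma, tags))
    return surface, parsed


surface, readings = parse_analysis(
    "^prova/prova<n><f><sg>/provar<vblex><pri><p3><sg>/provar<vblex><imp><p2><sg>$"
)
for lemma, tags in readings:
    print(surface, "->", lemma, tags)
```

For the Catalan example this yields one noun reading of prova and two verb readings of provar.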
Sometimes you want to be able to see the complete output of the dictionary — i.e., all of the mappings between lexical and surface forms. For this you can use the
lt-expand tool. This output is often useful in finding bugs in the assignment of paradigms, etc.
Here are the first ten lines that are produced as output from the command to expand the Catalan dictionary in the
apertium-es-ca pair. (At last count, the total length of the output was over 2.3 million lines.)
$ lt-expand apertium-es-ca.ca.dix
abdominals:abdominal<adj><mf><pl>
abdominal:abdominal<adj><mf><sg>
absents:absent<adj><mf><pl>
absent:absent<adj><mf><sg>
absolutes:absolut<adj><f><pl>
absoluta:absolut<adj><f><sg>
absoluts:absolut<adj><m><pl>
absolut:absolut<adj><m><sg>
abstractes:abstracte<adj><mf><pl>
abstracta:abstracte<adj><f><sg>
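When checking paradigm assignments, it can help to group the expanded forms by lemma. The following is a minimal sketch, assuming every line has the plain surface:lexical shape shown above (lines carrying any additional markers are not handled); the function name group_by_lemma and the SAMPLE text are illustrative only.

```python
from collections import defaultdict

# A few lines copied from the lt-expand output above.
SAMPLE = """\
abdominals:abdominal<adj><mf><pl>
abdominal:abdominal<adj><mf><sg>
absolutes:absolut<adj><f><pl>
absoluta:absolut<adj><f><sg>
absoluts:absolut<adj><m><pl>
absolut:absolut<adj><m><sg>
"""


def group_by_lemma(expanded):
    """Map each lemma to its list of (surface form, lexical form) pairs."""
    forms = defaultdict(list)
    for line in expanded.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only: surface form left, lexical form right.
        surface, lexical = line.split(":", 1)
        lemma = lexical.split("<", 1)[0]
        forms[lemma].append((surface, lexical))
    return dict(forms)


groups = group_by_lemma(SAMPLE)
for lemma, pairs in groups.items():
    print(lemma, len(pairs), [surface for surface, _ in pairs])
```

A lemma with fewer or more forms than expected for its part of speech is a hint that the wrong paradigm was assigned.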
You cannot run lt-expand directly on a
.dix.xml file. The
.dix files in (for example) the
apertium-en-af pair have their symbols in a separate file. You need to first run
$ xmllint --xinclude apertium-en-af.af.dix.xml > apertium-en-af.af.dix
You can then run lt-expand on the resulting file.
- Empty left side
If you get a message like:
Error: Invalid dictionary (hint: the left side of an entry is empty)
Try searching for empty left sides in your dictionary by using
grep. For example, in the Icelandic dictionary,
$ lt-expand apertium-fo-is.is.dix | grep ^:
:kunna<vblex><imp><p2><sg>
:kunna<vblex><imp><p1><pl>
:kunna<vblex><imp><p2><pl>
The empty left side will look something like:
<e>
  <p>
    <l></l>
    <r>kunna<s n="vblex"/><s n="imp"/><s n="p2"/><s n="pl"/></r>
  </p>
</e>
A paradigm cannot have an empty left side if the main section entry that uses it has no, or an empty, invariant (<i>) section, e.g.
<e lm="kunna"><i></i><par n="/kunna__vblex"/></e>
This means you should look for the "kunna" verb; where the left side is empty, you should either put something there or add something to the invariant section.
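Instead of grepping the expanded output, the dictionary XML itself can be scanned for empty left sides. The sketch below uses Python's standard xml.etree.ElementTree on a tiny hypothetical dictionary fragment (SAMPLE_DIX is made up for illustration, and the function name empty_left_sides is not part of lttoolbox). It assumes the file has already been made self-contained, for example with xmllint --xinclude.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the "kunna" example above.
SAMPLE_DIX = """\
<dictionary>
  <pardefs>
    <pardef n="/kunna__vblex">
      <e><p><l>kunna</l><r>kunna<s n="vblex"/><s n="inf"/></r></p></e>
      <e><p><l></l><r>kunna<s n="vblex"/><s n="imp"/><s n="p2"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
</dictionary>
"""


def empty_left_sides(dix_xml):
    """Return (paradigm name, right-side text, tags) for each empty <l>."""
    root = ET.fromstring(dix_xml)
    hits = []
    for pardef in root.iter("pardef"):
        for pair in pardef.iter("p"):
            left = pair.find("l")
            if left is None:
                continue
            # Empty means: no text and no child elements inside <l>.
            if (left.text or "").strip() or len(left) > 0:
                continue
            right = pair.find("r")
            text = "".join(right.itertext()).strip() if right is not None else ""
            tags = [s.get("n") for s in right.iter("s")] if right is not None else []
            hits.append((pardef.get("n"), text, tags))
    return hits


hits = empty_left_sides(SAMPLE_DIX)
for name, text, tags in hits:
    print(f"empty left side in paradigm {name}: {text} {tags}")
```

Each hit names the paradigm to inspect; whether it is a real problem still depends on whether the entries using that paradigm have an invariant part.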
$ yes word | head -1000000 > /tmp/foo
$ head /tmp/foo
word
word
word
...
$ wc -l /tmp/foo
1000000 /tmp/foo
$ time cat /tmp/foo | lt-proc en-ca.automorf.bin >/dev/null
real 0m17.606s
user 0m17.281s
sys 0m0.036s
About 56,800 words / second (1,000,000 words in 17.606 s)
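The throughput figure can be recomputed from the timing output. As a small sketch, assuming the 1,000,000 lines reported by wc -l and the wall-clock ("real") time in the minutes-and-seconds format printed by time (both function names are illustrative only):

```python
import re


def seconds_from_time(field):
    """Convert a duration such as '0m17.606s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", field)
    if match is None:
        raise ValueError(f"unrecognised duration: {field!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def words_per_second(words, real):
    """Words processed per second of wall-clock time."""
    return words / seconds_from_time(real)


rate = words_per_second(1_000_000, "0m17.606s")
print(f"{rate:,.0f} words / second")
```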
Using as a library
See Lttoolbox API for how to analyse and generate words with lttoolbox from C++ or Python.
- Being able to have multichar symbols/tags without '<' and '>'
- Monodix basics
- Using an lttoolbox dictionary
- lttoolbox and lexc
- Basic lttoolbox example
- In all current linguistic packages, the left-to-right direction of compilation is analysis, whereas the right-to-left direction is generation. This is not, however, a software restriction.