Lttoolbox

lttoolbox is a toolbox for lexical processing, morphological analysis and generation of words. Analysis is the process of splitting a word (e.g. cats) into its lemma (cat) and its grammatical information (<n><pl>). Generation is the opposite process.

The package is split into three programs: lt-comp, the compiler; lt-proc, the processor; and lt-expand, which generates all possible mappings between surface forms and lexical forms in the dictionary.

Creation

Main article: Monodix basics

Morphological analyser specification files, or morphological dictionaries, may be found in all of our language pair packages and in the incubator, or you may elect to create your own (see the page Monodix basics for instructions). You can check our list of dictionaries for a brief inventory.

Compilation

Compilation into the binary format is done with the lt-comp program. You can compile a given .dix file left-to-right (LR) or right-to-left (RL). Compiling LR usually creates an analyser; compiling RL usually creates a generator.[1]

Example

Compile the apertium-es-ca.ca.dix dictionary in a left-to-right manner into the binary ca.bin.

$ lt-comp lr apertium-es-ca.ca.dix ca.bin
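Since a generator is usually compiled from the same dictionary in the other direction, both binaries can be built in one step. The following is a minimal sketch, assuming the output names ca.morf.bin and ca.gen.bin used later on this page. The helper compile_both and the DRY_RUN variable are hypothetical conveniences for illustration and are not part of lttoolbox.

```shell
#!/bin/sh
# compile_both DIX PREFIX: compile one dictionary in both directions.
# compile_both and DRY_RUN are illustrative names, not part of lttoolbox.
compile_both() {
    dix=$1
    prefix=$2
    for dir in lr rl; do
        # lr usually gives an analyser, rl usually gives a generator
        if [ "$dir" = lr ]; then
            out="$prefix.morf.bin"
        else
            out="$prefix.gen.bin"
        fi
        if [ -n "$DRY_RUN" ]; then
            # Only show the command that would be run
            echo "lt-comp $dir $dix $out"
        else
            lt-comp "$dir" "$dix" "$out" || return 1
        fi
    done
}

# Print the two commands without running them
DRY_RUN=1 compile_both apertium-es-ca.ca.dix ca
```

Without DRY_RUN set, the function calls lt-comp itself and stops at the first failure.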

Processing

There are two main modes of use for the processor (lt-proc): analysis, which is the default mode, and generation. Analysis converts a surface form into the set of its possible lexical forms, while generation converts a lexical form into the corresponding surface form.

Analysis

After compiling the apertium-es-ca.ca.dix file left-to-right into ca.morf.bin, we can analyse Catalan:

Example
$ echo "prova" | lt-proc ca.morf.bin

^prova/prova<n><f><sg>/provar<vblex><pri><p3><sg>/provar<vblex><imp><p2><sg>$
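In the output above, ^ opens the unit, the surface form comes first, each analysis follows after a /, and $ closes the unit. To read the analyses one per line, the unit can be split with standard text tools. This is a rough sketch only: the function name list_analyses is made up for this example, and it assumes one unit per line and no escaped / characters inside the forms.

```shell
#!/bin/sh
# list_analyses: read units like ^surface/analysis1/analysis2$ from stdin
# and print each analysis on its own line.
# Simplification: assumes one unit per line and no escaped "/" characters.
list_analyses() {
    # Strip the leading ^ and trailing $, drop the surface form (field 1),
    # then put each remaining analysis on a separate line.
    sed -e 's/^\^//' -e 's/\$$//' | cut -d/ -f2- | tr '/' '\n'
}

echo '^prova/prova<n><f><sg>/provar<vblex><pri><p3><sg>/provar<vblex><imp><p2><sg>$' | list_analyses
```

With a compiled analyser at hand, the same function could be placed at the end of the pipeline, after lt-proc ca.morf.bin.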

Generation

After compiling the same dictionary right-to-left into ca.gen.bin, we can generate:

Example
$ echo "^prova<n><f><pl>$" | lt-proc -g ca.gen.bin

proves
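The input to generation is a lemma followed by its tags, wrapped in ^ and $. When building such input in a script, printf with a single-quoted format avoids any shell expansion of the $ character. A small sketch, where make_lu is a hypothetical helper name and not an lttoolbox command:

```shell
#!/bin/sh
# make_lu LEMMA TAG...: print a lexical unit such as ^prova<n><f><pl>$
# make_lu is a hypothetical helper name, not an lttoolbox command.
make_lu() {
    lemma=$1
    shift
    tags=
    # Wrap every remaining argument in angle brackets
    for tag in "$@"; do
        tags="$tags<$tag>"
    done
    printf '^%s%s$\n' "$lemma" "$tags"
}

make_lu prova n f pl
```

The result can be piped into lt-proc -g ca.gen.bin in place of the echo command shown above.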

Expansion

Sometimes you want to see the complete output of the dictionary, that is, all of the mappings between lexical and surface forms. For this you can use the lt-expand tool. This output is often useful for finding bugs such as wrongly assigned paradigms.

Example

The following command expands the Catalan dictionary in the apertium-es-ca pair. Only the first 10 lines of output are shown; at the last count, the full output was over 2.3 million lines long.

$ lt-expand apertium-es-ca.ca.dix

abdominals:abdominal<adj><mf><pl>
abdominal:abdominal<adj><mf><sg>
absents:absent<adj><mf><pl>
absent:absent<adj><mf><sg>
absolutes:absolut<adj><f><pl>
absoluta:absolut<adj><f><sg>
absoluts:absolut<adj><m><pl>
absolut:absolut<adj><m><sg>
abstractes:abstracte<adj><mf><pl>
abstracta:abstracte<adj><f><sg>
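Because each line has the shape surface:lexical, the expansion is easy to filter when checking a single paradigm. The sketch below prints every line whose lexical form has a given lemma. The function name forms_of is made up for this example, and it assumes that neither the surface form nor the lexical form contains a colon.

```shell
#!/bin/sh
# forms_of LEMMA: read "surface:lexical" lines (as printed by lt-expand)
# from stdin and print those whose lexical form has the given lemma.
# Simplification: assumes no ":" inside surface or lexical forms.
forms_of() {
    awk -F: -v lemma="$1" '
        {
            lexical = $NF                      # text after the last colon
            tagstart = index(lexical, "<")     # lemma ends where tags begin
            if (tagstart > 0 && substr(lexical, 1, tagstart - 1) == lemma)
                print
        }'
}

# Sample lines copied from the output above
printf '%s\n' \
    'absents:absent<adj><mf><pl>' \
    'absolutes:absolut<adj><f><pl>' \
    'absoluta:absolut<adj><f><sg>' \
    'absoluts:absolut<adj><m><pl>' \
    'absolut:absolut<adj><m><sg>' | forms_of absolut
```

On the sample lines this keeps the four absolut entries and drops the absent one. On a real dictionary the input would come from lt-expand instead of printf.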

Notes

  1. In all current linguistic packages, left-to-right is analysis and right-to-left is generation. This is not, however, a software restriction.