Difference between revisions of "Lttoolbox"

From Apertium
(Several stylistic improvements to English text)
Revision as of 22:53, 20 December 2011

lttoolbox is a toolbox for lexical processing, morphological analysis and generation of words. Analysis is the process of splitting a word (e.g. cats) into its lemma 'cat' and the grammatical information <n><pl>. Generation is the opposite process.

The package is split into three programs, lt-comp, the compiler, lt-proc, the processor, and lt-expand, which generates all possible mappings between surface forms and lexical forms in the dictionary.

Creation

Main article: Monodix basics

Morphological analyser specification files, or morphological dictionaries, may be found in all of our language pair packages and in the incubator, or you may elect to create your own (see the page Monodix basics for instructions). You can also check out our list of dictionaries, which gives the name, location and number of entries of each dictionary.

Usage

Compilation

See also: Compiling dictionaries

Compilation into the binary format is achieved by means of the lt-comp program. You can compile a given .dix from left to right (LR), or from right to left (RL). Compiling LR usually creates an analyser, compiling RL usually creates a generator.[1]

Example

Compile the apertium-es-ca.ca.dix dictionary in a left-to-right manner into the binary ca.bin.

$ lt-comp lr apertium-es-ca.ca.dix ca.bin

Processing

There are two main modes of use for the processor (lt-proc): analysis (the default mode) and generation. Analysis converts a surface form into the set of possible lexical forms, while generation converts a lexical form into the corresponding surface form.

Analysis

After compiling the apertium-es-ca.ca.dix file left-to-right into ca.morf.bin, we can analyse Catalan:

Example
$ echo "prova" | lt-proc ca.morf.bin

^prova/prova<n><f><sg>/provar<vblex><pri><p3><sg>/provar<vblex><imp><p2><sg>$
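
The analyser output above packs the surface form and all of its analyses into one unit: it opens with ^, separates the fields with /, and closes with $. As a minimal sketch of how to pull such a unit apart in the shell, the snippet below hard-codes the sample output (so lt-proc is not needed to try it) and assumes a single unit with no escaped slashes; the variable names are illustrative only.

```shell
# Sample lt-proc output copied from the example above (hard-coded, so
# lt-proc itself is not needed to try this).
unit='^prova/prova<n><f><sg>/provar<vblex><pri><p3><sg>/provar<vblex><imp><p2><sg>$'

# Strip the leading ^ and the trailing $, then put each /-separated
# field on its own line.
fields=$(printf '%s\n' "$unit" | sed -e 's/^\^//' -e 's/\$$//' | tr '/' '\n')

# The first field is the surface form; the remaining fields are the analyses.
surface=$(printf '%s\n' "$fields" | head -n 1)
analyses=$(printf '%s\n' "$fields" | tail -n +2)

printf 'surface: %s\n' "$surface"
printf '%s\n' "$analyses"
```

Listing the analyses one per line makes it easier to see that "prova" is ambiguous between a noun reading and two verb readings.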

Generation

After compiling the same dictionary right-to-left into ca.gen.bin, we can generate:

Example
$ echo "^prova<n><f><pl>$"  | lt-proc -g ca.gen.bin

proves

Expansion

Sometimes you want to be able to see the complete output of the dictionary — i.e., all of the mappings between lexical and surface forms. For this you can use the lt-expand tool. This output is often useful in finding bugs in the assignment of paradigms, etc.

Example

Here are the first ten lines that are produced as output from the command to expand the Catalan dictionary in the apertium-es-ca pair. (At last count, the total length of the output was over 2.3 million lines.)

$ lt-expand apertium-es-ca.ca.dix 

abdominals:abdominal<adj><mf><pl>
abdominal:abdominal<adj><mf><sg>
absents:absent<adj><mf><pl>
absent:absent<adj><mf><sg>
absolutes:absolut<adj><f><pl>
absoluta:absolut<adj><f><sg>
absoluts:absolut<adj><m><pl>
absolut:absolut<adj><m><sg>
abstractes:abstracte<adj><mf><pl>
abstracta:abstracte<adj><f><sg>
Note

You cannot run lt-expand directly on a .dix.xml file. In some pairs (for example apertium-en-af), the .dix.xml file keeps its symbols in a separate file that is pulled in with XInclude, so you first need to resolve the includes with xmllint:

$ xmllint --xinclude apertium-en-af.af.dix.xml > apertium-en-af.af.dix

Then run lt-expand on the apertium-en-af.af.dix file.
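
Because the expanded output is plain text with one mapping per line, ordinary text tools can summarise it. The following is a rough sketch, not part of lttoolbox, that counts how many surface forms each lemma has. It hard-codes a few lines from the Catalan sample above and assumes every line has the simple surface:lexical shape shown there.

```shell
# Sample lines copied from the lt-expand output shown earlier.
expanded='abdominals:abdominal<adj><mf><pl>
abdominal:abdominal<adj><mf><sg>
absolutes:absolut<adj><f><pl>
absoluta:absolut<adj><f><sg>
absoluts:absolut<adj><m><pl>
absolut:absolut<adj><m><sg>'

# Keep the lexical form (the part after the colon), cut it at the first
# tag to leave only the lemma, then count the lines per lemma.
counts=$(printf '%s\n' "$expanded" \
  | cut -d: -f2 \
  | sed 's/<.*//' \
  | sort \
  | uniq -c \
  | awk '{ print $2, $1 }')

printf '%s\n' "$counts"
```

With a real dictionary, the hard-coded sample would be replaced by piping the output of lt-expand into the same chain; a lemma with an unexpectedly low or high count is often a sign of a wrongly assigned paradigm.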

Troubleshooting

Empty left side

If you get a message like:

Error: Invalid dictionary (hint: the left side of an entry is empty)

Try searching for empty left sides in your dictionary by using lt-expand and grep. For example, in the Icelandic dictionary,

$ lt-expand apertium-fo-is.is.dix  | grep ^:
:kunna<vblex><imp><p2><sg>
:kunna<vblex><imp><p1><pl>
:kunna<vblex><imp><p2><pl>

The empty left side will look something like:

      <e>
        <p>
          <l></l>
          <r>kunna<s n="vblex"/><s n="imp"/><s n="p2"/><s n="pl"/></r>
        </p>
      </e>

A paradigm entry cannot have an empty left side when the main section entry that uses the paradigm has nothing in its invariant (<i>) section, since the resulting surface form would be empty, e.g.

    <e lm="kunna"><i></i><par n="/kunna__vblex"/></e>

This means you should look for the "kunna" verb; where the left side is empty, you should either put something there or add something to the invariant section.
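
When the grep output is long, it can help to reduce it to the list of lemmas that need attention. This is a minimal sketch under the assumption that the lemma is everything between the leading colon and the first tag. The sample hard-codes the three lines from the Icelandic example, plus one hypothetical well-formed line (invented here for illustration) that the filter should skip.

```shell
# Three lines copied from the grep output above, plus one hypothetical
# well-formed line (not from a real dictionary) that must be ignored.
expanded=':kunna<vblex><imp><p2><sg>
:kunna<vblex><imp><p1><pl>
:kunna<vblex><imp><p2><pl>
kunna:kunna<vblex><inf>'

# Keep only the lines with an empty left side, drop the leading colon
# and the tags, and list each affected lemma once.
lemmas=$(printf '%s\n' "$expanded" \
  | grep '^:' \
  | sed -e 's/^://' -e 's/<.*//' \
  | sort -u)

printf '%s\n' "$lemmas"
```

Each lemma printed is an entry to look up in the .dix file.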

Speed

$ yes word | head -n 1000000 > /tmp/foo

$ head /tmp/foo
word
word
word
...

$ wc -l /tmp/foo
1000000 /tmp/foo

$ time cat /tmp/foo | lt-proc en-ca.automorf.bin >/dev/null

real	0m17.606s
user	0m17.281s
sys	0m0.036s

Roughly 56,800 words per second (1,000,000 words in 17.606 s)
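
Throughput here is simply the number of lines divided by the elapsed real time. A minimal sketch with the numbers from the run above hard-coded (so neither lt-proc nor the test file is needed); the variable names are illustrative. The result shifts by a couple of thousand words per second depending on whether the elapsed time is first rounded to whole seconds.

```shell
words=1000000      # line count reported by wc -l above
seconds=17.606     # "real" time reported by time above

# The shell only does integer arithmetic, so let awk do the division
# and round to the nearest whole word.
wps=$(awk -v w="$words" -v s="$seconds" 'BEGIN { printf "%.0f", w / s }')

printf '%s words per second\n' "$wps"
```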

Using as a library

See Lttoolbox API for how to analyse and generate words with lttoolbox from C++ or Python.

Wishlist

  • Being able to have multichar symbols/tags without '<' and '>'

See also

Notes

  1. In all current linguistic packages, the left-to-right direction of compilation is analysis, whereas the right-to-left direction is generation. This is not, however, a software restriction.