Difference between revisions of "Lttoolbox"
Hectoralos (talk | contribs) |
|||
(32 intermediate revisions by 7 users not shown) | |||
Line 1: | Line 1: | ||
[[Lttoolbox (français)]] |
|||
{{TOCD}} |
{{TOCD}} |
||
'''lttoolbox''' is a toolbox for lexical processing, [[morphological analysis]] and generation of words. |
'''lttoolbox''' is a toolbox for lexical processing, [[morphological analysis]] and generation of words. ''Analysis'' is the process of splitting a word (e.g. cats) into its lemma 'cat' and the grammatical information <code><n><pl></code>. ''Generation'' is the opposite process. |
||
The package is split into three programs, <code>lt-comp</code>, the compiler, <code>lt-proc</code>, the processor, and <code>lt-expand</code>, which generates all possible mappings between [[surface form]]s and [[lexical form]]s in the dictionary. |
The package is split into three programs, <code>lt-comp</code>, the compiler, <code>lt-proc</code>, the processor, and <code>lt-expand</code>, which generates all possible mappings between [[surface form]]s and [[lexical form]]s in the dictionary. |
||
==Installation== |
|||
See [[Installation]]. |
|||
==Creation== |
==Creation== |
||
{{main|Monodix basics}} |
{{main|Monodix basics}} |
||
Morphological analyser specification files, or morphological dictionaries may be found in all of our [[language pair]] packages, from the [[incubator]], or you may elect to create your own (more instructions at the page ''[[Monodix basics]]''). You can also check out our [[list of dictionaries]], which has statistics on names, locations and number of entries of each of the dictionaries. |
Morphological analyser specification files, or morphological dictionaries may be found in all of our [[List of language pairs|language pair]] packages, from the [[incubator]], or you may elect to create your own (more instructions at the page ''[[Monodix basics]]''). You can also check out our [[list of dictionaries]], which has statistics on names, locations and number of entries of each of the dictionaries. |
||
==Usage== |
==Usage== |
||
===Compilation=== |
===Compilation=== |
||
{{see-also|Compiling dictionaries}} |
|||
Compilation into the binary format is achieved by means of the <code>lt-comp</code> program. You can compile a given <code>.dix</code> from left |
Compilation into the binary format is achieved by means of the <code>lt-comp</code> program. You can compile a given <code>.dix</code> from left to right (<code>LR</code>), or from right to left (<code>RL</code>). Compiling <code>LR</code> usually creates an ''analyser'', compiling <code>RL</code> usually creates a ''generator''.<ref>In all current linguistic packages, the left-to-right direction of compilation is ''analysis'', whereas the right-to-left direction is ''generation''. This is not, however, a software restriction.</ref> |
||
;Example |
;Example |
||
Line 27: | Line 33: | ||
There are two main modes of use for the processor (<code>lt-proc</code>), analysis (which is the default mode) and generation. Analysis converts surface forms into the set of possible lexical forms, while generation converts a lexical form into the corresponding surface form. |
There are two main modes of use for the processor (<code>lt-proc</code>), analysis (which is the default mode) and generation. Analysis converts surface forms into the set of possible lexical forms, while generation converts a lexical form into the corresponding surface form. |
||
{{see-also|Using an lttoolbox dictionary}} |
|||
====Analysis==== |
====Analysis==== |
||
Line 53: | Line 60: | ||
===Expansion=== |
===Expansion=== |
||
Sometimes you want to be able to see the complete output of the dictionary |
Sometimes you want to be able to see the complete output of the dictionary — i.e., all of the mappings between lexical and surface forms. For this you can use the <code>lt-expand</code> tool. This output is often useful in finding bugs in the assignment of paradigms, etc. |
||
;Example |
;Example |
||
Here are the first ten lines that are produced as output from the command to expand the Catalan dictionary in the <code>apertium-es-ca</code> pair. (At last count, the total length of the output was over 2.3 million lines.) |
|||
<pre> |
<pre> |
||
Line 74: | Line 81: | ||
</pre> |
</pre> |
||
;Note |
|||
You cannot run lt-expand directly on the dix.xml file. The dix files in (for example) the cy-en pair have their symbols in a separate file. You need to first run xmllint: |
|||
<pre> |
$ xmllint --xinclude apertium-cy-en.cy.dix.xml > apertium-cy-en.cy.dix |
</pre> |
|||
Then run lt-expand on the apertium-cy-en.cy.dix file, redirecting the output into a text file: |
|||
<pre> |
$ lt-expand apertium-cy-en.cy.dix > cy.dix.expanded.txt |
</pre> |
|||
===Printing=== |
|||
<pre> |
<dictionary> |
<alphabet>abcdefghijklmnopqrstuvwxyz</alphabet> |
<sdefs> |
<sdef n="n"/> |
<sdef n="sg"/> |
<sdef n="pl"/> |
</sdefs> |
<section id="main" type="standard"> |
<e><p><l>beer</l><r>beer<s n="n"/><s n="sg"/></r></p></e> |
<e><p><l>beers</l><r>beer<s n="n"/><s n="pl"/></r></p></e> |
</section> |
</dictionary> |
$ lt-comp lr /tmp/test.dix /tmp/test.bin |
main@standard 8 8 |
$ lt-print /tmp/test.bin |
0 1 b b |
1 2 e e |
2 3 e e |
3 4 r r |
4 5 ε <n> |
4 6 s <n> |
5 7 ε <sg> |
6 7 ε <pl> |
7 |
</pre> |
||
Line 98: | Line 125: | ||
</pre> |
</pre> |
||
Try searching for empty left sides in your dictionary by using <code>lt-expand</code> and <code>grep</code>. For example in the Icelandic dictionary, |
Try searching for empty left sides in your dictionary by using <code>lt-expand</code> and <code>grep</code>. For example, in the Icelandic dictionary, |
||
<pre> |
<pre> |
||
Line 124: | Line 151: | ||
</pre> |
</pre> |
||
This means you should |
This means you should look for the "kunna" verb; where the left side is empty, you should either put something there or add something to the invariant section. |
||
;Empty left side II |
|||
If you get a message like: |
|||
<pre> |
|||
Error: Invalid dictionary (hint: the left side of an entry is empty) |
|||
</pre> |
|||
and grep doesn't find anything, then look in your pardefs. You probably have something like: |
|||
<pre> |
|||
<pardef n="na__S"> |
|||
<e> |
|||
<p> |
|||
<l></l> |
|||
<r></r> |
|||
</p> |
|||
</e> |
|||
</pre> |
|||
Add a dummy entry to make it work: |
|||
<pre> |
|||
<pardef n="na__S"> |
|||
<e> <p><l>NON_ANALYSIS</l> <r>DUE_TO_LT_PROC_HANG</r></p></e> |
|||
<e> |
|||
<p> |
|||
<l></l> |
|||
<r></r> |
|||
</p> |
|||
</e> |
|||
</pre> |
|||
(This one seems like a bug: https://sourceforge.net/p/apertium/tickets/65/ ) |
|||
;Entry beginning with whitespace |
|||
If you get a message like: |
|||
<pre> |
|||
Error: Invalid dictionary (hint: entry beginning with whitespace) |
|||
Error: Input is null - nothing to parse! |
|||
</pre> |
|||
The problem may be in a monolingual dictionary. You may have an entry like: |
|||
<pre> |
|||
<e lm="Pilato"><i> Pilato</i><par n="Juan__np"/></e> |
|||
</pre> |
|||
or |
|||
<pre> |
|||
<e lm="Pilato"><i><b/>Pilato</i><par n="Juan__np"/></e> |
|||
</pre> |
|||
Or also in the bilingual dictionary, for instance: |
|||
<pre> |
|||
<e> <p><l>cave<s n="n"/><s n="f"/></l> <r> soterrani<s n="n"/><s n="m"/></r></p></e> |
|||
</pre> |
|||
; lt-expand/lt-proc on different machines gives different output for metadix |
|||
When trying to e.g. lt-expand dictionaries with the alt-attribute (which is not valid XML according to the dix DTD), some machines might give lines with the attribute and some might skip them. If lttoolbox was compiled against an older version of libxml2, you will get the lines included, with newer libxml2 they will be skipped. To ensure you get the same output for lt-expand/lt-proc on all machines, you might have to do a "make clean; ./autogen.sh && make && make install" (after ensuring your system is up-to-date). |
|||
(In any case, metadix should not be expanded/compiled directly, run the commands on the files that have been through xsltproc, typically in .deps) |
|||
==Speed== |
|||
<pre> |
|||
$ yes word | head -1000000 > /tmp/foo |
|||
$ head /tmp/foo |
|||
word |
|||
word |
|||
word |
|||
... |
|||
$ wc -l /tmp/foo |
|||
1000000 /tmp/foo |
|||
$ time cat /tmp/foo | lt-proc en-ca.automorf.bin >/dev/null |
|||
real 0m17.606s |
|||
user 0m17.281s |
|||
sys 0m0.036s |
|||
58,823 words / second |
|||
</pre> |
|||
==Using as a library== |
|||
See [[Lttoolbox API]] for how to analyse and generate (single) words with lttoolbox from C++ or Python. |
|||
See also [[Daemon#Using_as_libraries]] on how to redirect the FILE streams for longer translation requests. |
|||
==Wishlist / TODO== |
|||
* Being able to have multichar symbols/tags without '<' and '>' |
|||
* https://scan.coverity.com/projects/1192 has some static analysis results, showing some bugs that should probably be fixed. |
|||
===Postgenerator alarms=== |
|||
Many monodixes handle postgenerator alarms <code><a/></code> explicitly in each entry. As a result, left-to-right entries have to be separated from right-to-left entries. However, it could be much easier to write a paradigm like |
|||
<pre> |
|||
<pardef n="wiggle"> |
|||
<e r="LR"> |
|||
<i/> |
|||
</e> |
|||
<e r="RL"> |
|||
<l><a/></l> |
|||
<r></r> |
|||
</e> |
|||
</pardef> |
|||
</pre> |
|||
and then invoke it in entries as follows |
|||
<pre> |
|||
<e lm="a"> |
|||
<par n="wiggle"/> |
|||
<i>a</i> |
|||
<par n="a__pr"/> |
|||
</e> |
|||
</pre> |
|||
This suggests that perhaps it would be even easier to change the meaning of <code><a/></code> so that it only works right to left, or perhaps endow it with the <code>r</code> attribute as follows: <code><a r="RL"/></code>. |
|||
==See also== |
==See also== |
||
* [[Lttoolbox/weights]] for how to make weighted transducers with lttoolbox |
|||
* [[Monodix basics]] |
* [[Monodix basics]] |
||
* [[Using an lttoolbox dictionary]] |
|||
* [[lttoolbox and lexc]] |
|||
* [[Lttoolbox-java]] |
|||
* [[Basic lttoolbox example]] |
|||
==Notes== |
==Notes== |
||
Line 134: | Line 296: | ||
[[Category:Lttoolbox|*]] |
|||
[[Category: |
[[Category:Morphological analysers]] |
||
[[Category:Documentation in English]] |
Latest revision as of 06:37, 16 February 2020
lttoolbox is a toolbox for lexical processing, morphological analysis and generation of words. Analysis is the process of splitting a word (e.g. cats) into its lemma 'cat' and the grammatical information <n><pl>. Generation is the opposite process.
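To make the notation concrete, here is a minimal Python sketch that splits a lexical form such as cat<n><pl> into its lemma and its list of tags. It only illustrates the notation; it does not perform morphological analysis, and the function name is hypothetical rather than part of lttoolbox.

```python
import re


def split_lexical_form(lexical_form):
    """Split a lexical form like 'cat<n><pl>' into ('cat', ['n', 'pl']).

    Hypothetical helper for illustration only; not an lttoolbox function.
    """
    # Everything before the first '<' is the lemma.
    lemma = lexical_form.split("<", 1)[0]
    # Each '<...>' group is one grammatical tag.
    tags = re.findall(r"<([^<>]+)>", lexical_form)
    return lemma, tags


print(split_lexical_form("cat<n><pl>"))  # ('cat', ['n', 'pl'])
```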
The package is split into three programs, lt-comp, the compiler, lt-proc, the processor, and lt-expand, which generates all possible mappings between surface forms and lexical forms in the dictionary.
Installation
See Installation.
Creation
- Main article: Monodix basics
Morphological analyser specification files, or morphological dictionaries, can be found in all of our language pair packages and in the incubator, or you may elect to create your own (more instructions at the page Monodix basics). You can also check out our list of dictionaries, which has statistics on the names, locations and number of entries of each of the dictionaries.
Usage
Compilation
- See also: Compiling dictionaries
Compilation into the binary format is achieved by means of the lt-comp program. You can compile a given .dix from left to right (LR), or from right to left (RL). Compiling LR usually creates an analyser, compiling RL usually creates a generator.[1]
- Example
Compile the apertium-es-ca.ca.dix dictionary in a left-to-right manner into the binary ca.bin.
$ lt-comp lr apertium-es-ca.ca.dix ca.bin
Processing
There are two main modes of use for the processor (lt-proc), analysis (which is the default mode) and generation. Analysis converts surface forms into the set of possible lexical forms, while generation converts a lexical form into the corresponding surface form.
- See also: Using an lttoolbox dictionary
Analysis
After compiling the apertium-es-ca.ca.dix file left-to-right into ca.morf.bin, we can analyse Catalan:
- Example
$ echo "prova" | lt-proc ca.morf.bin
^prova/prova<n><f><sg>/provar<vblex><pri><p3><sg>/provar<vblex><imp><p2><sg>$
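The analysis output above packs the surface form and all of its lexical forms into one unit delimited by ^ and $ and separated by slashes. The following Python sketch shows one way such a unit could be taken apart. It assumes the simple shape shown in the example (no escaped slashes inside forms), and the function name is hypothetical.

```python
def parse_lexical_unit(unit):
    """Split '^surface/analysis1/analysis2$' into (surface, [analyses]).

    Assumption: forms contain no escaped '/' characters, as in the
    example output above. Hypothetical helper, not part of lttoolbox.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit of the form ^...$")
    # Drop the delimiters, then split surface form from its analyses.
    surface, *analyses = unit[1:-1].split("/")
    return surface, analyses


surface, analyses = parse_lexical_unit(
    "^prova/prova<n><f><sg>/provar<vblex><pri><p3><sg>/provar<vblex><imp><p2><sg>$"
)
print(surface)        # prova
print(len(analyses))  # 3
```

Keeping the analyses as a list preserves the ambiguity that the analyser reports; choosing between them is left to later processing.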
Generation
And compiling it right-to-left, we can generate:
- Example
$ echo "^prova<n><f><pl>$" | lt-proc -g ca.gen.bin
proves
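As a rough sketch, the same generation step could be driven from Python by building the ^...$ unit and piping it to lt-proc -g, mirroring the shell command above. The helper names are hypothetical, the generate function assumes lt-proc is installed and on the PATH, and ca.gen.bin is the generator file from the example.

```python
import subprocess


def to_generation_input(lemma, tags):
    """Build a unit such as '^prova<n><f><pl>$' from a lemma and tags."""
    return "^" + lemma + "".join(f"<{tag}>" for tag in tags) + "$"


def generate(lexical_unit, generator_path):
    """Pipe one unit through 'lt-proc -g', as in the shell example.

    Assumption: lt-proc is installed and generator_path is a dictionary
    compiled right-to-left (e.g. ca.gen.bin).
    """
    result = subprocess.run(
        ["lt-proc", "-g", generator_path],
        input=lexical_unit + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(to_generation_input("prova", ["n", "f", "pl"]))  # ^prova<n><f><pl>$
# With lttoolbox installed, one could then call:
#   generate("^prova<n><f><pl>$", "ca.gen.bin")
```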
Expansion
Sometimes you want to be able to see the complete output of the dictionary, i.e., all of the mappings between lexical and surface forms. For this you can use the lt-expand tool. This output is often useful in finding bugs in the assignment of paradigms, etc.
- Example
Here are the first ten lines that are produced as output from the command to expand the Catalan dictionary in the apertium-es-ca pair. (At last count, the total length of the output was over 2.3 million lines.)
$ lt-expand apertium-es-ca.ca.dix
abdominals:abdominal<adj><mf><pl>
abdominal:abdominal<adj><mf><sg>
absents:absent<adj><mf><pl>
absent:absent<adj><mf><sg>
absolutes:absolut<adj><f><pl>
absoluta:absolut<adj><f><sg>
absoluts:absolut<adj><m><pl>
absolut:absolut<adj><m><sg>
abstractes:abstracte<adj><mf><pl>
abstracta:abstracte<adj><f><sg>
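Because each expanded line has the shape surface:lexical, the output is easy to post-process when checking paradigm assignments. The sketch below groups expanded lines by lemma. It assumes every line has exactly the simple surface:lexical shape shown above, and the function name is hypothetical.

```python
from collections import defaultdict

# A few lines copied from the example output above.
SAMPLE = """\
abdominals:abdominal<adj><mf><pl>
abdominal:abdominal<adj><mf><sg>
absolutes:absolut<adj><f><pl>
absoluta:absolut<adj><f><sg>
"""


def forms_by_lemma(expanded_text):
    """Group (surface, lexical) pairs by lemma.

    Assumption: each non-empty line is 'surface:lexical', as in the
    sample; other line shapes are not handled in this sketch.
    """
    table = defaultdict(list)
    for line in expanded_text.splitlines():
        if not line.strip():
            continue
        surface, lexical = line.split(":", 1)
        lemma = lexical.split("<", 1)[0]
        table[lemma].append((surface, lexical))
    return dict(table)


for lemma, forms in forms_by_lemma(SAMPLE).items():
    print(lemma, [surface for surface, _ in forms])
```

Listing the surface forms per lemma makes a missing or unexpected form in a paradigm easier to spot than scanning the raw output.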
Printing
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <section id="main" type="standard">
    <e><p><l>beer</l><r>beer<s n="n"/><s n="sg"/></r></p></e>
    <e><p><l>beers</l><r>beer<s n="n"/><s n="pl"/></r></p></e>
  </section>
</dictionary>
$ lt-comp lr /tmp/test.dix /tmp/test.bin
main@standard 8 8
$ lt-print /tmp/test.bin
0 1 b b
1 2 e e
2 3 e e
3 4 r r
4 5 ε <n>
4 6 s <n>
5 7 ε <sg>
6 7 ε <pl>
7
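The printed transitions can be read as "source state, target state, input symbol, output symbol", with a line holding a single state number marking a final state. As a hedged illustration, the Python sketch below parses the output shown above and walks every path from state 0 to a final state, which recovers the two dictionary entries. It assumes state 0 is the start state and that the transducer has no cycles; the function names are hypothetical.

```python
LT_PRINT_OUTPUT = """\
0 1 b b
1 2 e e
2 3 e e
3 4 r r
4 5 ε <n>
4 6 s <n>
5 7 ε <sg>
6 7 ε <pl>
7
"""

EPSILON = "ε"  # the empty symbol in the printed output


def parse_transducer(text):
    """Return (transitions, finals) from whitespace-separated lines."""
    transitions, finals = {}, set()
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) < 4:
            # A line without input/output symbols names a final state.
            finals.add(int(fields[0]))
        else:
            src, dst, inp, out = fields[:4]
            transitions.setdefault(int(src), []).append((int(dst), inp, out))
    return transitions, finals


def enumerate_paths(transitions, finals, state=0, left="", right=""):
    """Yield (input string, output string) for each path to a final state.

    Assumptions: state 0 is the start state and there are no cycles.
    """
    if state in finals:
        yield left, right
    for dst, inp, out in transitions.get(state, []):
        yield from enumerate_paths(
            transitions, finals, dst,
            left + ("" if inp == EPSILON else inp),
            right + ("" if out == EPSILON else out),
        )


transitions, finals = parse_transducer(LT_PRINT_OUTPUT)
print(list(enumerate_paths(transitions, finals)))
# [('beer', 'beer<n><sg>'), ('beers', 'beer<n><pl>')]
```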
Troubleshooting
- Empty left side
If you get a message like:
Error: Invalid dictionary (hint: the left side of an entry is empty)
Try searching for empty left sides in your dictionary by using lt-expand and grep. For example, in the Icelandic dictionary,
$ lt-expand apertium-fo-is.is.dix | grep ^:
:kunna<vblex><imp><p2><sg>
:kunna<vblex><imp><p1><pl>
:kunna<vblex><imp><p2><pl>
The empty left side will look something like:
<e>
  <p>
    <l></l>
    <r>kunna<s n="vblex"/><s n="imp"/><s n="p2"/><s n="pl"/></r>
  </p>
</e>
It is not possible to have an empty left side in a paradigm if you have no invariant (<i>) section in the main section entry, e.g.
<e lm="kunna"><i></i><par n="/kunna__vblex"/></e>
This means you should look for the "kunna" verb; where the left side is empty, you should either put something there or add something to the invariant section.
- Empty left side II
If you get a message like:
Error: Invalid dictionary (hint: the left side of an entry is empty)
and grep doesn't find anything, then look in your pardefs. You probably have something like:
<pardef n="na__S">
  <e>
    <p>
      <l></l>
      <r></r>
    </p>
  </e>
Add a dummy entry to make it work:
<pardef n="na__S">
  <e> <p><l>NON_ANALYSIS</l> <r>DUE_TO_LT_PROC_HANG</r></p></e>
  <e>
    <p>
      <l></l>
      <r></r>
    </p>
  </e>
(This one seems like a bug: https://sourceforge.net/p/apertium/tickets/65/ )
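When grep on the expanded output finds nothing, the pardefs themselves can be scanned for empty left sides. The following Python sketch does this with the standard library XML parser. The sample dictionary is made up for illustration (the beer__n paradigm name is hypothetical), and the flagged paradigms are only candidates to inspect, since whether an empty left side is actually a problem depends on the entries that use the paradigm.

```python
import xml.etree.ElementTree as ET

# Made-up sample; 'beer__n' is a hypothetical paradigm name.
SAMPLE_DIX = """\
<dictionary>
  <pardefs>
    <pardef n="na__S">
      <e><p><l></l><r></r></p></e>
    </pardef>
    <pardef n="beer__n">
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
</dictionary>
"""


def pardefs_with_empty_left(dix_xml):
    """Return the names of pardefs containing an entry with an empty <l>."""
    root = ET.fromstring(dix_xml)
    names = []
    for pardef in root.iter("pardef"):
        for left in pardef.iter("l"):
            # Empty means: no text and no child elements.
            if not (left.text or "").strip() and len(left) == 0:
                names.append(pardef.get("n"))
                break
    return names


print(pardefs_with_empty_left(SAMPLE_DIX))  # ['na__S']
```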
- Entry beginning with whitespace
If you get a message like:
Error: Invalid dictionary (hint: entry beginning with whitespace)
Error: Input is null - nothing to parse!
The problem may be in a monolingual dictionary. You may have an entry like:
<e lm="Pilato"><i> Pilato</i><par n="Juan__np"/></e>
or
<e lm="Pilato"><i><b/>Pilato</i><par n="Juan__np"/></e>
Or also in the bilingual dictionary, for instance:
<e> <p><l>cave<s n="n"/><s n="f"/></l> <r> soterrani<s n="n"/><s n="m"/></r></p></e>
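A check for the three patterns above can be sketched in Python. It looks at the first element of each entry and flags it when its text starts with whitespace or when it opens with a <b/> blank. The sample reuses the entries shown above plus one clean entry; the function names are hypothetical, and this is only an approximation of what the compiler itself checks.

```python
import xml.etree.ElementTree as ET

ENTRIES = """\
<section>
  <e lm="Pilato"><i> Pilato</i><par n="Juan__np"/></e>
  <e lm="Pilato"><i><b/>Pilato</i><par n="Juan__np"/></e>
  <e lm="Pilato"><i>Pilato</i><par n="Juan__np"/></e>
  <e><p><l>cave<s n="n"/><s n="f"/></l><r> soterrani<s n="n"/><s n="m"/></r></p></e>
</section>
"""


def starts_with_blank(element):
    """True if the element's content begins with whitespace or <b/>."""
    text = element.text or ""
    if text:
        return text[0].isspace()
    return len(element) > 0 and element[0].tag == "b"


def entries_beginning_with_whitespace(xml_text):
    """Return the 0-based positions of suspicious <e> entries."""
    root = ET.fromstring(xml_text)
    flagged = []
    for index, entry in enumerate(root.iter("e")):
        if len(entry) == 0:
            continue
        first = entry[0]
        if first.tag == "i":
            candidates = [first]
        elif first.tag == "p":
            # Either side of a pair may start the entry, depending on
            # the direction of compilation.
            candidates = [child for child in first if child.tag in ("l", "r")]
        else:
            candidates = []
        if any(starts_with_blank(el) for el in candidates):
            flagged.append(index)
    return flagged


print(entries_beginning_with_whitespace(ENTRIES))  # [0, 1, 3]
```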
- lt-expand/lt-proc on different machines gives different output for metadix
When running e.g. lt-expand on dictionaries that use the alt attribute (which is not valid XML according to the dix DTD), some machines might output the lines carrying the attribute and some might skip them. If lttoolbox was compiled against an older version of libxml2, the lines are included; with a newer libxml2 they are skipped. To ensure you get the same output for lt-expand/lt-proc on all machines, you might have to do a "make clean; ./autogen.sh && make && make install" (after ensuring your system is up-to-date).
(In any case, a metadix should not be expanded or compiled directly; run the commands on the files that have been through xsltproc, typically in .deps.)
Speed
$ yes word | head -1000000 > /tmp/foo
$ head /tmp/foo
word
word
word
...
$ wc -l /tmp/foo
1000000 /tmp/foo
$ time cat /tmp/foo | lt-proc en-ca.automorf.bin >/dev/null
real 0m17.606s
user 0m17.281s
sys 0m0.036s
58,823 words / second
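The throughput figure can be reproduced with a little arithmetic. The quoted 58,823 words per second appears to come from dividing the 1,000,000 input lines by 17 whole seconds; this is an inference from the numbers, not something the benchmark states. Dividing by the full real time of 17.606 seconds gives a slightly lower figure.

```python
lines = 1_000_000       # line count reported by wc -l
real_seconds = 17.606   # 'real' time reported by time

# Dividing by whole seconds reproduces the quoted figure.
quoted = lines // 17
print(quoted)  # 58823

# Dividing by the full real time gives roughly 56,800 words per second.
precise = lines / real_seconds
print(round(precise))
```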
Using as a library
See Lttoolbox API for how to analyse and generate (single) words with lttoolbox from C++ or Python.
See also Daemon#Using_as_libraries on how to redirect the FILE streams for longer translation requests.
Wishlist / TODO
- Being able to have multichar symbols/tags without '<' and '>'
- https://scan.coverity.com/projects/1192 has some static analysis results, showing some bugs that should probably be fixed.
Postgenerator alarms
Many monodixes handle postgenerator alarms <a/> explicitly in each entry. As a result, left-to-right entries have to be separated from right-to-left entries. However, it could be much easier to write a paradigm like
<pardef n="wiggle">
  <e r="LR">
    <i/>
  </e>
  <e r="RL">
    <l><a/></l>
    <r></r>
  </e>
</pardef>
and then invoke it in entries as follows
<e lm="a">
  <par n="wiggle"/>
  <i>a</i>
  <par n="a__pr"/>
</e>
This suggests that perhaps it would be even easier to change the meaning of <a/> so that it only works right to left, or perhaps endow it with the r attribute as follows: <a r="RL"/>.
See also
- Lttoolbox/weights for how to make weighted transducers with lttoolbox
- Monodix basics
- Using an lttoolbox dictionary
- lttoolbox and lexc
- Lttoolbox-java
- Basic lttoolbox example
Notes
- ↑ In all current linguistic packages, the left-to-right direction of compilation is analysis, whereas the right-to-left direction is generation. This is not, however, a software restriction.