Apertium separable

Revision as of 22:02, 31 August 2017 by Firespeaker (talk | contribs)

Lttoolbox provides a module for reordering separable/discontiguous multiwords and processing them in the pipeline. Multiwords are manually written in an additional xml-format dictionary.


Prerequisites and compilation are the same as for lttoolbox and apertium. See Installation. On Debian/Ubuntu derivatives, the module is available from the nightly repository: apt-get install apertium-separable.

The code can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable and instructions for compiling the module are:

make install

You'll need an up-to-date version of lttoolbox and associated libraries, and zlib (debian: zlib1g-dev).

Outside the nightly repositories, where it is packaged as the apertium-separable module, it is not currently part of the distributed Apertium binaries for other distros/OSs.

Lexical transfer in the pipeline

lsx-proc runs directly after apertium-tagger and apertium-pretransfer:
(Note: this page previously said that lsx-proc runs between apertium-tagger and apertium-pretransfer. It has since been determined that it should run after apertium-pretransfer.)

… | apertium-tagger -g en-es.prob |  apertium-pretransfer | lsx-proc en-es.autoseq.bin | …


Creating the lsx-dictionary

Make a dictionary file:

<dictionary type="separable">
    <sdefs>
        <sdef n="adj"/>
        <sdef n="adv"/>
        <sdef n="n"/>
        <sdef n="sep"/>
        <sdef n="vblex"/>
        <sdef n="vbser"/>
    </sdefs>
    <pardefs>
        <pardef n="adj">
            <e><i><w/><s n="adj"/><j/></i></e>
            <e><i><w/><s n="adj"/><t/><j/></i></e>
        </pardef>
        <pardef n="n">
            <e><i><w/><s n="n"/><t/><j/></i></e>
        </pardef>
        <pardef n="SN">
            <e><par n="n"/></e>
            <e><par n="adj"/><par n="n"/></e>
            <e><par n="adj"/><par n="adj"/><par n="n"/></e>
        </pardef>
        <pardef n="freq-adv">
            <e><i>always<s n="adv"/><j/></i></e>
            <e><i>annually<s n="adv"/><j/></i></e>
            <e><i>biannually<s n="adv"/><j/></i></e>
        </pardef>
    </pardefs>
    <section id="main" type="standard">
        <!-- SAdv is referenced below, but its pardef is not shown in this excerpt -->
        <e lm="be late" c="llegar tarde">
            <p><l>be<s n="vbser"/></l><r>be<g><b/>late</g><s n="vbser"/><s n="sep"/></r></p><i><t/><j/></i>
            <par n="SAdv"/><p><l>late<t/><j/></l><r></r></p>
        </e>
        <e lm="take away" c="sacar, quitar">
            <p><l>take<s n="vblex"/></l><r>take<g><b/>away</g><s n="vblex"/><s n="sep"/></r></p><i><t/><j/></i>
            <par n="SN"/><p><l>away<t/><j/></l><r></r></p>
        </e>
    </section>
</dictionary>


  • <w/> stands for one or more alphabetic symbols
  • <t/> stands for one or more tags (multicharacter symbols).


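  • <e><i><w/><s n="adj"/><t/><j/></i></e> matches one or more characters, then <adj>, then at least one further tag (more are optional), then the end of the lexical unit ($):
    • ^tall<adj><sint>$
  • <e><i><w/><s n="adj"/><j/></i></e> matches one or more characters, then <adj> with no further tags, then the end of the lexical unit ($):
    • ^tall<adj>$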
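
The two adj patterns from the example dictionary can be approximated with regular expressions to see which lexical units they accept. This is only a sketch for intuition: lsx-comp compiles a transducer, not a regex, and the names W, T, J and matches are made up for this illustration. The reading of <j/> as the closing $ of a lexical unit is taken from the equivalences listed above.

```python
import re

# Assumption: these regexes only approximate the described semantics;
# the real module works on a compiled transducer, not on regexes.
W = r"[^\W\d_]+"       # <w/>: one or more alphabetic symbols
T = r"(?:<[^<>]+>)+"   # <t/>: one or more tags
J = r"\$"              # <j/>: end of the lexical unit

# <w/><s n="adj"/><t/><j/>  and  <w/><s n="adj"/><j/>
ADJ_WITH_TAGS = re.compile(r"\^" + W + "<adj>" + T + J)
ADJ_BARE = re.compile(r"\^" + W + "<adj>" + J)


def matches(pattern, unit):
    """Return True if the whole lexical unit is accepted by the pattern."""
    return pattern.fullmatch(unit) is not None


print(matches(ADJ_WITH_TAGS, "^tall<adj><sint>$"))
print(matches(ADJ_BARE, "^tall<adj>$"))
print(matches(ADJ_WITH_TAGS, "^tall<adj>$"))
```

The last call is rejected because <t/> requires at least one tag after <adj>, which is why the example pardef lists both variants.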

A larger example dictionary can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable/examples/apertium-eng-spa.eng-spa.lsx


Compilation into the binary format is achieved by means of the lsx-comp program.

$ lsx-comp apertium-eng-spa.eng-spa.lsx eng-spa.autoseq.bin
main@standard 61 73


Processing can be done using the lsx-proc program.

The input to lsx-proc is the output of apertium-tagger and apertium-pretransfer:

$ echo '^take<vblex><imp>$ ^prpers<prn><obj><p3><nt><sg>$ ^out of<pr>$ ^there<adv>$^.<sent>$' | lsx-proc eng-spa.autoseq.bin
^take# out<vblex><sep><imp>$ ^prpers<prn><obj><p3><nt><sg>$ ^of<pr>$ ^there<adv>$^.<sent>$
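
To see exactly what changed between the two lines, the stream can be split into lexical units and compared position by position. The sketch below is a simplified reader written for this page, not part of lsx-proc: parse_units is a hypothetical helper, and it ignores escaped characters and superblanks, which real Apertium streams may contain.

```python
import re

# One lexical unit has the shape ^lemma<tag1><tag2>...$
UNIT = re.compile(r"\^([^<$]*)((?:<[^<>]+>)*)\$")


def parse_units(stream):
    """Return a list of (lemma, [tags]) for each ^...$ unit in the stream."""
    return [(lemma, re.findall(r"<([^<>]+)>", tags))
            for lemma, tags in UNIT.findall(stream)]


before = ("^take<vblex><imp>$ ^prpers<prn><obj><p3><nt><sg>$ "
          "^out of<pr>$ ^there<adv>$^.<sent>$")
after = ("^take# out<vblex><sep><imp>$ ^prpers<prn><obj><p3><nt><sg>$ "
         "^of<pr>$ ^there<adv>$^.<sent>$")

# Both streams have the same number of units, so they can be zipped.
changed = [(b, a) for b, a in zip(parse_units(before), parse_units(after)) if b != a]
for b, a in changed:
    print(b, "->", a)
```

Only two units differ: the verb gains "# out" and the <sep> tag, and the preposition unit loses "out".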

Example usages

Example #1: a sentence in plain text:

The Aragonese took Ramiro out of a monastery and made him king.

This is the output of feeding the sentence through apertium-tagger and then apertium-pretransfer:

^the<det><def><sp>$ ^Aragonese<n><sg>$ ^take<vblex><past>$ ^Ramiro<np><ant><m><sg>$ ^out of<pr>$ ^a<det><ind><sg>$ ^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$

This is the output of feeding the output above through lsx-proc with apertium-eng-spa.eng-spa.lsx:

^the<det><def><sp>$ ^Aragonese<n><sg>$ ^take# out<vblex><sep><past>$ ^Ramiro<np><ant><m><sg>$ ^of<pr>$ ^a<det><ind><sg>$ ^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$

Naming Convention

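For a language pair such as eng-cat, the source dictionary is named apertium-eng-cat.eng-cat.lsx and the compiled binary is named eng-cat.autoseq.bin.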
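
The convention can be written down as a small helper, assuming it generalises from the eng-cat and eng-spa file names used on this page to any pair of language codes. The function names lsx_names and compile_command are hypothetical; the command itself mirrors the lsx-comp invocation shown earlier.

```python
def lsx_names(src, trg):
    """Return (source dictionary, compiled binary) for a language pair."""
    pair = f"{src}-{trg}"
    return f"apertium-{pair}.{pair}.lsx", f"{pair}.autoseq.bin"


def compile_command(src, trg):
    """Build the lsx-comp argument list for a pair; it is not executed here."""
    source, binary = lsx_names(src, trg)
    return ["lsx-comp", source, binary]


print(lsx_names("eng", "cat"))
print(" ".join(compile_command("eng", "spa")))
```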


Segmentation fault

A segmentation fault can occur on compilation or usage: the lsx-dictionary compiles fine with zero entries but gives a segmentation fault once entries are added.

No solution has been found yet. Something may be out of date, or the problem may lie in the makefile (?).

make sure that the makefile ...

Complaints about step_override()

Run svn update in your lttoolbox checkout. The module needs an up-to-date version of lttoolbox and associated libraries, and zlib (debian: zlib1g-dev).

Undefined symbol

In your dictionary you are probably using a symbol that you didn't define in the sdefs. Add the symbol to the sdefs.
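
One way to find the offending symbol before compiling is to compare the <s n="..."/> symbols used in the dictionary against the declared <sdef n="..."/> entries. The following is a minimal sketch using Python's standard XML parser; undefined_symbols is a hypothetical helper, the sample dictionary is invented for the illustration, and the check assumes the file is well-formed XML.

```python
import xml.etree.ElementTree as ET


def undefined_symbols(lsx_xml):
    """Return the sorted symbols used in <s n=.../> but missing from the sdefs."""
    root = ET.fromstring(lsx_xml)
    defined = {sdef.get("n") for sdef in root.iter("sdef")}
    used = {s.get("n") for s in root.iter("s")}
    return sorted(used - defined)


# Hypothetical minimal dictionary: vbser is used but never declared.
sample = """<dictionary type="separable">
  <sdefs>
    <sdef n="vblex"/>
    <sdef n="sep"/>
  </sdefs>
  <section id="main" type="standard">
    <e lm="be late">
      <p><l>be<s n="vbser"/></l><r>be<g><b/>late</g><s n="vbser"/><s n="sep"/></r></p>
    </e>
  </section>
</dictionary>"""

print(undefined_symbols(sample))
```

For a dictionary on disk, the same function can be fed the file contents read with open(path, encoding="utf-8").read().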

Future work

  • In theory we're offloading multiwords from the transducers to lsx. This leaves open some questions:
    • how do we do N N compounds with lsx?
    • how does translation to a multiword work? In theory it's possible to invert the transducer, but an attempt to try this (—Firespeaker (talk) 00:02, 1 September 2017 (CEST)) results in a transducer that looks right but doesn't seem to be able to be processed correctly.
  • Recycling dictionaries and/or paradigms: lsx-dictionaries are packaged per language pair, but most of the eng-spa lsx-dictionary could be reused by eng-cat. Could we make use of the similarity?
  • Support for language pairs: the module has not yet had extensive beta testing. The following language pairs have packaged the lsx-module:
    • eng-cat
    • eng-deu (?)
    • kaz-kir

See also