Apertium separable


Lttoolbox provides a module that reorders separable (discontiguous) multiwords so that they can be processed in the pipeline. The multiwords are written by hand in an additional XML dictionary.


Prerequisites and compilation are the same as for lttoolbox and apertium; see Installation. On Debian/Ubuntu derivatives, the module is available from the nightly repo: apt-get install apertium-separable.

The code can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable and instructions for compiling the module are:


It is not currently part of distributed Apertium binaries for other distros/OSs.

Lexical transfer in the pipeline

lsx-proc runs between apertium-tagger and apertium-pretransfer:

… | apertium-tagger -g eng.prob | lsx-proc english.bin | apertium-pretransfer | …


Take a sentence in plain text:

Thus, it was asserted that a tax on foreign workers would reduce the numbers coming in and “taking jobs away” from American citizens.

This is the output of feeding the sentence through apertium-tagger:

^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ “^take<vblex><ger>$ ^job<n><pl>$ ^away<adv>$” ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$

This is the result of feeding the output above through lsx-proc:

^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ “^take# away<vblex><sep><ger>$ ^job<n><pl>$” ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$
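
The change between the two streams can be imitated with a short Python sketch. It is only a toy illustration under stated assumptions: lsx-proc itself runs a compiled transducer, whereas this sketch hard-codes the take ... away rule, and the function name merge_separable and the max_gap limit are invented for the example.

```python
import re

# A simple lexical unit in the Apertium stream format: ^lemma<tag1><tag2>$
# (units whose lemma continues after the tags, like "come<vblex><ger># in",
# are deliberately not matched by this toy pattern).
LU = re.compile(r"\^([^<$]*)((?:<[^<>]+>)*)\$")


def merge_separable(stream, verb="take", particle="away", max_gap=3):
    """Toy imitation of one separable rule; NOT the real lsx-proc algorithm.

    Looks for ^verb<vblex>...$ followed, after 1..max_gap matched units,
    by ^particle...$, and rewrites the verb as ^verb# particle<vblex><sep>...$.
    """
    units = list(LU.finditer(stream))
    for i, v in enumerate(units):
        if v.group(1) != verb or not v.group(2).startswith("<vblex>"):
            continue
        # Candidate particle positions: at least one unit must intervene.
        for p in units[i + 2 : i + 2 + max_gap]:
            if p.group(1) == particle:
                rest = v.group(2)[len("<vblex>"):]  # tags after <vblex>
                merged = f"^{verb}# {particle}<vblex><sep>{rest}$"
                # Keep the intervening units, drop the blank before the particle.
                between = stream[v.end():p.start()].rstrip(" ")
                return stream[:v.start()] + merged + between + stream[p.end():]
    return stream  # no match: leave the stream untouched
```

Applied to the “taking jobs away” fragment of the tagger output, the function returns the same fragment that lsx-proc prints above; a stream without the particle is returned unchanged.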

Compilation and Usage

Make a dictionary file:

<dictionary type="separable">
    <sdefs>
        <sdef n="adj"/>
        <sdef n="adv"/>
        <sdef n="n"/>
        <sdef n="sep"/>
        <sdef n="vblex"/>
        <sdef n="vbser"/> <!-- used by the "be late" entry below -->
    </sdefs>
    <pardefs>
        <pardef n="adj">
            <e><i><w/><s n="adj"/><j/></i></e>
            <e><i><w/><s n="adj"/><t/><j/></i></e>
        </pardef>
        <pardef n="n">
            <e><i><w/><s n="n"/><t/><j/></i></e>
        </pardef>
        <pardef n="SN">
            <e><par n="n"/></e>
            <e><par n="adj"/><par n="n"/></e>
            <e><par n="adj"/><par n="adj"/><par n="n"/></e>
        </pardef>
        <pardef n="freq-adv">
            <e><i>always<s n="adv"/><j/></i></e>
            <e><i>annually<s n="adv"/><j/></i></e>
            <e><i>biannually<s n="adv"/><j/></i></e>
        </pardef>
    </pardefs>
    <section id="main" type="standard">
        <e lm="be late" c="llegar tarde">
            <p><l>be<s n="vbser"/></l><r>be<g><b/>late</g><s n="vbser"/><s n="sep"/></r></p><i><t/><j/></i>
            <!-- the SAdv paradigm is not defined in this excerpt -->
            <par n="SAdv"/><p><l>late<t/><j/></l><r></r></p>
        </e>
        <e lm="take away" c="sacar, quitar">
            <p><l>take<s n="vblex"/></l><r>take<g><b/>away</g><s n="vblex"/><s n="sep"/></r></p><i><t/><j/></i>
            <par n="SN"/><p><l>away<t/><j/></l><r></r></p>
        </e>
    </section>
</dictionary>


  • <w/> stands for one or more alphabetic symbols.
  • <t/> stands for one or more tags (multicharacter symbols).
  • <j/> stands for the end of the lexical unit ($).

For example:

  • <e><i><w/><s n="adj"/><t/><j/></i></e> is equivalent to any-one-or-more-chars<adj><required-anytag><...optional-anytag...><$>
    • ^tall<adj><sint><...>$
  • <e><i><w/><s n="adj"/><j/></i></e> is equivalent to any-one-or-more-chars<adj><$>
    • ^tall<adj>$
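
As a rough analogy, the two adjective patterns can be written as regular expressions. This is a sketch for intuition only: lsx-comp builds a finite-state transducer, not a regex, and the names W, T, J, ADJ_WITH_TAGS and ADJ_BARE are made up for the example.

```python
import re

W = r"[^\W\d_]+"      # <w/>: one or more alphabetic characters
T = r"(?:<[^<>]+>)+"  # <t/>: one or more tags
J = r"\$"             # <j/>: end of the lexical unit

# <w/><s n="adj"/><t/><j/> : adjective followed by at least one more tag
ADJ_WITH_TAGS = re.compile(r"\^" + W + "<adj>" + T + J)

# <w/><s n="adj"/><j/> : adjective with no further tags
ADJ_BARE = re.compile(r"\^" + W + "<adj>" + J)
```

With these definitions, ADJ_WITH_TAGS.fullmatch("^tall<adj><sint>$") succeeds and ADJ_BARE.fullmatch("^tall<adj>$") succeeds, while each pattern rejects the other's example. This is why the adj paradigm needs both entries.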

Then compile it:

$ lsx-comp dictionary.xml english.bin
main@standard 61 73

The input to lsx-proc is the output of apertium-tagger:

$ echo '^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ “^take<vblex><ger>$ ^job<n><pl>$ ^away<adv>$” ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$' | lsx-proc english.bin

A larger example dictionary can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable/examples/new-example.dix

Preparedness of languages

Languages that are beta-testing the module:

  • eng

Todo and bugs

  • Decide whether the lsx module is part of monolingual modules, language pairs, either, or both.
  • Instead of dictionary.xml and english.bin and the like, we should have standardised naming conventions. Some options/proposals:
    • eng-cat.autolsx.xml, eng-cat.autolsx.bin
    • eng-cat.autosep.lsx, eng-cat.autosep.bin
    • apertium-eng-cat.eng-cat.lsx, eng-cat.autoseq.bin
  • blow# out of the water, be# oppose to
  • 10:53 firespeaker: pektii: if we offload multiwords from the transducers to lsx, (1) how do we do N N compounds with lsx? (2) how does translation *to* a multiword work?
  • documentation for using lsx
        <e lm="snjúgva seg um" c="">
           <p><l>snjúgva<s n="vblex"/></l><r>snjúgva<g><b/>seg</g><s n="vblex"/></r></p>
           <p><l>seg<s n="prn"/><t/><j/>um<s n="pr"/></l><r>um<s n="pr"/></r></p>
        </e>

$ lt-print fao-nob.autoseq.bin
0	1	s	s	
1	2	n	n	
2	3	j	j	
3	4	ú	ú	
4	5	g	g	
5	6	v	v	
6	7	a	a	
7	8	<vblex>	#	
8	9	ε	 	
9	10	ε	s	
10	11	ε	e	
11	12	ε	g	
12	13	ε	<vblex>	
13	14	<ANY_TAG>	<ANY_TAG>	
14	14	<ANY_TAG>	<ANY_TAG>	
14	15	<$>	<$>	
15	16	s	u	
16	17	e	m	
17	18	g	<pr>	
18	19	<prn>	ε	
19	20	<ANY_TAG>	ε	
20	21	<$>	ε	
21	22	u	ε	
22	23	m	ε	
23	24	<pr>	ε	
24	25	<$>	<$>	

$ echo "^snjúgva<vblex><ind><pres><p3><sg>$ ^seg<prn><ref><acc>$ ^um<pr>$" | ~/source/apertium/branches/apertium-separable/src/lsx-proc fao-nob.autoseq.bin 
^snjúgva<vblex><ind><pres><p3><sg>$ ^seg<prn><ref><acc>$ ^um<pr>$
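
When debugging an entry like this one, it can help to read the two sides of the lt-print dump as strings. The following Python sketch does that under stated assumptions: columns are tab-separated as source state, target state, input symbol, output symbol; ε marks an empty symbol; and the dump describes a single linear path, as above. The function name sides is invented for the example.

```python
EPSILON = "ε"  # empty-symbol marker as printed in the dump above


def sides(dump):
    """Return (input_side, output_side) of a single-path lt-print dump.

    Symbols are concatenated in the order the transitions are listed.
    Self-loops (source state == target state) are marked with '*'.
    """
    left, right = [], []
    for line in dump.splitlines():
        cols = line.split("\t")  # no strip(): an output symbol may be a blank
        if len(cols) < 4:
            continue  # skip empty lines and lines that are not transitions
        src, dst, sym_in, sym_out = cols[:4]
        star = "*" if src == dst else ""
        if sym_in != EPSILON:
            left.append(sym_in + star)
        if sym_out != EPSILON:
            right.append(sym_out + star)
    return "".join(left), "".join(right)
```

For the dump above, this should give snjúgva<vblex><ANY_TAG><ANY_TAG>*<$>seg<prn><ANY_TAG><$>um<pr><$> on the input side and snjúgva# seg<vblex><ANY_TAG><ANY_TAG>*<$>um<pr><$> on the output side, which makes it easier to compare the pattern with the stream passed to lsx-proc.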

        <e lm="halda fram, at" c="">
           <p><l>halda<s n="vblex"/></l><r>halda<g><b/>fram</g><s n="vblex"/></r></p>
           <p><l>fram<s n="adv"/><j/>,<s n="cm"/><j/>at<s n="cnjsub"/></l><r>,<s n="cm"/><j/>at<s n="cnjsub"/><j/></r></p>
        </e>

$ echo "^at<cnjsub>$ ^*leidningarnir$ ^halda<vblex><inf>$ ^fram<adv>$^,<cm>$ ^at<cnjsub>$" | ~/source/apertium/branches/apertium-separable/src/lsx-proc fao-nob.autoseq.bin 
^at<cnjsub>$ ^*leidningarnir$ ^halda# fram<vblex><inf>$^,<cm>$ ^at<cnjsub>$ ^$

Resolved issues

  • kaz-eng
$ echo "хабар еткен" | apertium-destxt | apertium -f none -d . kaz-eng-tagger | ~/source/apertium/branches/apertium-separable/src/lsx-proc kaz-eng.autoseq.bin 
 ^хабарет<v><tv>$ ^хабарет<v><tv><past>$^хабарет<v><tv><past><p3>$^хабарет<v><tv><past><p3><sg>$^.<sent>$[][
  • kaz-kir

15:35 firespeaker: http://svn.code.sf.net/p/apertium/svn/nursery/apertium-kaz-kir/apertium-kaz-kir.kaz-kir.lsx
15:35 firespeaker: with input ^абай<adj>$ ^бол<v><iv><imp><p2><sg>$

  • deu

wolfgangth Hi, I tested the new module for reordering separable multiwords and I have a problem if one of the entries (the last) has more then one word
wolfgangth before lsx-proc : ^heute Nachmittag<adv>$
wolfgangth after lsx-proc : ^heuteNachmittag<adv>$
wolfgangth the blank was lost if it was part of a rule that was executed

  • /p/apertium/svn/incubator/apertium-fao-nor/apertium-fao-nor.fao-nor.dix
input: ^snjúgva<vblex><ind><pres><p3><sg>$ ^seg<prn><ref><acc>$ ^um<pr>$, expected output: ^snjúgva# seg<vblex><ind><pres><p3><sg>$ ^um<pr>$
input: ^at<cnjsub>$ ^*leidningarnir$ ^halda<vblex><inf>$ ^fram<adv>$^,<cm>$ ^at<cnjsub>$, output: ^at<cnjsub>$ ^*leidningarnir$ ^halda# fram<vblex><adv>$^,<cm>$  ^at<cnjsub>$
Notice the extra space before the final ^at<cnjsub>$ and that the verb comes out tagged <vblex><adv> instead of <vblex><inf>.
  • +

16:35 firespeaker: $ echo "абай болмайсың ба" | apertium -d . kaz-kir-autoseq
16:35 firespeaker: ^абай<adj>$ ^бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$
16:35 firespeaker: oh, it's probably the +
16:35 firespeaker: seems to be okay with everything else
16:36 firespeaker: we'll need to ask spectie how we want to be dealing with this
16:38 irene_: what's the expected output?
16:39 begiak: apertium: jonorthwash * 81610: /nursery/apertium-kaz-kir/: Makefile.am, apertium-kaz-kir.kaz-kir.dix and 2 other files: kaz-kir-autoseq mode
16:39 irene_: of ^абай<adj>$ ^бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$
16:39 firespeaker: ^абай бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$ I guess

  • append <j/> with <t/>? => no
  • append <j/> with every </e> in lsx-comp, instead of writing the final <j/> in the dictionary => no, having lsx-comp append <j/> messes with paradigms


See also