Difference between revisions of "Apertium separable"

Revision as of 09:16, 18 August 2017

Installing

Prerequisites and compilation are the same as lttoolbox and apertium. See Installation. On Debian/Ubuntu derivatives, it is part of the nightly repo as apt-get install apertium-separable.

The code can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable and instructions for compiling the module are:

./autogen.sh
./configure
make

It is not currently part of distributed Apertium binaries for other distros/OSs.

Lexical transfer in the pipeline

lsx-proc runs between apertium-tagger and apertium-pretransfer:

… | apertium-tagger -g eng.prob | lsx-proc english.bin | apertium-pretransfer | …

Example

A sentence in plain text,

Thus, it was asserted that a tax on foreign workers would reduce the numbers coming in and “taking jobs away” from American citizens.

This is the output of feeding the sentence through apertium-tagger :

^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ “^take<vblex><ger>$ ^job<n><pl>$ ^away<adv>$” ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$

This is the output of feeding the output above through lsx-proc :

^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ “^take# away<vblex><sep><ger>$ ^job<n><pl>$” ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$

Compilation and Usage

Make a dictionary file:

<dictionary type="separable">
    <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
    <sdefs>
        <sdef n="adj"/>
        <sdef n="adv"/>
        <sdef n="n"/>
        <sdef n="sep"/>
        <sdef n="vblex"/>
    </sdefs>
    <pardefs>
        <pardef n="adj">
            <e><i><w/><s n="adj"/><j/></i></e>
            <e><i><w/><s n="adj"/><t/><j/></i></e>
        </pardef>
        <pardef n="n">
            <e><i><w/><s n="n"/><t/><j/></i></e>
        </pardef>
        <pardef n="SN">
            <e><par n="n"/></e>
            <e><par n="adj"/><par n="n"/></e>
            <e><par n="adj"/><par n="adj"/><par n="n"/></e>
        </pardef>
        <pardef n="freq-adv">
            <e><i>always<s n="adv"/><j/></i></e>
            <e><i>anually<s n="adv"/><j/></i></e>
            <e><i>bianually<s n="adv"/><j/></i></e>
        </pardef>
    </pardefs>
    <section id="main" type="standard">
        <e lm="be late" c="llegar tarde">
            <p><l>be<s n="vbser"/></l><r>be<g><b/>late</g><s n="vbser"/><s n="sep"/></r></p><i><t/><j/></i>
            <par n="SAdv"/><p><l>late<t/><j/></l><r></r></p>
        </e>
        <e lm="take away" c="sacar, quitar">
            <p><l>take<s n="vblex"/></l><r>take<g><b/>away</g><s n="vblex"/><s n="sep"/></r></p><i><t/><j/></i>
            <par n="SN"/><p><l>away<t/><j/></l><r></r></p>
        </e>
    </section>
</dictionary>

Note:

<w/> stands for one or more alphabetic symbols
<t/> stands for one or more tags (multicharacter symbols).

i.e.

<e><w/><t/><j/></e> is equivalent to any-one-or-more-chars<adj><required-anytag><...optional-anytag...><$>
- ^tall<adj><sint><...>$
<e><w/><j/></e> is equivalent to any-one-or-more-chars<adj><$>
- ^tall<adj>$

Then compile it:

$ lsx-comp dictionary.xml english.bin
main@standard 61 73

The input to lsx-proc is the output of apertium-tagger ,

$ echo '^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ <b>“^take<vblex><ger>$ ^job<n><pl>$ ^away<adv>$”</b> ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$' | lsx-proc english.bin

A larger example dictionary can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable/examples/new-example.dix

Preparedness of languages

Languages that beta-testing the module:

eng

Todo and bugs

~~Decide whether the lsx module is part of monolingual modules, language pairs, either, or both.~~
~~Instead of dictionary.xml and english.bin and the like, we should have standardised naming conventions. Some options/proposals:~~
- ~~eng-cat.autolsx.xml, eng-cat.autolsx.bin~~
- ~~eng-cat.autosep.lsx, eng-cat.autosep.bin~~
- apertium-eng-cat.eng-cat.lsx, eng-cat.autoseq.bin

blow# out of the water, be# oppose to

10:53 firespeaker: pektii: if we offload multiwords from the transducers to lsx, (1) how do we do N N compounds with lsx? (2) how does translation *to* a multiword work?

documentation for using lsx

        <e lm="snjúgva seg um" c="">
           <p><l>snjúgva<s n="vblex"/></l><r>snjúgva<g><b/>seg</g><s n="vblex"/></r></p>
           <i><t/><j/></i>
           <p><l>seg<s n="prn"/><t/><j/>um<s n="pr"/></l><r>um<s n="pr"/></r></p>
           <i><j/></i>
        </e>

$ lt-print fao-nob.autoseq.bin
0	1	s	s	
1	2	n	n	
2	3	j	j	
3	4	ú	ú	
4	5	g	g	
5	6	v	v	
6	7	a	a	
7	8	<vblex>	#	
8	9	ε	 	
9	10	ε	s	
10	11	ε	e	
11	12	ε	g	
12	13	ε	<vblex>	
13	14	<ANY_TAG>	<ANY_TAG>	
14	14	<ANY_TAG>	<ANY_TAG>	
14	15	<$>	<$>	
15	16	s	u	
16	17	e	m	
17	18	g	<pr>	
18	19	<prn>	ε	
19	20	<ANY_TAG>	ε	
20	21	<$>	ε	
21	22	u	ε	
22	23	m	ε	
23	24	<pr>	ε	
24	25	<$>	<$>	
25

$ echo "^snjúgva<vblex><ind><pres><p3><sg>$ ^seg<prn><ref><acc>$ ^um<pr>$" | ~/source/apertium/branches/apertium-separable/src/lsx-proc fao-nob.autoseq.bin 
^snjúgva<vblex><ind><pres><p3><sg>$ ^seg<prn><ref><acc>$ ^um<pr>$

        <e lm="halda fram, at" c="">
           <p><l>halda<s n="vblex"/></l><r>halda<g><b/>fram</g><s n="vblex"/></r></p>
           <i><t/><j/></i>
           <p><l>fram<s n="adv"/><j/>,<s n="cm"/><j/>at<s n="cnjsub"/></l><r>,<s n="cm"/><j/>at<s n="cnjsub"/><j/></r></p>
           <i><j/></i>
        </e>


$ echo "^at<cnjsub>$ ^*leidningarnir$ ^halda<vblex><inf>$ ^fram<adv>$^,<cm>$ ^at<cnjsub>$" | ~/source/apertium/branches/apertium-separable/src/lsx-proc fao-nob.autoseq.bin 
^at<cnjsub>$ ^*leidningarnir$ ^halda# fram<vblex><inf>$^,<cm>$ ^at<cnjsub>$ ^$

Resolved issues

kaz-eng

$ echo "хабар еткен" | apertium-destxt | apertium -f none -d . kaz-eng-tagger | ~/source/apertium/branches/apertium-separable/src/lsx-proc kaz-eng.autoseq.bin 
 ^хабарет<v><tv>$ ^хабарет<v><tv><past>$^хабарет<v><tv><past><p3>$^хабарет<v><tv><past><p3><sg>$^.<sent>$[][

kaz-kir

15:35 firespeaker: http://svn.code.sf.net/p/apertium/svn/nursery/apertium-kaz-kir/apertium-kaz-kir.kaz-kir.lsx
15:35 firespeaker: with input ^абай<adj>$ ^бол<v><iv><imp><p2><sg>$

deu

wolfgangth Hi, I tested the new module for reordering separable multiwords and I have a problem if one of the entries (the last) has more then one word
wolfgangth before lsx-proc : ^heute Nachmittag<adv>$ wolfgangth after lsx-proc : ^heuteNachmittag<adv>$
wolfgangth the blank was lost if it was part of a rule that was executed

/p/apertium/svn/incubator/apertium-fao-nor/apertium-fao-nor.fao-nor.dix

input:  ^snjúgva<vblex><ind><pres><p3><sg>$ ^seg<prn><ref><acc>$ ^um<pr>$   should output: snjúgva# seg<vblex><ind><pres><p3><sg>$ ^um<pr>$
input: ^at<cnjsub>$ ^*leidningarnir$ ^halda<vblex><inf>$ ^fram<adv>$^,<cm>$ ^at<cnjsub>$, output: ^at<cnjsub>$ ^*leidningarnir$ ^halda# fram<vblex><adv>$^,<cm>$  ^at<cnjsub>$
notice the extra space and the fact that you get <vblex><adv> not <vblex><inf>

+

16:35 firespeaker: $ echo "абай болмайсың ба" | apertium -d . kaz-kir-autoseq 16:35 firespeaker: ^абай<adj>$ ^бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$ 16:35 firespeaker: oh, it's probably the + 16:35 firespeaker: seems to be okay with everything else 16:36 firespeaker: we'll need to ask spectie how we want to be dealing with this 16:38 irene_: what's the expected output? 16:39 begiak: apertium: jonorthwash * 81610: /nursery/apertium-kaz-kir/: Makefile.am, apertium-kaz-kir.kaz-kir.dix and 2 other files: kaz-kir-autoseq mode 16:39 irene_: of ^абай<adj>$ ^бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$ 16:39 firespeaker: ^абай бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$ I guess

append <j/> with <t/>? => no

append <j/> with every </e> in lsx-comp, instead of writing the final <j/> in the dictionary => no, having lsx-comp append <j/> messes with paradigms

Troubleshooting

References

https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable
project proposal and workplan

Difference between revisions of "Apertium separable"

Revision as of 09:16, 18 August 2017

Contents

Installing

Lexical transfer in the pipeline

Example

Compilation and Usage

Preparedness of languages

Todo and bugs

Resolved issues

Troubleshooting

See also

References

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools