Difference between revisions of "Apertium separable"
Revision as of 15:40, 9 August 2017
Lttoolbox provides a module for reordering separable/discontiguous multiwords and processing them in the pipeline. Multiwords are manually written in an additional XML-format dictionary.
Installing
Prerequisites and compilation are the same as for lttoolbox and apertium. See Installation. The code can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable/src and instructions for compiling a dictionary are below.
It is not currently part of distributed Apertium binaries.
Lexical transfer in the pipeline
lsx-proc runs between apertium-tagger and apertium-pretransfer:
… | apertium-tagger -g eng.prob | lsx-proc english.bin | apertium-pretransfer | …
Example
A sentence in plain text:
Thus, it was asserted that a tax on foreign workers would reduce the numbers coming in and “taking jobs away” from American citizens.
This is the output of feeding the sentence through apertium-tagger:
^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ “^take<vblex><ger>$ ^job<n><pl>$ ^away<adv>$” ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$
This is the output of feeding the output above through lsx-proc:
^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ “^take# away<vblex><sep><ger>$ ^job<n><pl>$” ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$
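To make the difference between the two outputs easier to see, the following is a minimal Python sketch that splits a stream fragment into lexical units. The regular expression and the `parse_units` helper are illustrative assumptions, not part of lttoolbox, and the sketch ignores escaped characters and superblanks.

```python
import re

# One lexical unit: ^lemma<tag><tag>...optional-queue$
# (simplified; escaped characters and superblanks are not handled)
UNIT = re.compile(r"\^([^<$]*)((?:<[^<>]+>)*)([^$]*)\$")


def parse_units(stream):
    """Return a list of (lemma, [tags]) pairs found in an Apertium stream."""
    units = []
    for match in UNIT.finditer(stream):
        head, tags, queue = match.groups()
        # A queue such as "# in" in ^come<vblex><ger># in$ belongs to the lemma.
        units.append((head + queue, re.findall(r"<([^<>]+)>", tags)))
    return units


before = "“^take<vblex><ger>$ ^job<n><pl>$ ^away<adv>$”"
after = "“^take# away<vblex><sep><ger>$ ^job<n><pl>$”"

for label, fragment in (("tagger", before), ("lsx-proc", after)):
    print(label, parse_units(fragment))
```

Run on the fragment inside the quotation marks, the sketch shows three units before lsx-proc and two after it: the particle is folded into the verb's lemma, a `sep` tag is added after `vblex`, and the noun keeps its analysis.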
Compilation and Usage
Make a dictionary file:
<dictionary type="separable">
    <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
    <sdefs>
        <sdef n="adj"/>
        <sdef n="adv"/>
        <sdef n="n"/>
        <sdef n="sep"/>
        <sdef n="vblex"/>
    </sdefs>
    <pardefs>
        <pardef n="adj">
            <e><i><w/><s n="adj"/><j/></i></e>
            <e><i><w/><s n="adj"/><t/><j/></i></e>
        </pardef>
        <pardef n="n">
            <e><i><w/><s n="n"/><t/><j/></i></e>
        </pardef>
        <pardef n="SN">
            <e><par n="n"/></e>
            <e><par n="adj"/><par n="n"/></e>
            <e><par n="adj"/><par n="adj"/><par n="n"/></e>
        </pardef>
        <pardef n="freq-adv">
            <e><i>always<s n="adv"/><j/></i></e>
            <e><i>annually<s n="adv"/><j/></i></e>
            <e><i>biannually<s n="adv"/><j/></i></e>
        </pardef>
    </pardefs>
    <section id="main" type="standard">
        <e lm="be late" c="llegar tarde">
            <p><l>be<s n="vblex"/></l><r>be<g><b/>late</g><s n="vblex"/><s n="sep"/></r></p><i><t/><j/></i>
            <par n="freq-adv"/><p><l>late<t/></l><r></r></p>
        </e>
        <e lm="take away" c="sacar, quitar">
            <p><l>take<s n="vblex"/></l><r>take<g><b/>away</g><s n="vblex"/><s n="sep"/></r></p><i><t/><j/></i>
            <par n="SN"/><p><l>away<t/></l><r></r></p>
        </e>
    </section>
</dictionary>
Note:
- <w/> stands for one or more alphabetic symbols
- <t/> stands for one or more tags (multicharacter symbols).
i.e.
- <e><i><w/><s n="adj"/><t/><j/></i></e> is equivalent to any-one-or-more-chars<adj><required-anytag><optional-anytag><...>
- <e><i><w/><s n="adj"/><j/></i></e> is equivalent to any-one-or-more-chars<adj><optional-anytag><...>
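Before compiling, it can be useful to check the dictionary for obvious slips. The Python sketch below works on the assumption that, as in other lttoolbox dictionaries, every tag used in an `<s n="..."/>` should be declared in `<sdefs>` and every `<par n="..."/>` should refer to an existing `<pardef>`. The `check` helper and the trimmed dictionary in `DIX` are illustrative only; whether lsx-comp itself enforces these rules is not stated here.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the dictionary above (one paradigm, one entry),
# with the entry pointing at the "n" paradigm to keep the sample short.
DIX = """<dictionary type="separable">
    <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
    <sdefs>
        <sdef n="adv"/>
        <sdef n="n"/>
        <sdef n="sep"/>
        <sdef n="vblex"/>
    </sdefs>
    <pardefs>
        <pardef n="n">
            <e><i><w/><s n="n"/><t/><j/></i></e>
        </pardef>
    </pardefs>
    <section id="main" type="standard">
        <e lm="take away" c="sacar, quitar">
            <p><l>take<s n="vblex"/></l><r>take<g><b/>away</g><s n="vblex"/><s n="sep"/></r></p><i><t/><j/></i>
            <par n="n"/><p><l>away<t/></l><r></r></p>
        </e>
    </section>
</dictionary>"""


def check(dix_text):
    """Return (entry lemmas, undeclared tags, undeclared paradigms)."""
    root = ET.fromstring(dix_text)
    declared_tags = {sdef.get("n") for sdef in root.iter("sdef")}
    declared_pars = {pardef.get("n") for pardef in root.iter("pardef")}
    used_tags = {s.get("n") for s in root.iter("s")}
    used_pars = {par.get("n") for par in root.iter("par")}
    lemmas = [e.get("lm") for e in root.iter("e") if e.get("lm")]
    return lemmas, sorted(used_tags - declared_tags), sorted(used_pars - declared_pars)


lemmas, missing_tags, missing_pars = check(DIX)
print("entries:", lemmas)
print("undeclared tags:", missing_tags)
print("undeclared paradigms:", missing_pars)
```

To check a real file, read its contents and pass the text to `check`; both lists should come back empty for a consistent dictionary.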
Then compile it:
$ lsx-comp dictionary.xml english.bin
main@standard 61 73
The input to lsx-proc is the output of apertium-tagger:
$ echo '^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ “^take<vblex><ger>$ ^job<n><pl>$ ^away<adv>$” ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$' | lsx-proc english.bin
A larger example dictionary can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable/examples/new-example.dix
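The effect of the "take away" entry can be imitated with a regular expression, which may help when reasoning about what an entry should match. This is only a toy Python sketch, not how lsx-proc works internally (it runs a compiled transducer). The pattern names and `join_take_away` are made up for illustration, `<w/>` is read as one or more ASCII letters, `<t/>` as one or more tags, and units are assumed to be separated by single spaces.

```python
import re

TAG = r"(?:<[^<>]+>)"                      # a single tag such as <pl>
ADJ = rf"\^[A-Za-z]+<adj>{TAG}*\$"         # adjective, with or without further tags
NOUN = rf"\^[A-Za-z]+<n>{TAG}+\$"          # noun followed by at least one more tag
SN = rf"(?:{ADJ} ){{0,2}}{NOUN}"           # up to two adjectives, then a noun

# take<vblex>... + noun phrase + away<...>
TAKE_AWAY = re.compile(rf"\^take<vblex>({TAG}+)\$ ({SN}) \^away{TAG}+\$")


def join_take_away(stream):
    """Fold the particle into the verb and mark the result with <sep>."""
    return TAKE_AWAY.sub(r"^take# away<vblex><sep>\1$ \2", stream)


fragment = "“^take<vblex><ger>$ ^job<n><pl>$ ^away<adv>$”"
print(join_take_away(fragment))
```

The verb's remaining tags and the noun phrase are copied through unchanged, while the separate `away` unit disappears, mirroring the empty `<r></r>` in the dictionary entry.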
Dictionary format
A paradigm is made up of:
A dictionary entry is made up of:
Preparedness of languages
| Language | Entries |
|---|---|
| apertium-eng | 18,563 |
Todo and bugs
- Decide whether the lsx module is part of monolingual modules, language pairs, either, or both.
- Instead of dictionary.xml and english.bin and the like, we should have standardised naming conventions. Some options/proposals:
  - eng-cat.autolsx.xml, eng-cat.autolsx.bin
  - eng-cat.autosep.lsx, eng-cat.autosep.bin
  - ...
 

