Apertium separable
Revision as of 18:28, 12 August 2017
Lttoolbox provides a module for reordering separable (discontiguous) multiwords and processing them in the pipeline. The multiwords are written by hand in an additional XML dictionary.
Installing
Prerequisites and compilation are the same as for lttoolbox and apertium. See Installation. The code can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable and the module is compiled with:
./autogen.sh
./configure
make
It is not currently part of distributed Apertium binaries.
Lexical transfer in the pipeline
lsx-proc runs between apertium-tagger and apertium-pretransfer:
… | apertium-tagger -g eng.prob | lsx-proc english.bin | apertium-pretransfer | …
Example
A sentence in plain text,
Thus, it was asserted that a tax on foreign workers would reduce the numbers coming in and “taking jobs away” from American citizens.
This is the output of feeding the sentence through apertium-tagger:
^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ “^take<vblex><ger>$ ^job<n><pl>$ ^away<adv>$” ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$
This is the output of feeding the output above through lsx-proc:
^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ “^take# away<vblex><sep><ger>$ ^job<n><pl>$” ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$
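The visible effect on the "take ... away" fragment can be imitated with a few lines of Python. This is only a toy sketch: lsx-proc works from a compiled dictionary, whereas the function below (a hypothetical helper, `join_separable`) hard-codes a single verb and particle, assumes the verb's first tag is `<vblex>`, and does not check that the intervening units form a noun phrase.

```python
import re

# One lexical unit in the Apertium stream format: ^lemma<tag><tag>...$
LU = re.compile(r"\^([^<$^]+)((?:<[^>]+>)*)\$")

def join_separable(stream, verb, particle, max_gap=3):
    """Move `particle` next to `verb` and add a <sep> tag (illustration only)."""
    units = list(LU.finditer(stream))
    for i, v in enumerate(units):
        if v.group(1) != verb or not v.group(2).startswith("<vblex>"):
            continue
        # Look for the particle after 1..max_gap intervening lexical units.
        for p in units[i + 2 : i + 2 + max_gap]:
            if p.group(1) == particle:
                rest_tags = v.group(2)[len("<vblex>"):]
                merged = f"^{verb}# {particle}<vblex><sep>{rest_tags}$"
                # Keep the material between verb and particle, minus the
                # blank that used to precede the particle.
                between = stream[v.end():p.start()].rstrip()
                return stream[:v.start()] + merged + between + stream[p.end():]
    return stream

fragment = "“^take<vblex><ger>$ ^job<n><pl>$ ^away<adv>$” ^from<pr>$"
print(join_separable(fragment, "take", "away"))
# “^take# away<vblex><sep><ger>$ ^job<n><pl>$” ^from<pr>$
```

Input without a matching particle is returned unchanged.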
Compilation and Usage
Make a dictionary file:
<dictionary type="separable">
  <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="adj"/>
    <sdef n="adv"/>
    <sdef n="n"/>
    <sdef n="sep"/>
    <sdef n="vblex"/>
  </sdefs>
  <pardefs>
    <pardef n="adj">
      <e><i><w/><s n="adj"/><j/></i></e>
      <e><i><w/><s n="adj"/><t/><j/></i></e>
    </pardef>
    <pardef n="n">
      <e><i><w/><s n="n"/><t/><j/></i></e>
    </pardef>
    <pardef n="SN">
      <e><par n="n"/></e>
      <e><par n="adj"/><par n="n"/></e>
      <e><par n="adj"/><par n="adj"/><par n="n"/></e>
    </pardef>
    <pardef n="freq-adv">
      <e><i>always<s n="adv"/><j/></i></e>
      <e><i>anually<s n="adv"/><j/></i></e>
      <e><i>bianually<s n="adv"/><j/></i></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="be late" c="llegar tarde">
      <p><l>be<s n="vblex"/></l><r>be<g><b/>late</g><s n="vblex"/><s n="sep"/></r></p><i><t/><j/></i>
      <par n="freq-adv"/><p><l>late<t/></l><r></r></p>
    </e>
    <e lm="take away" c="sacar, quitar">
      <p><l>take<s n="vblex"/></l><r>take<g><b/>away</g><s n="vblex"/><s n="sep"/></r></p><i><t/><j/></i>
      <par n="SN"/><p><l>away<t/></l><r></r></p>
    </e>
  </section>
</dictionary>
Note:
- <w/> stands for one or more alphabetic symbols.
- <t/> stands for one or more tags (multicharacter symbols).
That is:
- <e><w/><s n="adj"/><t/><j/></e> is equivalent to any-one-or-more-chars<adj><required-anytag><optional-anytag><...>
- <e><w/><s n="adj"/><j/></e> is equivalent to any-one-or-more-chars<adj><optional-anytag><...>
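As a rough analogy, the two expansions in the note can be written as regular expressions. This is a sketch of how the note reads, not a description of how lsx-comp compiles the dictionary; the names `W`, `T`, `ADJ_REQUIRED` and `ADJ_OPTIONAL` are made up for the illustration, and the alphabet is assumed to be plain A-Z and a-z as in the dictionary above.

```python
import re

W = r"[A-Za-z]+"        # <w/>: one or more alphabetic symbols
T = r"(?:<[^<>]+>)+"    # <t/>: one or more tags

# any-one-or-more-chars<adj><required-anytag><optional-anytag><...>
ADJ_REQUIRED = re.compile(rf"{W}<adj>{T}")
# any-one-or-more-chars<adj><optional-anytag><...>
ADJ_OPTIONAL = re.compile(rf"{W}<adj>(?:<[^<>]+>)*")

def matches(pattern, analysis):
    """True if the whole analysis (lemma plus tags) fits the pattern."""
    return pattern.fullmatch(analysis) is not None

print(matches(ADJ_REQUIRED, "big<adj><sint>"))  # True
print(matches(ADJ_REQUIRED, "foreign<adj>"))    # False: no tag after <adj>
print(matches(ADJ_OPTIONAL, "foreign<adj>"))    # True
```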
Then compile it:
$ lsx-comp dictionary.xml english.bin
main@standard 61 73
The input to lsx-proc is the output of apertium-tagger:
$ echo '^thus<adv>$^,<cm>$ ^prpers<prn><subj><p3><nt><sg>$ ^be<vbser><past><p3><sg>$ ^assert<vblex><pp>$ ^that<prn><tn><mf><sg>$ ^a<det><ind><sg>$ ^tax<n><sg>$ ^on<pr>$ ^foreign<adj>$ ^worker<n><pl>$ ^would<vaux><inf>$ ^reduce<vblex><inf>$ ^the<det><def><sp>$ ^number<vblex><pri><p3><sg>$ ^come<vblex><ger># in$ ^and<cnjcoo>$ “^take<vblex><ger>$ ^job<n><pl>$ ^away<adv>$” ^from<pr>$ ^american<adj>$ ^citizen<n><pl>$^.<sent>$^.<sent>$' | lsx-proc english.bin
A larger example dictionary can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable/examples/new-example.dix
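Because the dictionary is ordinary XML, its entries can be listed with a short script before compiling. The following is a minimal sketch using Python's standard library; `list_entries` is a hypothetical helper, and the embedded sample is a cut-down copy of the "take away" entry (it is only parsed as XML here, so the missing `SN` paradigm definition does not matter).

```python
import xml.etree.ElementTree as ET

SAMPLE = """<dictionary type="separable">
  <section id="main" type="standard">
    <e lm="take away" c="sacar, quitar">
      <p><l>take<s n="vblex"/></l><r>take<g><b/>away</g><s n="vblex"/><s n="sep"/></r></p>
      <i><t/><j/></i>
      <par n="SN"/>
      <p><l>away<t/></l><r></r></p>
    </e>
  </section>
</dictionary>"""

def list_entries(xml_text):
    """Return (lemma, comment, gap paradigm) for each entry in every section."""
    root = ET.fromstring(xml_text)
    entries = []
    for section in root.iter("section"):
        for entry in section.findall("e"):
            par = entry.find("par")  # paradigm allowed between the two parts
            gap = par.get("n") if par is not None else None
            entries.append((entry.get("lm"), entry.get("c"), gap))
    return entries

print(list_entries(SAMPLE))
# [('take away', 'sacar, quitar', 'SN')]
```

To run it on a real file, read the file's contents and pass them to `list_entries`.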
Dictionary format
A paradigm is made up of:
A dictionary entry is made up of:
Preparedness of languages
Languages that are beta-testing the module:
- eng
Todo and bugs
- Decide whether the lsx module is part of monolingual modules, language pairs, either, or both.
- Instead of dictionary.xml and english.bin and the like, we should have standardised naming conventions. Some options/proposals:
  - eng-cat.autolsx.xml, eng-cat.autolsx.bin
  - eng-cat.autosep.lsx, eng-cat.autosep.bin
  - apertium-eng-cat.eng-cat.lsx, eng-cat.autoseq.bin
- kaz-eng
$ echo "хабар еткен" | apertium-destxt | apertium -f none -d . kaz-eng-tagger | ~/source/apertium/branches/apertium-separable/src/lsx-proc kaz-eng.autoseq.bin
^хабарет<v><tv>$ ^хабарет<v><tv><past>$^хабарет<v><tv><past><p3>$^хабарет<v><tv><past><p3><sg>$^.<sent>$[][
- /p/apertium/svn/incubator/apertium-fao-nor/apertium-fao-nor.fao-nor.dix
  - input: ^snjúgva<vblex><ind><pres><p3><sg>$ ^seg<prn><ref><acc>$ ^um<pr>$ should output: snjúgva# seg<vblex><ind><pres><p3><sg>$ ^um<pr>$
  - input: ^at<cnjsub>$ ^*leidningarnir$ ^halda<vblex><inf>$ ^fram<adv>$^,<cm>$ ^at<cnjsub>$, output: ^at<cnjsub>$ ^*leidningarnir$ ^halda# fram<vblex><adv>$^,<cm>$ ^at<cnjsub>$
    - notice the extra space and the fact that you get <vblex><adv> not <vblex><inf>
- blow# out of the water
- Report from wolfgangth:
  wolfgangth: Hi, I tested the new module for reordering separable multiwords and I have a problem if one of the entries (the last) has more then one word
  wolfgangth: before lsx-proc : ^heute Nachmittag<adv>$
  wolfgangth: after lsx-proc : ^heuteNachmittag<adv>$
  wolfgangth: the blank was lost if it was part of a rule that was executed