Difference between revisions of "Apertium separable"
==Resolved issues==
* kaz-eng
<pre>
$ echo "хабар еткен" | apertium-destxt | apertium -f none -d . kaz-eng-tagger | ~/source/apertium/branches/apertium-separable/src/lsx-proc kaz-eng.autoseq.bin
^хабарет<v><tv>$ ^хабарет<v><tv><past>$^хабарет<v><tv><past><p3>$^хабарет<v><tv><past><p3><sg>$^.<sent>$[][
</pre>
* kaz-kir
15:35 firespeaker: http://svn.code.sf.net/p/apertium/svn/nursery/apertium-kaz-kir/apertium-kaz-kir.kaz-kir.lsx <br/>
15:35 firespeaker: with input ^абай<adj>$ ^бол<v><iv><imp><p2><sg>$
* deu
wolfgangth Hi, I tested the new module for reordering separable multiwords and I have a problem if one of the entries (the last) has more than one word <br/>
wolfgangth before lsx-proc : ^heute Nachmittag<adv>$ <br/>
wolfgangth after lsx-proc : ^heuteNachmittag<adv>$ <br/>
wolfgangth the blank was lost if it was part of a rule that was executed <br/>
* /p/apertium/svn/incubator/apertium-fao-nor/apertium-fao-nor.fao-nor.dix
<pre>
input: ^snjúgva<vblex><ind><pres><p3><sg>$ ^seg<prn><ref><acc>$ ^um<pr>$
should output: snjúgva# seg<vblex><ind><pres><p3><sg>$ ^um<pr>$

input: ^at<cnjsub>$ ^*leidningarnir$ ^halda<vblex><inf>$ ^fram<adv>$^,<cm>$ ^at<cnjsub>$,
output: ^at<cnjsub>$ ^*leidningarnir$ ^halda# fram<vblex><adv>$^,<cm>$ ^at<cnjsub>$

notice the extra space and the fact that you get <vblex><adv> not <vblex><inf>
</pre>
* +
<pre>
16:35 firespeaker: $ echo "абай болмайсың ба" | apertium -d . kaz-kir-autoseq
16:35 firespeaker: ^абай<adj>$ ^бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$
16:35 firespeaker: oh, it's probably the +
16:35 firespeaker: seems to be okay with everything else
16:36 firespeaker: we'll need to ask spectie how we want to be dealing with this
16:38 irene_: what's the expected output?
16:39 begiak: apertium: jonorthwash * 81610: /nursery/apertium-kaz-kir/: Makefile.am, apertium-kaz-kir.kaz-kir.dix and 2 other files: kaz-kir-autoseq mode
16:39 irene_: of ^абай<adj>$ ^бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$
16:39 firespeaker: ^абай бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$ I guess
</pre>
* append <j/> with <t/>? => no
* append <j/> with every </e> in lsx-comp, instead of writing the final <j/> in the dictionary => no, having lsx-comp append <j/> messes with paradigms
* have the language-data writer write it explicitly in the .lsx file.
* lsx-comp doesn't register loop for ANY_TAG when in pair, only when in identity => fixed in matchTransduction()
** blow# out of the water, be# oppose to
** fao-nor
<pre>
<e lm="snjúgva seg um" c="">
  <p><l>snjúgva<s n="vblex"/></l><r>snjúgva<g><b/>seg</g><s n="vblex"/></r></p>
  <i><t/><j/></i>
  <p><l>seg<s n="prn"/><t/><j/>um<s n="pr"/></l><r>um<s n="pr"/></r></p>
  <i><j/></i>
</e>

$ lt-print fao-nob.autoseq.bin
0 1 s s
1 2 n n
2 3 j j
3 4 ú ú
4 5 g g
5 6 v v
6 7 a a
7 8 <vblex> #
8 9 ε
9 10 ε s
10 11 ε e
11 12 ε g
12 13 ε <vblex>
13 14 <ANY_TAG> <ANY_TAG>
14 14 <ANY_TAG> <ANY_TAG>
14 15 <$> <$>
15 16 s u
16 17 e m
17 18 g <pr>
18 19 <prn> ε
19 20 <ANY_TAG> ε
20 21 <$> ε
21 22 u ε
22 23 m ε
23 24 <pr> ε
24 25 <$> <$>
25

$ echo "^snjúgva<vblex><ind><pres><p3><sg>$ ^seg<prn><ref><acc>$ ^um<pr>$" | ~/source/apertium/branches/apertium-separable/src/lsx-proc fao-nob.autoseq.bin
^snjúgva<vblex><ind><pres><p3><sg>$ ^seg<prn><ref><acc>$ ^um<pr>$

<e lm="halda fram, at" c="">
  <p><l>halda<s n="vblex"/></l><r>halda<g><b/>fram</g><s n="vblex"/></r></p>
  <i><t/><j/></i>
  <p><l>fram<s n="adv"/><j/>,<s n="cm"/><j/>at<s n="cnjsub"/></l><r>,<s n="cm"/><j/>at<s n="cnjsub"/><j/></r></p>
  <i><j/></i>
</e>

$ echo "^at<cnjsub>$ ^*leidningarnir$ ^halda<vblex><inf>$ ^fram<adv>$^,<cm>$ ^at<cnjsub>$" | ~/source/apertium/branches/apertium-separable/src/lsx-proc fao-nob.autoseq.bin
^at<cnjsub>$ ^*leidningarnir$ ^halda# fram<vblex><inf>$^,<cm>$ ^at<cnjsub>$ ^$
</pre>
Revision as of 03:10, 29 August 2017
Lttoolbox provides a module for reordering separable/discontiguous multiwords and processing them in the pipeline. Multiwords are manually written in an additional xml-format dictionary.
Installing
Prerequisites and compilation are the same as for lttoolbox and apertium. See Installation. On Debian/Ubuntu derivatives, it is part of the nightly repo and can be installed with apt-get install apertium-separable.
The code can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable and the instructions for compiling the module are:
 ./autogen.sh
 ./configure
 make
 make install
It is not currently part of distributed Apertium binaries for other distros/OSs.
Lexical transfer in the pipeline
lsx-proc runs directly after apertium-pretransfer, which itself follows apertium-tagger. Note: this page previously said that lsx-proc runs between apertium-tagger and apertium-pretransfer; it has since been determined that it should run after apertium-pretransfer.
… | apertium-tagger -g en-es.prob | apertium-pretransfer | lsx-proc en-es.autoseq.bin | …
Usage
Creating the lsx-dictionary
Make a dictionary file:
<dictionary type="separable">
  <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="adj"/>
    <sdef n="adv"/>
    <sdef n="n"/>
    <sdef n="sep"/>
    <sdef n="vblex"/>
    <sdef n="vbser"/>
  </sdefs>
  <pardefs>
    <pardef n="adj">
      <e><i><w/><s n="adj"/><j/></i></e>
      <e><i><w/><s n="adj"/><t/><j/></i></e>
    </pardef>
    <pardef n="n">
      <e><i><w/><s n="n"/><t/><j/></i></e>
    </pardef>
    <pardef n="SN">
      <e><par n="n"/></e>
      <e><par n="adj"/><par n="n"/></e>
      <e><par n="adj"/><par n="adj"/><par n="n"/></e>
    </pardef>
    <pardef n="freq-adv">
      <e><i>always<s n="adv"/><j/></i></e>
      <e><i>annually<s n="adv"/><j/></i></e>
      <e><i>biannually<s n="adv"/><j/></i></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <!-- "SAdv" is referenced below but not defined in this excerpt; it needs a matching <pardef> -->
    <e lm="be late" c="llegar tarde">
      <p><l>be<s n="vbser"/></l><r>be<g><b/>late</g><s n="vbser"/><s n="sep"/></r></p><i><t/><j/></i>
      <par n="SAdv"/><p><l>late<t/><j/></l><r></r></p>
    </e>
    <e lm="take away" c="sacar, quitar">
      <p><l>take<s n="vblex"/></l><r>take<g><b/>away</g><s n="vblex"/><s n="sep"/></r></p><i><t/><j/></i>
      <par n="SN"/><p><l>away<t/><j/></l><r></r></p>
    </e>
  </section>
</dictionary>
Note:
- <w/> stands for one or more alphabetic symbols.
- <t/> stands for one or more tags (multicharacter symbols).
i.e.
<e><w/><s n="adj"/><t/><j/></e> is equivalent to any-one-or-more-chars<adj><required-anytag><...optional-anytag...><$>
- ^tall<adj><sint><...>$
<e><w/><s n="adj"/><j/></e> is equivalent to any-one-or-more-chars<adj><$>
- ^tall<adj>$
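The two adjective patterns in the note above (one or more letters, the adj tag, then either at least one further tag or nothing before the closing $) can be approximated with regular expressions. This is only an illustrative sketch: lsx-comp compiles the dictionary into a transducer, not into regular expressions, and the function names matches_adj_with_tags and matches_bare_adj are hypothetical helpers invented for this example. The letter class assumes the ASCII alphabet used in the example dictionary.

```shell
# Approximation only: lsx-comp builds a transducer, not a regex.
# Both function names are hypothetical helpers for this illustration.

# Word + <adj> + one or more further tags, then the closing $
matches_adj_with_tags() {
    printf '%s\n' "$1" | grep -Eq '^\^[A-Za-z]+<adj>(<[^<>]+>)+\$$'
}

# Word + <adj> with no further tags, then the closing $
matches_bare_adj() {
    printf '%s\n' "$1" | grep -Eq '^\^[A-Za-z]+<adj>\$$'
}

for lu in '^tall<adj><sint>$' '^tall<adj>$'; do
    if matches_adj_with_tags "$lu"; then echo "$lu: adj followed by further tags"; fi
    if matches_bare_adj "$lu"; then echo "$lu: bare adj"; fi
done
```

Each lexical unit matches exactly one of the two approximations, which is why the adj paradigm in the example dictionary lists both entries.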
A larger example dictionary can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable/examples/apertium-eng-spa.eng-spa.lsx
Compilation
Compilation into the binary format is achieved by means of the lsx-comp program.
 $ lsx-comp apertium-eng-spa.eng-spa.lsx eng-spa.autoseq.bin
 main@standard 61 73
Processing
Processing can be done using the lsx-proc program.
The input to lsx-proc is the output of apertium-tagger and apertium-pretransfer:
 $ echo '^take<vblex><imp>$ ^prpers<prn><obj><p3><nt><sg>$ ^out of<pr>$ ^there<adv>$^.<sent>$' | lsx-proc eng-spa.autoseq.bin
 ^take# out<vblex><sep><imp>$ ^prpers<prn><obj><p3><nt><sg>$ ^of<pr>$ ^there<adv>$^.<sent>$
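When inspecting lsx-proc output it can help to see one lexical unit per line. The sketch below does this with grep alone; list_units is a hypothetical helper name, not part of apertium-separable, and it assumes that no lexical unit contains an escaped $ and that grep supports -o (GNU and BSD grep do).

```shell
# Print each lexical unit (^...$) of an Apertium stream on its own line.
# list_units is a hypothetical helper; it assumes units contain no escaped "$".
list_units() {
    grep -o '\^[^$]*\$'
}

# Sample stream taken from the lsx-proc output shown above.
echo '^take# out<vblex><sep><imp>$ ^prpers<prn><obj><p3><nt><sg>$ ^of<pr>$ ^there<adv>$^.<sent>$' | list_units
```

For the sample stream this prints five lines, starting with the merged unit ^take# out<vblex><sep><imp>$.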
Example usages
Example #1: A sentence in plain text,
The Aragonese took Ramiro out of a monastery and made him king.
This is the output of feeding the sentence through apertium-tagger:
^the<det><def><sp>$ ^Aragonese<n><sg>$ ^take<vblex><past>$ ^Ramiro<np><ant><m><sg>$ ^out of<pr>$ ^a<det><ind><sg>$ ^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$
This is the output of feeding the output above through lsx-proc with apertium-eng-spa.eng-spa.lsx:
^the<det><def><sp>$ ^Aragonese<n><sg>$ ^take# out<vblex><sep><past>$ ^Ramiro<np><ant><m><sg>$ ^of<pr>$ ^a<det><ind><sg>$ ^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$
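To see which lexical units a run of lsx-proc actually changed, the streams before and after can be compared position by position. The following is a minimal sketch, assuming both streams contain the same number of lexical units (as in the example above) and that no unit contains a tab or an escaped $. The name changed_units is a hypothetical helper, not a tool shipped with apertium-separable.

```shell
# Print the lexical units that differ between two streams, pairing them
# by position. changed_units is a hypothetical helper for illustration.
changed_units() {
    # $1: stream before lsx-proc, $2: stream after lsx-proc
    before=$(mktemp)
    after=$(mktemp)
    printf '%s\n' "$1" | grep -o '\^[^$]*\$' > "$before"
    printf '%s\n' "$2" | grep -o '\^[^$]*\$' > "$after"
    # paste joins line N of both files with a tab; awk keeps differing pairs
    paste "$before" "$after" | awk -F '\t' '$1 != $2 { print $1 " -> " $2 }'
    rm -f "$before" "$after"
}

# Shortened version of the example sentence above.
changed_units \
    '^take<vblex><past>$ ^Ramiro<np><ant><m><sg>$ ^out of<pr>$' \
    '^take# out<vblex><sep><past>$ ^Ramiro<np><ant><m><sg>$ ^of<pr>$'
```

On the shortened sample, only the verb and the preposition are reported: the verb gains "# out" and the sep tag, and "out of" is reduced to "of".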
Naming Convention
The dictionary source is named like apertium-eng-cat.eng-cat.lsx, and the compiled binary like eng-cat.autoseq.bin.
Troubleshooting
Segmentation fault
The lsx-dictionary compiles fine with zero entries but causes a segmentation fault once entries are added.
The error appears on (Linux machine?) ... No solution has been found yet.