Apertium separable
Latest revision as of 18:54, 2 May 2024
Lttoolbox provides a module for reordering separable/discontiguous multiwords and processing them in the pipeline. Multiwords are manually written in an additional xml-format dictionary.
Installing
The module is available from the nightly repositories: apt-get install apertium-separable
If you'd like to compile it manually—e.g., for development purposes—you can follow these instructions:
Prerequisites and compilation are the same as for lttoolbox and apertium. See Installation.
The code can be found at https://github.com/apertium/apertium-separable, and instructions for compiling the module are:
./autogen.sh
./configure
make
make install
You'll need lttoolbox from git (or a release newer than 3.3.3) and its associated libraries, as well as zlib (Debian: zlib1g-dev).
Lexical transfer in the pipeline
lsx-proc runs directly after apertium-tagger and apertium-pretransfer:
… | apertium-tagger -g en-es.prob | apertium-pretransfer | lsx-proc en-es.autoseq.bin | …
(Note: this page previously said that lsx-proc runs between apertium-tagger and apertium-pretransfer. It has since been determined that it should run after apertium-pretransfer.)
Usage
Creating the lsx-dictionary
The lsx dictionary format is largely similar to that of the morphological and bilingual dictionaries (see also Apertium_New_Language_Pair_HOWTO).
We begin with a declaration of the dictionary. There is currently nothing in it, only a declaration that we want to begin a new dictionary.
<dictionary type="separable">
</dictionary>
Then add the alphabet entry. This can be empty, as the alphabet is only used for tokenisation and the lsx module runs after the text has been tokenised. Now we have:
<dictionary type="separable">
  <alphabet></alphabet>
</dictionary>
Next we need to add the symbol definitions, abbreviated to sdefs. These are the symbols that your words are tagged with, e.g. n, vblex or adj. You should be able to copy the sdefs section from your language's monodix, which will contain many more symbols than this basic example.
<dictionary type="separable">
  <alphabet></alphabet>
  <sdefs>
    <sdef n="adj"/>
    <sdef n="adv"/>
    <sdef n="n"/>
    <sdef n="sep"/>
    <sdef n="vblex"/>
  </sdefs>
</dictionary>
Now we need to add the paradigm definitions, abbreviated to pardefs. These represent patterns of word order. The following example defines patterns for adjectives, nouns, noun phrases and frequency adverbs. See the note below about the tags <w/>, <t/> and <d/>. The lemma can be represented as any characters (<w/>), as in adj and n below, or by typing out the word itself, as in freq-adv below. Pardefs can be used to build other pardefs, as in SN below. Adding paradigms to the dictionary, we get:
<dictionary type="separable">
  <alphabet></alphabet>
  <sdefs>
    ...
  </sdefs>
  <pardefs>
    <pardef n="adj"> <!-- to represent all adjectives -->
      <e><i><w/><s n="adj"/><d/></i></e> <!-- word only has the adj tag -->
      <e><i><w/><s n="adj"/><t/><d/></i></e> <!-- word has the adj tag followed by one or more other tags -->
    </pardef>
    <pardef n="n"> <!-- to represent all nouns -->
      <e><i><w/><s n="n"/><t/><d/></i></e> <!-- word has the n tag followed by one or more other tags -->
    </pardef>
    <pardef n="SN"> <!-- to represent all noun phrases -->
      <e><par n="n"/></e>
      <e><par n="adj"/><par n="n"/></e> <!-- phrase consists of an adjective followed by a noun -->
      <e><par n="adj"/><par n="adj"/><par n="n"/></e> <!-- phrase consists of two adjectives followed by a noun -->
    </pardef>
    <pardef n="freq-adv">
      <e><i>always<s n="adv"/><d/></i></e> <!-- i.e. ^always<adv>$ -->
      <e><i>annually<s n="adv"/><d/></i></e>
      <e><i>biannually<s n="adv"/><d/></i></e>
    </pardef>
  </pardefs>
</dictionary>
Finally, we add the main entries. Here is the final result of our small example dictionary:
<dictionary type="separable">
  <alphabet></alphabet>
  <sdefs>
    <sdef n="adj"/>
    <sdef n="adv"/>
    <sdef n="n"/>
    <sdef n="sep"/>
    <sdef n="vblex"/>
    <sdef n="vbser"/> <!-- used by the "be late" entry below -->
  </sdefs>
  <pardefs>
    <pardef n="adj">
      <e><i><w/><s n="adj"/><d/></i></e>
      <e><i><w/><s n="adj"/><t/><d/></i></e>
    </pardef>
    <pardef n="n">
      <e><i><w/><s n="n"/><t/><d/></i></e>
    </pardef>
    <pardef n="SN">
      <e><par n="n"/></e>
      <e><par n="adj"/><par n="n"/></e>
      <e><par n="adj"/><par n="adj"/><par n="n"/></e>
    </pardef>
    <pardef n="freq-adv">
      <e><i>always<s n="adv"/><d/></i></e>
      <e><i>annually<s n="adv"/><d/></i></e>
      <e><i>biannually<s n="adv"/><d/></i></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="be late" c="llegar tarde">
      <p><l>be<s n="vbser"/></l><r>be<g><b/>late</g><s n="vbser"/></r></p><i><t/><d/></i>
      <!-- the SAdv pardef is not defined in this small example -->
      <par n="SAdv"/><p><l>late<t/><d/></l><r></r></p>
    </e>
    <e lm="take away" c="sacar, quitar">
      <p><l>take<s n="vblex"/></l><r>take<g><b/>away</g><s n="vblex"/></r></p><i><t/><d/></i>
      <par n="SN"/><p><l>away<t/><d/></l><r></r></p>
    </e>
  </section>
</dictionary>
Note:
- <w/> stands for one or more alphabetic symbols
- <t/> stands for one or more tags (multicharacter symbols)
- <d/> stands for the word boundary symbol $
i.e.
- <e><i><w/><s n="adj"/><t/><d/></i></e> is equivalent to any-one-or-more-chars<adj><required-anytag><...optional-anytag...><$>
  - ^tall<adj><sint><...>$
- <e><i><w/><s n="adj"/><d/></i></e> is equivalent to any-one-or-more-chars<adj><$>
  - ^tall<adj>$
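As a rough illustration of the wildcards described in the note, the two adj patterns can be approximated with regular expressions. This is only a sketch for building intuition: lsx-comp compiles entries into a transducer rather than regular expressions, and the names W, T, D, ADJ_WITH_TAGS, ADJ_ONLY and matches are made up for this example.

```python
import re

# Rough regex stand-ins for the lsx wildcards (an approximation only):
W = r"[^\W\d_]+"      # <w/>: one or more alphabetic symbols
T = r"(?:<[^<>]+>)+"  # <t/>: one or more tags
D = r"\$"             # <d/>: the word boundary symbol $

# <e><i><w/><s n="adj"/><t/><d/></i></e>
ADJ_WITH_TAGS = re.compile(r"\^" + W + "<adj>" + T + D)
# <e><i><w/><s n="adj"/><d/></i></e>
ADJ_ONLY = re.compile(r"\^" + W + "<adj>" + D)


def matches(pattern, unit):
    """Return True if the whole lexical unit fits the pattern."""
    return pattern.fullmatch(unit) is not None
```

Under these assumptions, ^tall<adj><sint>$ fits only the first pattern and ^tall<adj>$ fits only the second, which is why the adj pardef lists both entries.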
A larger example dictionary can be found at https://github.com/apertium/apertium-separable/blob/master/examples/apertium-eng-spa.eng-spa.lsx.
The lsx dictionary file names are of the form apertium-A-B.A-B.lsx, where apertium-A-B is the name of the language pair. For example, apertium-eng-cat.eng-cat.lsx is the lsx dictionary for the eng-cat pair. The names of the compiled binaries are of the form A-B.autoseq.bin, for example eng-cat.autoseq.bin.
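The naming convention can be spelled out with a small helper. This is a sketch only: the function name lsx_file_names is hypothetical, and the binary name follows the eng-cat.autoseq.bin example given above.

```python
def lsx_file_names(src, trg):
    """Build the lsx source and binary file names for a language pair."""
    pair = f"{src}-{trg}"
    return {
        "source": f"apertium-{pair}.{pair}.lsx",  # e.g. apertium-eng-cat.eng-cat.lsx
        "binary": f"{pair}.autoseq.bin",          # e.g. eng-cat.autoseq.bin
    }
```

For the eng-cat pair this gives apertium-eng-cat.eng-cat.lsx as the dictionary and eng-cat.autoseq.bin as the compiled binary.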
Compilation
Compilation into the binary format is done with the lsx-comp program. Specifying lr as the mode will produce an analyser, and rl will produce a generator.
$ lsx-comp lr apertium-eng-spa.eng-spa.lsx eng-spa.autoseq.bin
main@standard 61 73
Processing
Processing is done with the lsx-proc program. Its input is the output of apertium-tagger followed by apertium-pretransfer:
$ echo '^take<vblex><imp>$ ^prpers<prn><obj><p3><nt><sg>$ ^out of<pr>$ ^there<adv>$^.<sent>$' | lsx-proc eng-spa.autoseq.bin
^take# out<vblex><sep><imp>$ ^prpers<prn><obj><p3><nt><sg>$ ^of<pr>$ ^there<adv>$^.<sent>$
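To inspect what goes in and out of lsx-proc, it can help to break a stream into its lexical units. The following is a minimal sketch, not part of apertium-separable: parse_stream is a hypothetical helper, it ignores escaped characters and superblanks, and it treats everything before the first tag as the lemma.

```python
import re

UNIT = re.compile(r"\^(.*?)\$")  # one lexical unit: ^ ... $
TAG = re.compile(r"<([^<>]+)>")  # one tag: <...>


def parse_stream(stream):
    """Return a list of (lemma, [tags]) pairs, one per lexical unit."""
    units = []
    for body in UNIT.findall(stream):
        first_tag = body.find("<")
        lemma = body if first_tag == -1 else body[:first_tag]
        units.append((lemma, TAG.findall(body)))
    return units
```

Applied to the output line above, the first unit comes back with the lemma take# out and the tags vblex, sep and imp, which makes the change made by lsx-proc easy to spot.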
Example usages
Example #1: a sentence in plain text:
The Aragonese took Ramiro out of a monastery and made him king.
This is the output of feeding the sentence through apertium-tagger and then apertium-pretransfer:
^the<det><def><sp>$ ^Aragonese<n><sg>$ ^take<vblex><past>$ ^Ramiro<np><ant><m><sg>$ ^out of<pr>$ ^a<det><ind><sg>$ ^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$
This is the output of feeding the output above through lsx-proc with apertium-eng-spa.eng-spa.lsx:
^the<det><def><sp>$ ^Aragonese<n><sg>$ ^take# out<vblex><sep><past>$ ^Ramiro<np><ant><m><sg>$ ^of<pr>$ ^a<det><ind><sg>$ ^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$
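For intuition about this kind of reordering, here is a toy re-implementation of the simpler "take away" entry from the small example dictionary. It is not how lsx-proc works (lsx-proc applies the compiled transducer), and it simplifies in several ways: the function names are made up, the tag checks only look at the first tag, and blanks between units are normalised to single spaces.

```python
import re

UNIT = re.compile(r"\^(.*?)\$")  # one lexical unit: ^ ... $


def split_unit(body):
    """Split 'lemma<t1><t2>' into ('lemma', '<t1><t2>')."""
    i = body.find("<")
    return (body, "") if i == -1 else (body[:i], body[i:])


def join_take_away(stream):
    """Toy version of the 'take away' entry: take + noun phrase + away."""
    units = [split_unit(body) for body in UNIT.findall(stream)]
    out = []
    i = 0
    while i < len(units):
        lemma, tags = units[i]
        if lemma == "take" and tags.startswith("<vblex>"):
            j = i + 1
            # SN pardef: zero, one or two adjectives ...
            while j < len(units) and j - i <= 2 and units[j][1].startswith("<adj>"):
                j += 1
            # ... followed by a noun, then the particle "away"
            if (j + 1 < len(units)
                    and units[j][1].startswith("<n>")
                    and units[j + 1][0] == "away"):
                out.append(("take# away", tags))  # particle joins the verb lemma
                out.extend(units[i + 1:j + 1])    # keep the noun phrase
                i = j + 2                         # drop the separate "away" unit
                continue
        out.append((lemma, tags))
        i += 1
    return " ".join(f"^{lemma}{tags}$" for lemma, tags in out)
```

The particle is attached to the verb lemma, the noun phrase in between is kept, and the stand-alone particle unit disappears, which mirrors the shape of the <l> and <r> sides of the entry.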
Matching forms[edit]
You can also use lsx-proc on readings that include forms. Run apertium-tagger
with -p
to ensure forms are not stripped off, and run lsx-proc
with -p
to enable analysing forms. Use <f/>
to match the /
between form and reading:
$ cat sep.lsx
<?xml version="1.0" encoding="UTF-8"?>
<dictionary type="separable">
  <alphabet></alphabet>
  <sdefs>
    <sdef n="np"/>
    <sdef n="pr"/>
    <sdef n="vblex"/>
    <sdef n="adv"/>
  </sdefs>
  <pardefs>
    <pardef n="reading" c="match and keep readings (incl. tagless/unknown). Includes end delimiter">
      <e> <i><f/><w/><d/></i> </e>
      <e> <i><f/><w/><t/><d/></i> </e>
    </pardef>
    <pardef n="reading:" c="match and drop readings (incl. tagless/unknown). Includes end delimiter">
      <e><p><l><f/><w/><d/></l> <r/></p></e>
      <e><p><l><f/><w/><t/><d/></l><r/></p></e>
    </pardef>
    <pardef n="pr|jf" c="includes end delimiter">
      <e><i><w/><f/><w/><s n="pr"/><t/><d/></i></e>
      <e><i>jf.</i> <par n="reading"/></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e c="merge">
      <par n="pr|jf"/>
      <p><l>lov</l> <r></r></p> <par n="reading:"/>
      <p><l>om</l> <r></r></p> <par n="reading:"/>
      <p><l>kake</l> <r></r></p> <par n="reading:"/>
      <p><l></l> <r>lov<b/>om<b/>kake<s n="np"/><d/></r></p>
    </e>
    <e c="split">
      <p><l>wouldnae</l> <r></r></p> <par n="reading:"/>
      <p><l></l> <r>would<f/>will<s n="vblex"/><d space="no"/></r></p>
      <p><l></l> <r>nae<f/>not<s n="adv"/><d/></r></p>
    </e>
    <!-- The "reading" pardefs above will match all readings (even unknowns), so they really just filter on form.
         But we could easily match on lemmas/tags as well, e.g. the below entry would turn
         ^aint/benot<vblex><adv>$ into ^ai/be<vblex>$ ^nt/not<adv>$
         but would not match ^aint/havenot<vblex><adv>$ -->
    <e c="require certain lemma/tags">
      <p><l>aint<f/>benot<s n="vblex"/><s n="adv"/><d/></l> <r></r></p>
      <p><l></l> <r>ai<f/>be<s n="vblex"/><d space="no"/></r></p>
      <p><l></l> <r>nt<f/>not<s n="adv"/><d/></r></p>
    </e>
  </section>
</dictionary>
$ lsx-comp lr sep.lsx sep.bin
main@standard 99 118
$ echo '^jf./jf.<pr>$ ^lov/lov<n><m>$ ^om/om<pr>$ ^kake/kake<n><f>$' | lsx-proc -w -p sep.bin
^jf./jf.<pr>$ ^lov/lov om kake<np>$
$ echo '^wouldnae/willnot<vblex><adv>$' | lsx-proc -w -p sep.bin
^would/will<vblex>$^nae/not<adv>$
$ echo '^wouldnae/*wouldnae$' | lsx-proc -w -p sep.bin
^would/will<vblex>$^nae/not<adv>$
$ echo '^aint/benot<vblex><adv>$' | lsx-proc -w -p sep.bin
^ai/be<vblex>$^nt/not<adv>$
$ echo '^aint/havenot<vblex><adv>$' | lsx-proc -w -p sep.bin
^aint/havenot<vblex><adv>$
NB: If you are using HFST to create your lsx binary, you will need to run lt-comp with the -S option on your ATT file, e.g.
lt-comp -S lr sep.att sep.bin
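The form/reading layout that <f/> matches in this section can be made concrete with a short sketch that splits one lexical unit at the / separator. The names `Reading` and `split_unit` are hypothetical, and the code assumes a single reading per unit (as after tagging) with no escaped `/`, `<` or `>` characters.

```python
import re
from typing import List, NamedTuple


class Reading(NamedTuple):
    form: str        # surface form, left of the "/"
    lemma: str       # lemma, right of the "/" up to the first tag
    tags: List[str]  # tag names without angle brackets


def split_unit(unit: str) -> Reading:
    """Split a ^form/lemma<tag>...$ unit into its parts.

    Assumes exactly one reading and no escaped characters.
    """
    body = unit.strip()
    if body.startswith("^") and body.endswith("$"):
        body = body[1:-1]
    # The first "/" is the boundary that <f/> matches in the dictionary.
    form, _, reading = body.partition("/")
    lemma = reading.split("<", 1)[0]
    tags = re.findall(r"<([^>]+)>", reading)
    return Reading(form, lemma, tags)


print(split_unit("^wouldnae/willnot<vblex><adv>$"))
print(split_unit("^wouldnae/*wouldnae$"))
```

The second call shows why the "reading" pardefs in the example dictionary also match unknown words: an unknown unit still has a form and a tagless reading.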
Troubleshooting
Undefined symbol
This error usually means that your dictionary uses a symbol that is not defined in the sdefs section. Add the missing symbol to the sdefs.
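One way to track down the offending symbol is to compare the symbols used in `<s n="..."/>` elements against those declared in `<sdef n="..."/>` elements. The sketch below does this with Python's standard XML parser; `undefined_symbols` is a hypothetical helper, not part of apertium-separable, and the sample dictionary is a cut-down illustration.

```python
import xml.etree.ElementTree as ET
from typing import Set


def undefined_symbols(lsx_xml: str) -> Set[str]:
    """Return symbols used in <s n="..."/> but missing from <sdefs>."""
    root = ET.fromstring(lsx_xml)
    defined = {sdef.get("n") for sdef in root.iter("sdef")}
    used = {s.get("n") for s in root.iter("s")}
    return used - defined


# Illustrative dictionary: "adv" is used but never declared.
SAMPLE = """<dictionary type="separable">
  <alphabet></alphabet>
  <sdefs>
    <sdef n="vblex"/>
  </sdefs>
  <section id="main" type="standard">
    <e><p><l>aint<f/>benot<s n="vblex"/><s n="adv"/><d/></l><r></r></p></e>
  </section>
</dictionary>"""

print(sorted(undefined_symbols(SAMPLE)))
```

For a dictionary on disk, the same check could be run on the root returned by `ET.parse(path).getroot()`.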
Future work
Offloading multiwords from transducers to lsx
In theory we're offloading multiwords from the transducers to lsx. This leaves open some questions:
- how do we do N N compounds with lsx?
- how does translation to a multiword work? In theory it's possible to invert the transducer, but an attempt to try this results in a transducer that looks right but silently fails to apply to input. Also, it will need to be able to handle the output of transfer. —Firespeaker (talk) 00:02, 1 September 2017 (CEST)
Recycling dictionaries and/or paradigms
lsx dictionaries are packaged per language pair. Most of the eng-spa lsx dictionary could be reused by eng-cat. Could we make use of this similarity?
Beta testing
Support for language pairs: the module has not yet had extensive beta testing. The following language pairs have packaged the lsx module:
- eng-cat
- eng-deu (?)
- kaz-kir
To do: beta test with more language pairs.
Transfer-like super powers
- Transfer-like capabilities for the lexicon (super powers). E.g., gustar / like
The one-to-many bug
Given the following lsx file:
<dictionary type="sequential">
  <alphabet>АӘБВГҒДЕЁЖЗИІЙКҚЛМНҢОӨПРСТУҰҮФХҺЦЧШЩЬЫЪЭЮЯаәбвгғдеёжзиійкқлмнңоөпрстуұүфхһцчшщьыъэюя</alphabet>
  <sdefs>
    <sdef n="adj"/>
    <sdef n="adv"/>
    <sdef n="n"/>
    <sdef n="nom"/>
    <sdef n="dat"/>
    <sdef n="v"/>
  </sdefs>
  <pardefs>
    <pardef n="adj">
      <e><i><w/><s n="adj"/><d/></i></e>
      <e><i><w/><s n="adj"/><t/><d/></i></e>
    </pardef>
    <pardef n="n">
      <e><i><w/><s n="n"/><t/><d/></i></e>
    </pardef>
    <pardef n="SN">
      <e><par n="n"/></e>
      <e><par n="adj"/><par n="n"/></e>
      <e><par n="adj"/><par n="adj"/><par n="n"/></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="кабарда" c="хабар ет">
      <p><l>хабар<b/>ет<s n="v"/></l> <r>хабар<s n="n"/><s n="nom"/><d/>ет<s n="v"/></r></p><i><t/><d/></i>
    </e>
    <e lm="абайла" c="абай бол">
      <p><l>абай<b/>бол<s n="v"/></l> <r>абай<s n="adj"/><d/>бол<s n="v"/></r></p><i><t/><d/></i>
    </e>
    <e lm="абайла" c="абай бол">
      <p><l>абай<b/>бол<s n="v"/></l> <r>абай<s n="adj"/><d/>бол<s n="v"/></r></p><i><t/>+ма<t/><d/></i>
      <!-- p><l>абай<s n="adj"/><d/>бол<s n="v"/><t/></l> <r>абай<b/>бол<s n="v"/><t/></r></p -->
    </e>
    <e lm="сууга түш" c="шомылда">
      <p><l>сууга<b/>түш<s n="v"/></l> <r>суу<s n="n"/><s n="dat"/><d/>түш<s n="v"/></r></p><i><t/><d/></i>
    </e>
  </section>
</dictionary>
and the following code to compile it (where $(PREFIX1) is kaz-kir and $(PREFIX2) is kir-kaz and $(BASENAME) is apertium-kaz-kir; the above file is apertium-kaz-kir.kir-kaz.lsx):
$(PREFIX1).autoseq.bin: $(BASENAME).$(PREFIX1).lsx
	lsx-comp lr $< $@

$(PREFIX2).autoseq.bin: $(BASENAME).$(PREFIX2).lsx
	lsx-comp lr $< $@

$(PREFIX1).revautoseq.bin: $(BASENAME).$(PREFIX1).lsx
	lsx-comp rl $< $@

$(PREFIX2).revautoseq.bin: $(BASENAME).$(PREFIX2).lsx
	lsx-comp rl $< $@
Expected output: with lr compilation, we expect the following behaviour:
$ echo "^хабар ет<v><iv><ifi><p1><sg>$" | lsx-proc kaz-kir.autoseq.bin
^хабар<n><nom>$ ^ет<v><iv><ifi><p1><sg>$
and
$ echo "^хабар<n><nom>$ ^ет<v><iv><ifi><p1><sg>$" | lsx-proc kaz-kir.autoseq.bin
^хабар<n><nom>$ ^ет<v><iv><ifi><p1><sg>$
With rl compilation (output named revautoseq), we instead expect the following behaviour:
$ echo "^хабар<n><nom>$ ^ет<v><iv><ifi><p1><sg>$" | lsx-proc kaz-kir.revautoseq.bin
^хабар ет<v><iv><ifi><p1><sg>$
and
$ echo "^хабар ет<v><iv><ifi><p1><sg>$" | lsx-proc kaz-kir.revautoseq.bin
^хабар ет<v><iv><ifi><p1><sg>$
See also
- Apertium system architecture
- GSOC project proposal, workplan, report
- /GCI_2017