https://wiki.apertium.org/w/api.php?action=feedcontributions&user=Thatprogrammer&feedformat=atomApertium - User contributions [en]2024-03-29T02:00:55ZUser contributionsMediaWiki 1.34.1https://wiki.apertium.org/w/index.php?title=Apertium_separable&diff=64927Apertium separable2017-12-13T01:12:24Z<p>Thatprogrammer: /* Compilation */ Added info on reverse compilation</p>
<hr />
<div>{{TOCD}}<br />
<br />
[[Lttoolbox]] provides a module for reordering separable/discontiguous multiwords and processing them in the pipeline. Multiwords are manually written in an additional xml-format dictionary.<br />
<br />
==Installing==<br />
Prerequisites and compilation are the same as lttoolbox and apertium. See [[Installation]]. On Debian/Ubuntu derivatives, it is part of the nightly repo as <code>apt-get install apertium-separable</code>.<br />
<br />
The code can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable and instructions for compiling the module are:<br />
<br />
<pre><br />
./autogen.sh<br />
./configure<br />
make<br />
make install<br />
</pre><br />
<br />
You'll need lttoolbox from SVN (or a release newer than 3.3.3, the current release at the time of writing) and its associated libraries, plus zlib (Debian: zlib1g-dev).<br />
<br />
<s>It is not currently part of distributed Apertium binaries for other distros/OSs.</s> It is now available via the nightly repositories as the <code>apertium-separable</code> module.<br />
<br />
==Lexical transfer in the pipeline==<br />
lsx-proc runs directly after both apertium-tagger and apertium-pretransfer: <br/><br />
(Note: this page previously said that lsx-proc runs between apertium-tagger and apertium-pretransfer. It has since been determined that it should run after pretransfer.)<br />
<br />
<pre><br />
… | apertium-tagger -g en-es.prob | apertium-pretransfer | lsx-proc en-es.autoseq.bin | …<br />
</pre><br />
<br />
==Usage==<br />
<br />
===Creating the lsx-dictionary===<br />
The lsx dictionary format is largely similar to those of the [[Morphological_dictionary | morphological]] and [[Bilingual_dictionary | bilingual]] dictionaries. (see also: [[Apertium_New_Language_Pair_HOWTO]])<br />
<br />
We begin with a declaration of the dictionary. There is currently nothing in it, only a declaration that we want to begin a new dictionary.<br />
<pre><br />
<dictionary type="separable"><br />
</dictionary><br />
</pre><br />
<br />
Then add the alphabet entry. It can be empty, because the alphabet is only used for tokenisation and the lsx module runs after the text has been tokenised. Now we have:<br />
<pre><br />
<dictionary type="separable"><br />
<alphabet></alphabet> <br />
</dictionary><br />
</pre><br />
<br />
Next we need to add the symbol definitions, abbreviated to sdefs. These are the symbols that your words are tagged with, e.g. noun, verb or adj. You should be able to just copy the sdefs section from your language's monodix, which will contain many more symbols than this basic example.<br />
<pre><br />
<dictionary type="separable"><br />
<alphabet></alphabet><br />
<sdefs><br />
<sdef n="adj"/><br />
<sdef n="adv"/><br />
<sdef n="n"/><br />
<sdef n="sep"/><br />
<sdef n="vblex"/><br />
</sdefs><br />
</dictionary><br />
</pre><br />
<br />
Now we need to add the paradigm definitions, abbreviated to pardefs. These represent patterns of word order. The following example defines paradigms for adjectives, nouns, noun phrases, and frequency adverbs. See the note below about the tags {{tag|w/}}, {{tag|t/}}, {{tag|j/}}. The lemma can be represented as any characters ({{tag|w/}}), as in adj and n below, or by typing out the word itself, as in freq-adv below. Pardefs can be used to build other pardefs, as in SN below. Adding paradigms to the dictionary, we get: <br />
<pre><br />
<dictionary type="separable"><br />
<alphabet></alphabet><br />
<sdefs><br />
...<br />
</sdefs><br />
<pardefs><br />
<pardef n="adj"> <!-- to represent all adjectives --><br />
<e><i><w/><s n="adj"/><j/></i></e> <!-- word only has the adj tag --><br />
<e><i><w/><s n="adj"/><t/><j/></i></e> <!-- word has the adj tag followed by one or more other tags --><br />
</pardef><br />
<pardef n="n"> <!-- to represent all nouns --><br />
<e><i><w/><s n="n"/><t/><j/></i></e> <!-- word has the n tag followed by one or more other tags --><br />
</pardef><br />
<pardef n="SN"> <!-- to represent all noun phrases --><br />
<e><par n="n"/></e><br />
<e><par n="adj"/><par n="n"/></e> <!-- word phrase is comprised of an adjective word followed by a noun word --><br />
<e><par n="adj"/><par n="adj"/><par n="n"/></e> <!-- word phrase is comprised of two adjectives followed by a noun --><br />
</pardef><br />
<pardef n="freq-adv"><br />
<e><i>always<s n="adv"/><j/></i></e> <!-- i.e. ^always<adv>$ --><br />
<e><i>annually<s n="adv"/><j/></i></e><br />
<e><i>biannually<s n="adv"/><j/></i></e><br />
</pardef><br />
</pardefs><br />
</dictionary><br />
</pre><br />
<br />
Finally, we add the main entries. Here is the final result of our small example dictionary:<br />
<br />
<pre><br />
<dictionary type="separable"><br />
<alphabet></alphabet><br />
<sdefs><br />
<sdef n="adj"/><br />
<sdef n="adv"/><br />
<sdef n="n"/><br />
<sdef n="sep"/><br />
<sdef n="vblex"/><br />
</sdefs><br />
<pardefs><br />
<pardef n="adj"><br />
<e><i><w/><s n="adj"/><j/></i></e><br />
<e><i><w/><s n="adj"/><t/><j/></i></e><br />
</pardef><br />
<pardef n="n"><br />
<e><i><w/><s n="n"/><t/><j/></i></e><br />
</pardef><br />
<pardef n="SN"><br />
<e><par n="n"/></e><br />
<e><par n="adj"/><par n="n"/></e><br />
<e><par n="adj"/><par n="adj"/><par n="n"/></e><br />
</pardef><br />
<pardef n="freq-adv"><br />
<e><i>always<s n="adv"/><j/></i></e><br />
<e><i>annually<s n="adv"/><j/></i></e><br />
<e><i>biannually<s n="adv"/><j/></i></e><br />
</pardef><br />
</pardefs><br />
<section id="main" type="standard"><br />
<e lm="be late" c="llegar tarde"><br />
<p><l>be<s n="vbser"/></l><r>be<g><b/>late</g><s n="vbser"/></r></p><i><t/><j/></i><br />
<par n="SAdv"/><p><l>late<t/><j/></l><r></r></p><br />
</e><br />
<e lm="take away" c="sacar, quitar"><br />
<p><l>take<s n="vblex"/></l><r>take<g><b/>away</g><s n="vblex"/></r></p><i><t/><j/></i><br />
<par n="SN"/><p><l>away<t/><j/></l><r></r></p><br />
</e><br />
</section><br />
</dictionary><br />
</pre><br />
<br />
Note:<br />
<br />
* {{tag|w/}} stands for one or more alphabetic symbols<br />
* {{tag|t/}} stands for one or more tags (multicharacter symbols).<br />
* {{tag|j/}} stands for the word boundary symbol $<br />
<br />
i.e.<br />
* <code> <e><i><w/><s n="adj"/><t/><j/></i></e> </code> is equivalent to <code> any-one-or-more-chars<adj><required-anytag><...optional-anytag...><$> </code><br />
** ^tall<adj><sint><...>$<br />
* <code> <e><i><w/><s n="adj"/><j/></i></e> </code> is equivalent to <code> any-one-or-more-chars<adj><$> </code><br />
** ^tall<adj>$<br />
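<br />
As a rough illustration of the note above, the two <code>adj</code> entries can be approximated with regular expressions. This is only a sketch in Python: lsx-comp compiles the dictionary into a transducer and does not use regular expressions, and the names <code>W</code>, <code>T</code>, <code>J</code> and <code>matches_adj</code> are made up for this example. The leading <code>^</code> is added so that whole lexical units can be tested.<br />

```python
import re

# Approximate regex equivalents of the special lsx elements (illustration only).
W = r"[^\W\d_]+"       # <w/>: one or more alphabetic symbols
T = r"(?:<[^<>]+>)+"   # <t/>: one or more tags
J = r"\$"              # <j/>: the word boundary symbol $

# <e><i><w/><s n="adj"/><j/></i></e>
ADJ_ONLY = re.compile(r"\^" + W + "<adj>" + J)
# <e><i><w/><s n="adj"/><t/><j/></i></e>
ADJ_PLUS_TAGS = re.compile(r"\^" + W + "<adj>" + T + J)


def matches_adj(lexical_unit: str) -> bool:
    """True if the unit would be covered by one of the two adj entries."""
    return bool(
        ADJ_ONLY.fullmatch(lexical_unit)
        or ADJ_PLUS_TAGS.fullmatch(lexical_unit)
    )


print(matches_adj("^tall<adj>$"))
print(matches_adj("^tall<adj><sint>$"))
print(matches_adj("^house<n><sg>$"))
```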
<br />
A larger example dictionary can be found at https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable/examples/apertium-eng-spa.eng-spa.lsx<br />
<br />
The lsx dictionary file names are of the form <code> apertium-A-B.A-B.lsx </code>, where apertium-A-B is the name of the language pair. For example, the file <code>apertium-eng-cat.eng-cat.lsx</code> is the lsx dictionary for the <code> eng-cat </code> pair. The names of the compiled binaries are of the form <code> A-B.autoseq.bin </code>, for example <code> eng-cat.autoseq.bin </code>.<br />
<br />
===Compilation===<br />
Compilation into the binary format is achieved by means of the <code>lsx-comp</code> program. Specifying <code>lr</code> as the mode will produce an analyser, and <code>rl</code> will produce a generator.<br />
<br />
<pre><br />
$ lsx-comp lr apertium-eng-spa.eng-spa.lsx eng-spa.autoseq.bin<br />
main@standard 61 73<br />
</pre><br />
<br />
===Processing===<br />
Processing can be done using the lsx-proc program.<br />
<br />
The input to <code> lsx-proc </code> is the output of <code> apertium-tagger </code> followed by <code> apertium-pretransfer </code>:<br />
<br />
<pre><br />
$ echo '^take<vblex><imp>$ ^prpers<prn><obj><p3><nt><sg>$ ^out of<pr>$ ^there<adv>$^.<sent>$' | lsx-proc eng-spa.autoseq.bin<br />
^take# out<vblex><sep><imp>$ ^prpers<prn><obj><p3><nt><sg>$ ^of<pr>$ ^there<adv>$^.<sent>$<br />
</pre><br />
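<br />
To see which lexical units lsx-proc changed, the two streams can be compared unit by unit. The following Python sketch is only a helper for inspecting output and is not part of the module; the function names <code>lexical_units</code> and <code>changed_units</code> are hypothetical, and the comparison assumes both streams contain the same number of units.<br />

```python
import re


def lexical_units(stream: str) -> list[str]:
    """Return the contents of every ^...$ lexical unit in an Apertium stream."""
    return re.findall(r"\^(.*?)\$", stream)


def changed_units(before: str, after: str) -> list[tuple[int, str, str]]:
    """List (index, old, new) for units that differ between two streams."""
    pairs = zip(lexical_units(before), lexical_units(after))
    return [(i, old, new) for i, (old, new) in enumerate(pairs) if old != new]


# Input and output copied from the lsx-proc example above.
before = ("^take<vblex><imp>$ ^prpers<prn><obj><p3><nt><sg>$ "
          "^out of<pr>$ ^there<adv>$^.<sent>$")
after = ("^take# out<vblex><sep><imp>$ ^prpers<prn><obj><p3><nt><sg>$ "
         "^of<pr>$ ^there<adv>$^.<sent>$")

for index, old, new in changed_units(before, after):
    print(index, old, "->", new)
```

For this example the differences are in the verb, which gains the particle and the <code>sep</code> tag, and in the preposition, which loses <code>out</code>.<br />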
<br />
===Example usages===<br />
Example #1: <br />
A sentence in plain text:<br />
<pre><br />
The Aragonese took Ramiro out of a monastery and made him king.<br />
</pre><br />
<br />
This is the output of feeding the sentence through <code> apertium-tagger </code> and then <code> apertium-pretransfer </code>:<br />
<pre><br />
^the<det><def><sp>$ ^Aragonese<n><sg>$ ^take<vblex><past>$ ^Ramiro<np><ant><m><sg>$ ^out of<pr>$ ^a<det><ind><sg>$ ^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$<br />
</pre><br />
<br />
This is the output of feeding the output above through <code> lsx-proc </code> with apertium-eng-spa.eng-spa.lsx:<br />
<pre><br />
^the<det><def><sp>$ ^Aragonese<n><sg>$ ^take# out<vblex><sep><past>$ ^Ramiro<np><ant><m><sg>$ ^of<pr>$ ^a<det><ind><sg>$ ^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$<br />
</pre><br />
<br />
==Troubleshooting==<br />
===Segmentation fault===<br />
A segmentation fault occurs on compilation or usage. <br/><br />
The lsx-dictionary compiles fine with zero entries but gives a segmentation fault once entries are added. <br/><br />
<br />
No solution has been found yet. <br/><br />
Possibly something has not been updated, or there is a problem in the makefile (?)<br />
<br />
make sure that the makefile ...<br />
<br />
===Complaints about step_override()===<br />
Run <code>svn update</code> in lttoolbox, then <code>make</code> and <code>make install</code>. <br/><br />
You'll need an up-to-date version of lttoolbox and associated libraries, and zlib (Debian: zlib1g-dev). <br/><br />
<br />
===Undefined symbol===<br />
In your dictionary you are probably using a symbol that you didn't define in the sdefs. Add the symbol to the sdefs.<br />
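<br />
One way to find such symbols before compiling is to compare the symbols used in <code><s n="..."/></code> elements against the sdefs. The Python sketch below uses only the standard library; the function name <code>undefined_symbols</code> and the shortened sample dictionary are made up for this illustration, and the check is not part of lsx-comp.<br />

```python
import xml.etree.ElementTree as ET


def undefined_symbols(dictionary_xml: str) -> set[str]:
    """Return symbol names used in <s n="..."/> but missing from <sdefs>."""
    root = ET.fromstring(dictionary_xml)
    defined = {sdef.get("n") for sdef in root.iter("sdef")}
    used = {s.get("n") for s in root.iter("s")}
    return used - defined


# Shortened sample: the entry uses vbser, but only vblex is declared.
example = """
<dictionary type="separable">
  <alphabet></alphabet>
  <sdefs><sdef n="vblex"/></sdefs>
  <section id="main" type="standard">
    <e><p><l>be<s n="vbser"/></l><r>be<g><b/>late</g><s n="vbser"/></r></p></e>
  </section>
</dictionary>
"""

print(undefined_symbols(example))
```

Any name it reports needs a matching <code><sdef n="..."/></code> line.<br />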
<br />
==Future work==<br />
=== Offloading multiwords from transducers to lsx ===<br />
In theory we're offloading multiwords from the transducers to lsx. This leaves open some questions:<br />
* how do we do N N compounds with lsx?<br />
* how does translation ''to'' a multiword work? In theory it's possible to invert the transducer, but an attempt to try this results in a transducer that looks right but silently fails to apply to input. Also, it will need to be able to handle the output of transfer. —[[User:Firespeaker|Firespeaker]] ([[User talk:Firespeaker|talk]]) 00:02, 1 September 2017 (CEST)<br />
=== Recycling dictionaries and/or paradigms ===<br />
lsx-dictionaries are packaged per language pair. Most of the eng-spa lsx-dictionary could be reused by eng-cat. Could we make use of this similarity?<br />
<br />
=== Beta testing ===<br />
Support for language pairs: the module has not yet had extensive beta testing. The following language pairs have packaged the lsx module:<br />
* eng-cat<br />
* eng-deu (?)<br />
* kaz-kir<br />
The module should be beta tested with more language pairs.<br />
<br />
=== Transfer-like super powers ===<br />
* Transfer-like capabilities for the lexicon (super powers). E.g., gustar / like<br />
<br />
=== Inheritance and clean up code ===<br />
* [https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable/src/lsx_compiler.cc lsx-comp] was derived from the lttoolbox compiler. In the future we may want to integrate it back and have lsx-comp and lsx-proc inherit directly from lttoolbox (?) <br />
<br />
=== The one-to-many bug ===<br />
Given the following lsx file:<br />
<pre><br />
<dictionary type="sequential"><br />
<alphabet>АӘБВГҒДЕЁЖЗИІЙКҚЛМНҢОӨПРСТУҰҮФХҺЦЧШЩЬЫЪЭЮЯаәбвгғдеёжзиійкқлмнңоөпрстуұүфхһцчшщьыъэюя</alphabet><br />
<sdefs><br />
<sdef n="adj"/><br />
<sdef n="adv"/><br />
<sdef n="n"/><br />
<sdef n="nom"/><br />
<sdef n="dat"/><br />
<sdef n="v"/><br />
</sdefs><br />
<pardefs><br />
<pardef n="adj"><br />
<e><i><w/><s n="adj"/><j/></i></e><br />
<e><i><w/><s n="adj"/><t/><j/></i></e><br />
</pardef><br />
<pardef n="n"><br />
<e><i><w/><s n="n"/><t/><j/></i></e><br />
</pardef><br />
<pardef n="SN"><br />
<e><par n="n"/></e><br />
<e><par n="adj"/><par n="n"/></e><br />
<e><par n="adj"/><par n="adj"/><par n="n"/></e><br />
</pardef><br />
</pardefs><br />
<section id="main" type="standard"><br />
<e lm="кабарда" c="хабар ет"><br />
<p><l>хабар<b/>ет<s n="v"/></l><br />
<r>хабар<s n="n"/><s n="nom"/><j/>ет<s n="v"/></r></p><i><t/><j/></i><br />
</e><br />
<e lm="абайла" c="абай бол"><br />
<p><l>абай<b/>бол<s n="v"/></l><br />
<r>абай<s n="adj"/><j/>бол<s n="v"/></r></p><i><t/><j/></i><br />
</e><br />
<e lm="абайла" c="абай бол"><br />
<p><l>абай<b/>бол<s n="v"/></l><br />
<r>абай<s n="adj"/><j/>бол<s n="v"/></r></p><i><t/>+ма<t/><j/></i><br />
<!-- p><l>абай<s n="adj"/><j/>бол<s n="v"/><t/></l><br />
<r>абай<b/>бол<s n="v"/><t/></r></p --><br />
</e><br />
<e lm="сууга түш" c="шомылда"><br />
<p><l>сууга<b/>түш<s n="v"/></l><br />
<r>суу<s n="n"/><s n="dat"/><j/>түш<s n="v"/></r></p><i><t/><j/></i><br />
</e><br />
<br />
</section><br />
</dictionary><br />
</pre><br />
<br />
and the following code to compile it (where <code>$(PREFIX1)</code> is kaz-kir and <code>$(PREFIX2)</code> is kir-kaz and <code>$(BASENAME)</code> is apertium-kaz-kir; the above file is apertium-kaz-kir.kir-kaz.lsx):<br />
<br />
<pre><br />
$(PREFIX1).autoseq.bin: $(BASENAME).$(PREFIX1).lsx<br />
lsx-comp $< $@<br />
<br />
$(PREFIX2).autoseq.bin: $(BASENAME).$(PREFIX2).lsx<br />
lsx-comp $< $@<br />
<br />
$(PREFIX1).revautoseq.bin: $(BASENAME).$(PREFIX1).lsx<br />
lt-print $(PREFIX1).autoseq.bin | sed 's/ /@_SPACE_@/g' > $(PREFIX1).autoseq.att<br />
hfst-txt2fst -e ε < $(PREFIX1).autoseq.att > $(PREFIX1).autoseq.hfst<br />
hfst-invert $(PREFIX1).autoseq.hfst | hfst-minimise > $(PREFIX1).revautoseq.hfst<br />
hfst-fst2txt $(PREFIX1).revautoseq.hfst | gzip -9 -c -n > $(PREFIX1).revautoseq.att.gz<br />
zcat < $(PREFIX1).revautoseq.att.gz > $(PREFIX1).revautoseq.att<br />
sed 's/@0@/ε/g' $(PREFIX1).revautoseq.att > $(PREFIX1).revautoseq.1.att<br />
lt-comp lr $(PREFIX1).revautoseq.1.att $@<br />
<br />
<br />
$(PREFIX2).revautoseq.bin: $(BASENAME).$(PREFIX2).lsx<br />
lt-print $(PREFIX2).autoseq.bin | sed 's/ /@_SPACE_@/g' > $(PREFIX2).autoseq.att<br />
hfst-txt2fst -e ε < $(PREFIX2).autoseq.att > $(PREFIX2).autoseq.hfst<br />
hfst-invert $(PREFIX2).autoseq.hfst | hfst-minimise > $(PREFIX2).revautoseq.hfst<br />
hfst-fst2txt $(PREFIX2).revautoseq.hfst | gzip -9 -c -n > $(PREFIX2).revautoseq.att.gz<br />
zcat < $(PREFIX2).revautoseq.att.gz > $(PREFIX2).revautoseq.att<br />
sed 's/@0@/ε/g' $(PREFIX2).revautoseq.att > $(PREFIX2).revautoseq.1.att<br />
lt-comp lr $(PREFIX2).revautoseq.1.att $@<br />
<br />
</pre><br />
<br />
$ echo "^хабар ет<v><iv><ifi><p1><sg>$" | lsx-proc kaz-kir.revautoseq.bin<br />
<br />
does not seem to work. The expected output is <code>^хабар<n><nom>$ ^ет<v><iv><ifi><p1><sg>$</code>. The reverse does seem to work:<br />
<br />
$ echo "^хабар<n><nom>$ ^ет<v><iv><ifi><p1><sg>$" | lsx-proc kaz-kir.autoseq.bin<br />
<br />
Indeed outputs <code>^хабар ет<v><iv><ifi><p1><sg>$</code>.<br />
<br />
==See also==<br />
* https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable<br />
* [[Apertium system architecture]]<br />
* GSOC project [[User:Irene/proposal | proposal]], [[User:Irene/workplan | workplan]], [[Lsx_module/report | report]]<br />
[[Category:Documentation in English]]<br />
* [[/GCI_2017]]</div>Thatprogrammerhttps://wiki.apertium.org/w/index.php?title=Multiwords&diff=64896Multiwords2017-12-10T05:17:04Z<p>Thatprogrammer: Grammar change</p>
<hr />
<div>The term '''multiword''' includes simple words that have spaces in them, words with separable parts, contractions and compounds of several lemmas. Apertium supports these to varying degrees.<br />
<br />
[[Multi-mots|En français]]<br />
<br />
{{TOCD}}<br />
<br />
==Overview==<br />
[[lttoolbox]] currently has four mechanisms for creating multiwords, of varying complexity:<br />
# '''<code><b/></code>''' simply inserts a blank; use it if you want a word that has a space in it, but only inflection at the end<br />
#* <pre>entry: <e><i>record<b/>player</i><par n="house__n"/></e></pre><br />
#* <pre>analysis: ^record player/record player<n><sg>$</pre><br />
#* <pre>analysis: ^record players/record player<n><pl>$</pre><br />
# '''<code><g/></code>''' is used (in combination with <code><b/></code>) when you have inflection in the middle of the word, and an invariant part at the end<br />
#* <pre>entry: <e><i>coffee</i><par n="house__n"/><p><l><b/>with<b/>milk</l><r><g><b/>with<b/>milk</g></r></p></e></pre><br />
#* <pre>analysis: ^coffee with milk/coffee<n><sg># with milk$</pre><br />
#* <pre>after disambiguation and pre-transfer: ^coffee# with milk<n><sg>$</pre><br />
#* <pre>analysis: ^coffees with milk/coffee<n><pl># with milk$</pre><br />
#* <pre>after disambiguation and pre-transfer: ^coffee# with milk<n><pl>$</pre><br />
#* So the tags are "in the ''middle''", right after the inflection, in the analyser, but appear ''after'' the whole lemma (including the <code><g/></code> group) in the bidix<br />
# '''<code><j/></code>''' is used when you want the multiword to be split into two lexical units, each with its own analysis (set of tags), where both parts may vary independently<br />
#* <pre>entry: <e><i>wr</i><par n="wr/ite__vblex"/><p><l><b/>about</l><r><j/>about<s n="pr"/></r></p></e></pre><br />
#* <pre>analysis: ^write about/write<vblex><inf>+about<pr>/write<vblex><pres>+about<pr>$</pre><br />
#* <pre>after disambiguation and pre-transfer: ^write<vblex><inf>$ ^about<pr>$</pre><br />
#* <pre>analysis: ^writes about/write<vblex><pri><p3><sg>+about<pr>$</pre><br />
#* <pre>after disambiguation and pre-transfer: ^write<vblex><pri><p3><sg>$ ^about<pr>$</pre><br />
# '''<code>&lt;s n="compound-only-L"/&gt;</code>''' and '''<code>&lt;s n="compound-R"/&gt;</code>''' – an analysis with the compound-only-L tag in it can be the left part of a compound (many of these can chain), but can never stand alone as an analysis, while an analysis with the compound-R tag in it can be either a word on its own, or the final part of a compound.<br />
#* <pre>entry: <e><p><l>kaffe</l><r>kaffe<s n="n"/><s n="m"/><s n="sg"/><s n="ind"/><s n="cmp"/><s n="compound-only-L"/></r></p></e></pre><br />
#* <pre>entry: <e><p><l>bilet</l><r>bilete<s n="n"/><s n="nt"/><s n="sg"/><s n="ind"/><s n="cmp"/><s n="compound-only-L"/></r></p></e></pre><br />
#* <pre>entry: <e><p><l>kostnaden</l><r>kostnad<s n="n"/><s n="m"/><s n="sg"/><s n="def"/><s n="compound-R"/></r></p></e></pre><br />
#* <pre>analysis: ^kaffekostnaden/kaffe<n><m><sg><ind><cmp>+kostnad<n><m><sg><def>$</pre><br />
#* <pre>analysis: ^kaffebiletkostnaden/kaffe<n><m><sg><ind><cmp>+bilet<n><nt><sg><ind><cmp>+kostnad<n><m><sg><def>$</pre><br />
#* <pre>no analysis: ^bilet/*bilet$</pre><br />
<br />
<br />
More information on these below under [[Multiwords#Simple_usage|Simple usage]], [[Compounds]] and the [https://wiki.apertium.org/w/images/d/d0/Apertium2-documentation.pdf documentation] (esp. sec.3.1.2.6).<br />
<br />
<br />
The following multiwords are in the process of being supported:<br />
* [[User:Skh/Application_GSoC_2010|''Agreement multiwords'':]] complex multiwords where two or more parts show some sort of agreement/dependence of tags (or, where certain tag combinations are illegal)<br />
** lt-mwpp takes a file which specifies which lemma combinations are multiwords, and what tags need to agree, and generates all the legal combinations in the lttoolbox dix format<br />
* [[User:Irene/proposal|''Discontiguous multiwords'':]] multiwords with an arbitrary number of unrelated words in between, eg. the separable verbs in Germanic languages<br />
(but see hacks below)<br />
<br />
== Why make a multiword entry? ==<br />
The first thing to say about multiword translations is that they can sometimes be handled by knowing only the [[Monodix basics]] and [[Bilingual dictionary]] basics (and also the superblank '<b/>').<br />
<br />
The English verb 'roll' is often supplemented with a word for orientation/direction e.g. 'to roll over' or 'to roll down'. Within the Apertium system, we would like to treat these as 'many (two) words that make a single verb'. You might find this many-word verb in a sentence like 'the car rolled down the hill'. Perhaps a linguist may wish to study word-constructions like this further, but recognising 'roll down' as one word is a good step forward in machine translation.<br />
<br />
We can do this by creating an English monodix paradigm. Please note that this example is a little contrived, as the same effect can be achieved with nothing more than a section entry. However, showing a paradigm means the example still works for more difficult multiword verbs:<br />
<br />
<pre><br />
<pardef n="roll_down__vblex"><br />
<e><br />
<p><l><b/>down</l><r><b/>down<s n="vblex"/><s n="inf"/></r></p><br />
</e><br />
<e><br />
<p><l><b/>down</l><r><b/>down<s n="vblex"/><s n="imp"/></r></p><br />
</e><br />
<e><br />
<p><l><b/>down</l><r><b/>down<s n="vblex"/><s n="pp"/></r></p><br />
</e><br />
...<br />
</pardef><br />
</pre><br />
<br />
Note how the superblank '<b/>' is used to mark the limits of the words in the multiword verb.<br />
<br />
We can use/trigger the multiword paradigm from an English monodix 'section' entry:<br />
<br />
<pre><br />
<e lm="roll down"><i>roll</i><par n="roll_down__vblex"/></e><br />
</pre><br />
<br />
Now, if another language needs to identify 'roll down' as a special verb, the above definition can be triggered from a bidix:<br />
<br />
<pre><br />
<e><p><l>???lemma from another language???</l><r>roll<b/>down<s n="vblex"/></r></p></e><br />
</pre><br />
<br />
Note the use of the superblank '<b/>' again, this time to construct the lemma.<br />
<br />
This is a surprisingly easy and clear way to construct multiword recognition. At the time of writing, you can find examples of this method in dictionaries on Apertium. This is possibly due to the ease and clarity, or because the dictionary entries are old.<br />
<br />
So you may ask, 'why not let Apertium treat these two words as separate words?'. Apertium is a flexible system :) If you need to get the effect, various bidix/monodix entries, or a rule in the first stage of the chunker module will work.<br />
<br />
But the stream you will generate will be something like (simplified, 'chunker' stage),<br />
<br />
<pre><br />
{roll<vblex><imp>}{down<at_pr>}<br />
</pre><br />
<br />
and does not reflect the connection of the two words. We would prefer a stream that looked like,<br />
<br />
<pre><br />
{roll down<vblex><imp>}<br />
</pre><br />
<br />
The lack of connection between the words may limit us later. We will not be able to identify the two words as one unit when translating back from English. We may be using the chunker for simple connection rules, which is not what the chunker is for and makes our translation pairs confusing to read. If we want to do further manipulations on the text stream, in either direction, tracing the effects will become harder and harder. If we have used the chunker we may find it difficult to use the chunker for other purposes.<br />
<br />
By all means patch to make some progress, but this is not a good end solution.<br />
<br />
== Simple usage ==<br />
<br />
=== Simple usage of &lt;g/&gt; and &lt;b/&gt; ===<br />
The first thing to say about this solution is that it may need a transfer rule. Without the rule, translation text may be lost. If you start adding any multiword rules, it is a good idea to have the transfer rule in place. The rule is very general and can be added to most language pairs.<br />
<br />
There is an example from English to Esperanto with '''inner inflection''' followed by an invariant part with spaces.<br />
<br />
The English monodix (en.dix) contains:<br />
<pre><br />
<e lm="become"><i>bec</i><par n="bec/ome__vblex"/></e><br />
<e lm="become acquainted"><br />
<i>bec</i><br />
<par n="bec/ome__vblex"/><br />
<p><br />
<l><b/>acquainted</l> <br />
<r><g><b/>acquainted</g></r> <br />
</p><br />
</e><br />
<e lm="become acquainted with"><br />
<i>bec</i><br />
<par n="bec/ome__vblex"/><br />
<p><br />
<l><b/>acquainted<b/>with</l><br />
<r><g><b/>acquainted<b/>with</g></r><br />
</p><br />
</e><br />
</pre><br />
So become is conjugated as a normal verb and the rest is fixed (invariant). Note that <code>&lt;b/&gt;</code> is a space (blank) and that the fixed words are inside <code><g> </g></code>.<br />
<br />
When "become acquainted" is read from the analyser, the output is<br />
<pre><br />
^become acquainted/become<vblex><inf># acquainted$<br />
</pre><br />
<br />
Before lexical transfer, the "lemma queue" (<code># acquainted</code>) is put onto the lemma:<br />
<pre><br />
^become# acquainted<vblex><inf>$<br />
</pre><br />
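<br />
The movement of the lemma queue can be mimicked with a small regular-expression substitution. This Python sketch covers only the single transformation shown above and is not how apertium-pretransfer is implemented; the names <code>QUEUE</code> and <code>move_queue</code> are invented for the illustration.<br />

```python
import re

# Lemma head, one or more tags, then the invariant queue starting with "#".
QUEUE = re.compile(r"\^([^<#$]+)((?:<[^<>]+>)+)(#[^<$]*)\$")


def move_queue(stream: str) -> str:
    """Move the queue (# ...) from after the tags to right after the lemma."""
    return QUEUE.sub(r"^\1\3\2$", stream)


print(move_queue("^become<vblex><inf># acquainted$"))
```

Lexical units without a queue do not match the pattern and are left unchanged.<br />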
<br />
In Esperanto, "become" is "iĝi" (or "fariĝi"), "become acquainted" is "konatiĝi" and "become acquainted with" is "konatiĝi kun". The iĝi/konatiĝi should be conjugated according to become. Thus the bidix entries are<br />
<pre><br />
<e><p><l>iĝi<s n="vblex"/></l><r>become<s n="vblex"/></r></p></e><br />
<e><p><l>konatiĝi<s n="vblex"/></l><r>become<g><b/>acquainted</g><s n="vblex"/></r></p></e><br />
<e><p><l>konatiĝi<g><b/>kun</g><s n="vblex"/></l><r>become<g><b/>acquainted<b/>with</g><s n="vblex"/></r></p></e><br />
</pre><br />
<br />
And the eo monodix<br />
<pre><br />
<e lm="iĝi"><i>iĝ</i><par n="verb__vblex"/></e><br />
<e lm="konatiĝi"><i>konatiĝ</i><par n="verb__vblex"/></e><br />
<e lm="konatiĝi kun"><br />
<i>konatiĝ</i><br />
<par n="verb__vblex"/><br />
<p><br />
<l><b/>kun</l><br />
<r><g><b/>kun</g></r><br />
</p><br />
</e> <br />
</pre><br />
<br />
Note how the English fixed words <code><g>&lt;b/&gt;acquainted<b/>with</g></code> become <code><g>&lt;b/&gt;kun</g></code>.<br />
<br />
<br />
==== The action of <g></g> tags ====<br />
The action of <g></g> in the bidix is important. It enables us to keep the verb (or other word) in an inflection position.<br />
<br />
You may think you can do it like this, putting the tags next to the relevant word:<br />
<br />
<pre><br />
<e><p><l>???lemma from another language???</l><r>roll<s n="vblex"/><b/>down</r></p></e><br />
</pre><br />
<br />
Compare this to the simple example in [[#Why make a multiword entry?]]. The verb mark is next to the word it is changing/inflecting, and so can use the generic paradigm for 'roll' (which is 'accept__vblex'), not a dedicated paradigm. This is a step forward.<br />
<br />
However, this produces results we do not want. It recognises the output, but truncates like this,<br />
<br />
<pre><br />
rolls<br />
</pre><br />
<br />
The manual has no explanation for why the bidix parser fails on a lemma containing interfering tags. Can the parser process superblanks? Does it process up to the <vblex> tag and then abandon work? If so, it would produce the results we see: seeking a 'roll<vblex>' lemma, failing to detect a 'roll down' lemma, and failing to write 'down'.<br />
<br />
What the manual makes clear is that extra words need shifting backwards in the stream to precede the final tags, that this happens in the 'pretransfer' stage, and that the 'g' tag provides the hint. So the following code will produce our intended lemma. It places the informative tags at the end of the bidix entry and signals that the word 'down' is invariant, so the trailing <vblex> must refer to the text 'roll':<br />
<br />
<pre><br />
<e><p><l>???lemma from another language???</l><r>roll<g><b/>down</g><s n="vblex"/></r></p></e><br />
</pre><br />
<br />
We need to do the same in a mono-dictionary. If we do not, the result will again go unrecognised. So, to enable multiword verb translation,<br />
<br />
<pre><br />
<e lm="roll down"><br />
<i>roll</i><par n="accept__vblex"/><br />
<p><br />
<l><b/>down</l><br />
<r><g><b/>down</g></r><br />
</p><br />
</e><br />
</pre><br />
<br />
For a monodix, this is a long and complex line. But, you will remember, this method avoids creating a massive and non-reusable paradigm.<br />
<br />
<br />
==== The translation rule ====<br />
Before we have a translation, there is a last step. The 'g' tags may solve the problem of the unrecognised lemma, but they do nothing with the tags attached to the invariant words. The post-translation stream for 'roll down' may be,<br />
<br />
<pre><br />
^roll# down<vblex><imp>$<br />
</pre><br />
<br />
But the new monodix entry uses the 'g' tag to signal an invariant addition (there may be shortcuts for some situations? R.C.). We asked the bidix to have the tags at the end so Apertium would recognise and trigger the lemma we intended. Now Apertium needs the invariant part ''after'' the morphological tags:<br />
<br />
<pre><br />
^roll<vblex><imp># down$<br />
</pre><br />
<br />
From the manual the transfer rule (in a .t1x file) is this,<br />
<br />
<pre><br />
<rule comment="VBLEX"><br />
<pattern><br />
<pattern-item n="vblex"/><br />
</pattern><br />
<action><br />
<out><br />
<lu> <br />
<clip pos="1" side="tl" part="lemh"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
<clip pos="1" side="tl" part="lemq"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
<br />
The 'clip' entries are grabbing the 'pos=1' pattern match (here there is only one, anyway) from the translated side, then stripping the various 'parts' before reassembling them in order, ensuring the lemma tail/queue is at the end.<br />
<br />
'lemh' and 'lemq' are (unusual) predefined part-definitions, which refer to the head, e.g. 'roll', and the tail (queue), e.g. 'down', of the lemma. 'a_verb' and 'temps' are the usual def-attrs for capturing the verb category tags and the form of the verb, such as 'imperative', 'past participle' etc. <br />
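<br />
The effect of the clip sequence (lemh, then the tags, then lemq) can be sketched outside Apertium as a string rewrite. The Python below is only an illustration of the reordering, not of how the transfer module works internally; <code>UNIT</code> and <code>queue_to_end</code> are hypothetical names, and, like the rule's pattern, the function only touches units whose first tag is <code>vblex</code>.<br />

```python
import re

# Lemma head, optional queue ("# ..."), then the tags.
UNIT = re.compile(r"\^([^<#$]+)(#[^<$]*)?((?:<[^<>]+>)*)\$")


def queue_to_end(stream: str) -> str:
    """Rewrite vblex units as lemh + tags + lemq, leaving other units alone."""
    def reorder(match: re.Match) -> str:
        lemh, lemq, tags = match.group(1), match.group(2) or "", match.group(3)
        if not tags.startswith("<vblex>"):
            return match.group(0)
        return "^" + lemh + tags + lemq + "$"
    return UNIT.sub(reorder, stream)


print(queue_to_end("^roll# down<vblex><imp>$"))
```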
<br />
A verbose version of the above may be,<br />
<br />
<pre><br />
<section-def-cats><br />
<def-cat n="vblex"><br />
<cat-item tags="vblex.*"/><br />
</def-cat><br />
...<br />
<br />
</section-def-cats><br />
<br />
<section-def-attrs><br />
<def-attr n="a_lexical_verb"><br />
<attr-item tags="vblex"/><br />
</def-attr><br />
<br />
<def-attr n="a_verb_form"><br />
<attr-item tags="inf"/><br />
<attr-item tags="imp"/><br />
<attr-item tags="pp"/><br />
<attr-item tags="pprs"/><br />
<attr-item tags="ger"/><br />
<attr-item tags="subs"/><br />
<attr-item tags="pres"/><br />
<attr-item tags="past"/><br />
</def-attr><br />
<br />
<def-attr n="a_persona"><br />
<attr-item tags="p1"/><br />
<attr-item tags="p2"/><br />
<attr-item tags="p3"/><br />
</def-attr><br />
<br />
<def-attr n="a_gender"><br />
<attr-item tags="f"/><br />
<attr-item tags="m"/><br />
<attr-item tags="mf"/><br />
</def-attr><br />
<br />
<def-attr n="a_number"><br />
<attr-item tags="sg"/><br />
<attr-item tags="pl"/><br />
<attr-item tags="sp"/><br />
<attr-item tags="ND"/><br />
</def-attr><br />
...<br />
<br />
</section-def-attrs><br />
</pre><br />
<br />
<br />
<pre><br />
<rule comment="VBLEX"><br />
<pattern><br />
<pattern-item n="vblex"/><br />
</pattern><br />
<action><br />
<out><br />
<lu> <br />
<clip pos="1" side="tl" part="lemh"/><br />
<clip pos="1" side="tl" part="a_lexical_verb"/><br />
<clip pos="1" side="tl" part="a_verb_form"/><br />
<clip pos="1" side="tl" part="a_persona"/><br />
<clip pos="1" side="tl" part="a_gender"/><br />
<clip pos="1" side="tl" part="a_number"/><br />
<clip pos="1" side="tl" part="lemq"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
<br />
<br />
But your language dictionary/translation may have different needs. You may not be handling multiword verbs. You may need other categories of tagging.<br />
<br />
=== Simple usage of &lt;j/&gt; ===<br />
The documentation gives the following example from monodix:<br />
<br />
<pre><br />
<e lm="del" r="LR"> <br />
<p> <br />
<l>del</l> <br />
<r>de<s n="pr"/><j/>el<s n="det"/><s n="def"/><s n="m"/><s n="sg"/></r> <br />
</p> <br />
</e> <br />
</pre><br />
<br />
(This is marked r="LR" and so will only be used in analysis.) When "del" is read, the output from the analyser is <br />
<br />
^del/de<pr>+el<det><def><m><sg>$<br />
<br />
This is passed as-is through the tagger, but [[apertium-pretransfer]] turns it into <br />
<br />
^de<pr>$ ^el<det><def><m><sg>$<br />
<br />
before bidix lookup.<br />
<br />
(This also happens with compounds.)<br />
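<br />
The splitting step can be pictured with a short Python sketch. This is a simplification under stated assumptions: the function name <code>split_joined</code> is hypothetical, and the sketch splits on every <code>+</code>, ignoring escaped characters and other details the real [[apertium-pretransfer]] takes care of.<br />

```python
def split_joined(lexical_unit: str) -> str:
    """Split '^a<t>+b<u>$' into '^a<t>$ ^b<u>$' (simplified)."""
    body = lexical_unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("expected a single ^...$ lexical unit")
    # Each '+'-joined analysis becomes its own lexical unit.
    parts = body[1:-1].split("+")
    return " ".join(f"^{part}$" for part in parts)


print(split_joined("^de<pr>+el<det><def><m><sg>$"))
# ^de<pr>$ ^el<det><def><m><sg>$
```

A unit without <code>+</code> passes through unchanged, so the same function can be applied to every lexical unit in the stream.<br />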
<br />
<br />
<br />
== Can I translate simple entries to and from multiword entries? ==<br />
From the [https://wiki.apertium.org/w/images/d/d0/Apertium2-documentation.pdf Manual],<br />
<br />
<blockquote><br />
It can be the case that a lemma is a multiword of this kind in one language and a single word in the other language. In that case, in the bilingual dictionary, the multiword will contain the <g> element and the single word will not.<br />
</blockquote><br />
<br />
Yes, you can. The bidix only cares about lemmas.<br />
<br />
<br />
<br />
==The complicated cases==<br />
<br />
It's possible to have pretty complex multiword combinations.<br />
<br />
<pre><br />
<e lm="zračna luka"><br />
<i>zračn</i><br />
<par n="zračn/a__adj"/><br />
<p><br />
<l><b/>luk</l><br />
<r><g><b/>luk</g></r><br />
</p><br />
<par n="stolic/a__n"/><br />
</e><br />
</pre><br />
<br />
<pre><br />
$ echo "zračna luka" | lt-proc sh-mk.automorf.bin <br />
^zračna luka/zračna<adj><f><sg><nom># luka<n><f><gen><pl>/zračna<adj><f><sg><nom># luka<n><f><nom><sg>$<br />
<br />
$ echo "zračna luka" | lt-proc sh-mk.automorf.bin | apertium-tagger -g sh-mk.prob <br />
^zračna<adj><f><sg><nom># luka<n><f><gen><pl>$<br />
<br />
$ echo "zračna luka" | lt-proc sh-mk.automorf.bin | apertium-tagger -g sh-mk.prob | apertium-pretransfer<br />
^zračna# luka<adj><f><sg><nom><n><f><gen><pl>$<br />
</pre><br />
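<br />
The last step in the output above, where the <code>#</code> part is moved in front of the first tag, can be sketched in Python as follows. The function name <code>move_queue</code> is hypothetical and the code only reproduces the behaviour visible in this example; it is not the real [[apertium-pretransfer]] code.<br />

```python
def move_queue(lexical_unit: str) -> str:
    """Move the '# ...' queue so that it directly follows the lemma head."""
    body = lexical_unit[1:-1]  # strip the surrounding ^ and $
    hash_pos = body.find("#")
    if hash_pos == -1:
        return lexical_unit  # no queue, nothing to move
    head, queue = body[:hash_pos], body[hash_pos:]
    first_tag = head.find("<")
    if first_tag == -1:
        return lexical_unit  # the queue already follows the bare lemma
    head_lemma, head_tags = head[:first_tag], head[first_tag:]
    queue_tag = queue.find("<")
    if queue_tag == -1:
        queue_lemma, queue_tags = queue, ""
    else:
        queue_lemma, queue_tags = queue[:queue_tag], queue[queue_tag:]
    # Lemma material first, then all tags in their original order.
    return f"^{head_lemma}{queue_lemma}{head_tags}{queue_tags}$"


print(move_queue("^zračna<adj><f><sg><nom># luka<n><f><gen><pl>$"))
# ^zračna# luka<adj><f><sg><nom><n><f><gen><pl>$
```

With an uninflected queue the result is the familiar <code>lemma# queue&lt;tags&gt;</code> form; when the second word carries its own tags, as here, the tags of both words end up concatenated, which is what makes the bidix lookup difficult.<br />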
<br />
;Need to consider:<br />
<br />
* Analysis<br />
* Transfer (e.g. in the bidix)<br />
* Generation<br />
* Head initial, and head final multiwords (e.g. adj+noun and phrasal verbs)<br />
<br />
;Problems:<br />
<br />
* How to resolve <code><nowiki>^zračna# luka<adj><f><sg><nom><n><f><gen><pl>$</nowiki></code> in the bidix?<br />
<br />
;Solutions:<br />
<br />
* Have two paradigms for each adjective, one with tags, one without. (bad) <br />
::This would leave us with: ^zračna luka<n><f><gen><pl>$ (basically an orthographic paradigm).<br />
* Have more than one entry per multi-word &mdash; this is done in <code>apertium-es-ca</code>, see "dirección general", "direcciones generales". (bad)<br />
* Have a parameterised paradigm, that when called one way outputs a paradigm with symbols, and another way outputs a paradigm without symbols.<br />
::This would only be one way, the problem would come when we try and generate. How do we get the adjective to agree with the noun?<br />
<br />
===The Spanish hack===<br />
<br />
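This is how the current <code>apertium-es-ca</code> pair handles it: one entry per inflected form of the multiword. This is just about tenable for Spanish, but has no chance of working for Slavic languages, which have far richer noun and adjective inflection.<br />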
<br />
<pre><br />
<e lm="dirección general"><br />
<p><br />
<l>dirección<b/>general</l><br />
<r>dirección<b/>general<s n="n"/><s n="f"/><s n="sg"/></r><br />
</p><br />
</e><br />
<e lm="dirección general"><br />
<p><br />
<l>direcciones<b/>generales</l><br />
<r>dirección<b/>general<s n="n"/><s n="f"/><s n="pl"/></r><br />
</p><br />
</e><br />
</pre><br />
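<br />
The hack amounts to writing out one <code>&lt;e&gt;</code> element per surface form. A small Python sketch of that enumeration, assuming a hand-written table of forms (the helper name <code>entry</code> and the <code>FORMS</code> table are made up for illustration):<br />

```python
def entry(lemma_words, surface_words, tags):
    """Build one monodix <e> element for a fully spelled-out multiword form."""
    left = "<b/>".join(surface_words)
    right = "<b/>".join(lemma_words) + "".join(f'<s n="{t}"/>' for t in tags)
    lm = " ".join(lemma_words)
    return f'<e lm="{lm}"><p><l>{left}</l><r>{right}</r></p></e>'


LEMMA = ["dirección", "general"]

# Every inflected form has to be listed by hand: (surface words, tags).
FORMS = [
    (["dirección", "general"], ["n", "f", "sg"]),
    (["direcciones", "generales"], ["n", "f", "pl"]),
]

for surface, tags in FORMS:
    print(entry(LEMMA, surface, tags))
```

Two rows are enough for a Spanish noun phrase; a language with case inflection needs one row per case and number combination for every multiword, which is why the approach does not scale.<br />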
<br />
=== The Polish hack ===<br />
<br />
The Polish analyser uses [[Metadix]] to solve the multiword problem, though this is less than desirable:<br />
<br />
<pre><br />
<pardef n="kamie/ń [nazębn]y__n"><br />
<e><br />
<p><br />
<l>ń<b/></l><br />
<r>ń<b/></r><br />
</p><br />
<i><prm/></i><br />
<p><br />
<l>y</l><br />
<r>y<s n="n"/><s n="mi"/><s n="sg"/><s n="nom"/></r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>nia<b/></l><br />
<r>ń<b/></r><br />
</p><br />
<i><prm/></i><br />
<p><br />
<l>ego</l><br />
<r>y<s n="n"/><s n="mi"/><s n="sg"/><s n="gen"/></r><br />
</p><br />
</e><br />
[etc.]<br />
</pardef><br />
</pre><br />
<br />
with the following entries:<br />
<br />
<pre><br />
<e lm="kamień nazębny"><i>kamie</i><par n="kamie/ń [nazębn]y__n" prm="nazębn"/></e><br />
<e lm="kamień szlachetny"><i>kamie</i><par n="kamie/ń [nazębn]y__n" prm="szlachetn"/></e><br />
</pre><br />
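<br />
The effect of the <code>prm</code> parameter can be sketched in Python. This only mimics the substitution for the two paradigm rows shown above; the names <code>PARADIGM</code> and <code>expand</code> are assumptions for the example, and the actual expansion is done by the [[Metadix]] tooling.<br />

```python
# Each row: ((left, right) before <prm/>, (left, right) after <prm/>).
# Only the two rows shown in the pardef above are included.
PARADIGM = [
    (("ń ", "ń "), ("y", "y<n><mi><sg><nom>")),
    (("nia ", "ń "), ("ego", "y<n><mi><sg><gen>")),
]


def expand(stem: str, prm: str) -> dict:
    """Map each surface form to its analysis for one stem and parameter."""
    pairs = {}
    for (left1, right1), (left2, right2) in PARADIGM:
        surface = stem + left1 + prm + left2
        analysis = stem + right1 + prm + right2
        pairs[surface] = analysis
    return pairs


for surface, analysis in expand("kamie", "nazębn").items():
    print(surface, "->", analysis)
# kamień nazębny -> kamień nazębny<n><mi><sg><nom>
# kamienia nazębnego -> kamień nazębny<n><mi><sg><gen>
```

Calling <code>expand("kamie", "szlachetn")</code> reuses the same rows for "kamień szlachetny", which is the saving the parameterised paradigm buys.<br />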
<br />
=== The Nynorsk hack ===<br />
<br />
(See [https://sourceforge.net/mailarchive/message.php?msg_name=e94dc08d0910291130p14f4cbc0l87e15d138840b074%40mail.gmail.com this mailing list discussion] for alternative versions.)<br />
<br />
'''What we want:'''<br />
<br />
anbefale<vblex> => rå til<br />
anbefale<vblex> ikke<adv> => rå ikkje til<br />
publisere<vblex> => gje ut<br />
publisere<vblex> helst<adv> daglig<adv> => gje helst dagleg ut<br />
<br />
i.e., we want a simple Bokmål verb translated into a particle verb, and any following string of adverbs should be placed between the (inflected) verb and the (uninflected/invariant) particle.<br />
<br />
'''The hack:'''<br />
<br />
For generation we don't actually need the multiwords in monodix (but it doesn't hurt). We have the regular multiword entry in bidix:<br />
<pre><br />
<e> <p><l>rå<g><b/>til</g></l><r>anbefale</r></p><par n="vblex"/></e><br />
</pre><br />
<br />
and the transfer rule that matches "vblex adv" writes<br />
<br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lemh"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
<b pos="1"/><br />
<lu><clip pos="2" side="tl" part="whole"/></lu><br />
<b/><br />
<lu><clip pos="1" side="tl" part="lemq"/></lu><br />
</out><br />
<br />
So now transfer will give us the following result:<br />
<pre><br />
$ echo '^anbefale<vblex><pret>$ ^ikke<adv>$' | apertium-transfer apertium-nn-nb.nb-nn.t1x nb-nn.t1x.bin nb-nn.autobil.bin<br />
^rå<vblex><pret>$ ^ikkje<adv>$ ^# til$<br />
</pre><br />
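<br />
The reordering done by the transfer rule can be sketched in Python. The sketch starts from the target-language side after bidix lookup, assumed here to be <code>rå# til&lt;vblex&gt;&lt;pret&gt;</code>, and it copies all tags rather than clipping <code>a_verb</code> and <code>temps</code> separately; the function names <code>split_lemma</code> and <code>reorder</code> are hypothetical.<br />

```python
def split_lemma(lu: str):
    """Return (lemh, tags, lemq) for a unit such as 'rå# til<vblex><pret>'."""
    tag_start = lu.find("<")
    lemma = lu if tag_start == -1 else lu[:tag_start]
    tags = "" if tag_start == -1 else lu[tag_start:]
    hash_pos = lemma.find("#")
    if hash_pos == -1:
        return lemma, tags, ""
    return lemma[:hash_pos], tags, lemma[hash_pos:]


def reorder(verb: str, adverbs: list) -> str:
    """Emit the inflected head, then the adverbs, then the invariant particle."""
    lemh, tags, lemq = split_lemma(verb)
    units = [lemh + tags] + list(adverbs)
    if lemq:
        units.append(lemq)  # the particle becomes a lexical unit of its own
    return " ".join(f"^{unit}$" for unit in units)


print(reorder("rå# til<vblex><pret>", ["ikkje<adv>"]))
# ^rå<vblex><pret>$ ^ikkje<adv>$ ^# til$
```

The sketch accepts any number of adverbs, whereas a real transfer rule has a fixed-length pattern, so each adverb count needs its own rule.<br />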
Thus we have three "lemmas" that need dictionary entries in generation. The first two ("rå" and "ikkje") are already there as regular simple entries; the last one is "# til", which we add in this manner:<br />
<br />
<pre><br />
<e lm="# til" r="RL"><p><l>til</l><r># til</r></p></e><br />
</pre><br />
Ugly, but it works. And since there are not very many such particles, the Nynorsk monodix doesn't need ''that'' many ugly entries.<br />
<br />
<br />
Of course, the Nynorsk monodix could also have "regular" entries for multiwords with inner inflection for catching "rå til" when there are no adverbs between the two, but we won't be able to ''analyse'' "rå ikkje/helst/dagleg til" with the above method.<br />
<br />
==See also==<br />
<br />
* [[Separable verbs]] <br />
** [[Yiddish morphology#Verbs]]<br />
* [[Módulo_de_procesamiento_de_expresiones_separables]]<br />
<br />
[[Category:Multiwords]]<br />
[[Category:Documentation in English]]</div>Thatprogrammerhttps://wiki.apertium.org/w/index.php?title=User:Thatprogrammer&diff=64859User:Thatprogrammer2017-12-08T02:17:23Z<p>Thatprogrammer: Created page with "Student in Google Codein 2017-2018 IRC Nickname: EthanYang"</p>
<hr />
<div>Student in Google Codein 2017-2018<br />
IRC Nickname: EthanYang</div>Thatprogrammer