Difference between revisions of "Multiwords"


Revision as of 10:07, 10 September 2008

Simple usage

Here is an example from English to Esperanto.

The English dictionary, en.dix, contains:

<e lm="become"><i>bec</i><par n="bec/ome__vblex"/></e>
<e lm="become acquainted">
  <i>bec</i>
  <par n="bec/ome__vblex"/>
  <p>
    <l><b/>acquainted</l>  
    <r><g><b/>acquainted</g></r>    
  </p>
</e>
<e lm="become acquainted with">
  <i>bec</i>
  <par n="bec/ome__vblex"/>
  <p>
    <l><b/>acquainted<b/>with</l>
    <r><g><b/>acquainted<b/>with</g></r>
  </p>
</e>

So "become" is conjugated as a normal verb and the rest is fixed (invariant). Note that <b/> is a space (blank) and that the fixed words are enclosed in <g> </g>.

In Esperanto, "become" is "iĝi" (or "fariĝi"), "become acquainted" is "konatiĝi" and "become acquainted with" is "konatiĝi kun". The verb iĝi/konatiĝi should take the same inflection as "become". The bidix entries are therefore:

<e><p><l>iĝi<s n="vblex"/></l><r>become<s n="vblex"/></r></p></e>
<e><p><l>konatiĝi<s n="vblex"/></l><r>become<g><b/>acquainted</g><s n="vblex"/></r></p></e>
<e><p><l>konatiĝi<g><b/>kun</g><s n="vblex"/></l><r>become<g><b/>acquainted<b/>with</g><s n="vblex"/></r></p></e>
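
The effect of these three entries can be sketched in Python. This is only a simplified illustration of the lookup, not how lt-proc works internally: the BIDIX table, the translate function and the <past> tag in the usage example are assumptions made for the example. It assumes the lexical unit has already passed through apertium-pretransfer, so the fixed words follow the lemma after a "#" and the inflection tags come last.

```python
import re

# Hypothetical table mirroring the three bidix entries above,
# read from right (English) to left (Esperanto).
BIDIX = {
    ("become", "vblex"): "iĝi",
    ("become# acquainted", "vblex"): "konatiĝi",
    ("become# acquainted with", "vblex"): "konatiĝi# kun",
}

def translate(lu):
    """Translate one lexical unit of the form ^lemma<tag>...$.

    Returns the translated unit, or None if there is no matching entry.
    """
    m = re.fullmatch(r"\^([^<]*)((?:<[^>]*>)+)\$", lu)
    if m is None:
        raise ValueError("expected a lexical unit of the form ^lemma<tag>...$")
    lemma, tags = m.groups()
    first_tag = re.findall(r"<([^>]*)>", tags)[0]
    target = BIDIX.get((lemma, first_tag))
    if target is None:
        return None
    # The fixed words are swapped together with the lemma;
    # the inflection tags are carried over unchanged.
    return "^" + target + tags + "$"

print(translate("^become# acquainted with<vblex><past>$"))
```

The point of the sketch is that the whole fixed part is replaced in one step, while the tags that drive the conjugation pass through untouched.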

And the eo monodix:

<e lm="iĝi"><i>iĝ</i><par n="verb__vblex"/></e>
<e lm="konatiĝi"><i>konatiĝ</i><par n="verb__vblex"/></e>
<e lm="konatiĝi kun">
  <i>konatiĝ</i>
  <par n="verb__vblex"/>
  <p>
    <l><b/>kun</l>
    <r><g><b/>kun</g></r>
  </p>
</e>	

Note how the English fixed words <g><b/>acquainted<b/>with</g> become <g><b/>kun</g>.

The complicated cases

It's possible to have fairly complex multiword combinations.

    <e lm="zračna luka">
      <i>zračn</i>
      <par n="zračn/a__adj"/>
      <p>
        <l><b/>luk</l>
        <r><g><b/>luk</g></r>
      </p>
      <par n="stolic/a__n"/>
    </e>
$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin 
^zračna luka/zračna<adj><f><sg><nom># luka<n><f><gen><pl>/zračna<adj><f><sg><nom># luka<n><f><nom><sg>$

$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin  | apertium-tagger -g sh-mk.prob 
^zračna<adj><f><sg><nom># luka<n><f><gen><pl>$

$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin  | apertium-tagger -g sh-mk.prob  | apertium-pretransfer
^zračna# luka<adj><f><sg><nom><n><f><gen><pl>$
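
The reordering performed in the last step can be sketched in Python. This is only an illustration that reproduces the output shown above; it is not the apertium-pretransfer source, and the function name move_queue and the regular expression are assumptions made for the example.

```python
import re

# lemma, tags of the head, '#' plus the fixed words, tags that follow them
LU_WITH_QUEUE = re.compile(
    r"([^<#]*)((?:<[^>]*>)*)(#[^<]*)((?:<[^>]*>)*)"
)

def move_queue(lu):
    """Move the '# ...' part of a lexical unit so it directly follows the lemma."""
    lu = lu.strip()
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError("expected a lexical unit of the form ^...$")
    m = LU_WITH_QUEUE.fullmatch(lu[1:-1])
    if m is None:
        return lu  # no '#' part: nothing to move
    lemma, head_tags, queue, queue_tags = m.groups()
    return "^" + lemma + queue + head_tags + queue_tags + "$"

print(move_queue("^zračna<adj><f><sg><nom># luka<n><f><gen><pl>$"))
```

After the move, the two tag sequences sit next to each other with nothing separating them, which is the form the bidix has to cope with.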
Need to consider
  • Analysis
  • Transfer (e.g. in the bidix)
  • Generation
  • Head initial, and head final multiwords (e.g. adj+noun and phrasal verbs)
Problems
  • How to resolve ^zračna# luka<adj><f><sg><nom><n><f><gen><pl>$ in the bidix?
Solutions
  • Have two paradigms for each adjective, one with tags, one without. (bad)
This would leave us with: ^zračna luka<n><f><gen><pl>$ (basically an orthographic paradigm).
  • Have more than one entry per multiword, one for each inflected form. This is done in apertium-es-ca; see "dirección general", "direcciones generales". (bad)
  • Have a parameterised paradigm, that when called one way outputs a paradigm with symbols, and another way outputs a paradigm without symbols.
This would only work in one direction. The problem comes when we try to generate: how do we get the adjective to agree with the noun?

The Spanish hack

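This is how multiwords are handled in the current apertium-es-ca pair. It is just about tenable for Spanish, where a noun has only a singular and a plural form, but it has no chance of scaling to Slavic languages, where every case and number needs its own entry.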

    <e lm="dirección general">
      <p>
        <l>dirección<b/>general</l>
        <r>dirección<b/>general<s n="n"/><s n="f"/><s n="sg"/></r>
      </p>
    </e>
    <e lm="dirección general">
      <p>
        <l>direcciones<b/>generales</l>
        <r>dirección<b/>general<s n="n"/><s n="f"/><s n="pl"/></r>
      </p>
    </e>
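
To make the cost of this approach concrete, here is a small Python sketch that writes out one full-form entry per inflected form. The helper full_form_entries and its argument layout are invented for the illustration; they are not part of any Apertium tool.

```python
from xml.sax.saxutils import escape

def full_form_entries(lemma, forms):
    """Build one <e> entry per (surface form, tag list) pair."""
    lm = escape(lemma)
    right_lemma = lm.replace(" ", "<b/>")
    entries = []
    for surface, tags in forms:
        left = escape(surface).replace(" ", "<b/>")
        symbols = "".join('<s n="%s"/>' % t for t in tags)
        entries.append(
            '<e lm="%s"><p><l>%s</l><r>%s%s</r></p></e>'
            % (lm, left, right_lemma, symbols)
        )
    return entries

entries = full_form_entries(
    "dirección general",
    [
        ("dirección general", ["n", "f", "sg"]),
        ("direcciones generales", ["n", "f", "pl"]),
    ],
)
for entry in entries:
    print(entry)
```

Every inflected form has to be listed by hand, so the number of entries per multiword grows with the number of forms in the paradigm.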

The Polish hack

The Polish analyser uses Metadix to solve the multiword problem, though this is less than desirable:

<pardef n="kamie/ń [nazębn]y__n">
  <e>
    <p>
      <l>ń<b/></l>
      <r>ń<b/></r>
    </p>
    <i><prm/></i>
    <p>
      <l>y</l>
      <r>y<s n="n"/><s n="mi"/><s n="sg"/><s n="nom"/></r>
    </p>
  </e>
  <e>
    <p>
      <l>nia<b/></l>
      <r>ń<b/></r>
    </p>
    <i><prm/></i>
    <p>
      <l>ego</l>
      <r>y<s n="n"/><s n="mi"/><s n="sg"/><s n="gen"/></r>
    </p>
  </e>
  [etc.]
</pardef>

with the following entries:

<e lm="kamień nazębny"><i>kamie</i><par n="kamie/ń [nazębn]y__n" prm="nazębn"/></e>
<e lm="kamień szlachetny"><i>kamie</i><par n="kamie/ń [nazębn]y__n" prm="szlachetn"/></e>
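
How the parameter is substituted can be sketched in Python. This is a simplified model of the expansion, not the Metadix implementation: the PARADIGM list holds only the two paradigm entries shown above (the rest are elided in the original), and the names PARADIGM and expand are assumptions made for the example.

```python
# Each tuple: (left before prm, left after prm, right before prm, right after prm).
# Only the two entries shown in the pardef above are modelled.
PARADIGM = [
    ("ń ", "y", "ń ", "y<n><mi><sg><nom>"),
    ("nia ", "ego", "ń ", "y<n><mi><sg><gen>"),
]

def expand(stem, prm, paradigm=PARADIGM):
    """Return a mapping from surface form to analysis for one dictionary entry."""
    return {
        stem + l_pre + prm + l_post: stem + r_pre + prm + r_post
        for l_pre, l_post, r_pre, r_post in paradigm
    }

for surface, analysis in expand("kamie", "nazębn").items():
    print(surface, "->", analysis)
```

Calling expand("kamie", "szlachetn") reuses the same paradigm for "kamień szlachetny", which is the saving the parameterised paradigm gives over listing every form by hand.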