Difference between revisions of "Multiwords"

Revision as of 07:38, 11 June 2010

Overview

lttoolbox currently has three mechanisms for creating multiwords, of varying complexity:

<b/> simply inserts a blank; use it if you want a word that contains a space but is inflected only at the end

entry: <e><i>record<b/>player</i><par n="house__n"/></e>

analysis: ^record player/record player<n><sg>$

analysis: ^record players/record player<n><pl>$

<g/> is used (in combination with <b/>) when the inflection is in the middle of the word and an invariant part follows at the end

entry: <e><i>coffee</i><par n="house__n"/><p><l><b/>with<b/>milk</l><r><g><b/>with<b/>milk</g></r></p></e>

analysis: ^coffee with milk/coffee<n><sg># with milk$

analysis: ^coffees with milk/coffee<n><pl># with milk$

<j/> is used when you want the analysis to be split into two lexical units
entry: <e><p><l>la<b/>mayoría<b/>de</l><r>la<b/>mayoría<s n="prn"/><s n="tn"/><s n="mf"/><s n="sp"/><j/>de<s n="pr"/></r></p></e>
analysis: ^la mayoría de/la mayoría<prn><tn><mf><sp>+de<pr>$
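
To make the three analysis shapes above easier to compare, here is a minimal Python sketch that takes one reading apart into its pieces. It is only an illustration, not part of lttoolbox: the function name parse_reading and the dictionary keys are made up for this example, and it assumes that "+" and "#" never occur inside a lemma.

```python
import re

def parse_reading(reading):
    """Split one reading (the text after the '/') into lexical units.

    Hypothetical helper: units are separated by '+' (from <j/>), and
    the invariant part introduced by '#' (from <g>) is kept as 'queue'.
    """
    units = []
    for part in reading.split("+"):
        # Everything from '#' onwards is the invariant queue.
        head, sep, queue = part.partition("#")
        lemma = head.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", head)
        units.append({"lemma": lemma, "tags": tags, "queue": sep + queue})
    return units

print(parse_reading("record player<n><pl>"))
print(parse_reading("coffee<n><pl># with milk"))
print(parse_reading("la mayoría<prn><tn><mf><sp>+de<pr>"))
```

With <b/> the space simply ends up inside the lemma, with <g> the invariant part is carried separately after the tags, and with <j/> the reading contains two units.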

Simple usage

Here is an example from English to Esperanto.
The English monodix (en.dix) contains:

<e lm="become"><i>bec</i><par n="bec/ome__vblex"/></e>
<e lm="become acquainted">
  <i>bec</i>
  <par n="bec/ome__vblex"/>
  <p>
    <l><b/>acquainted</l>  
    <r><g><b/>acquainted</g></r>    
  </p>
</e>
<e lm="become acquainted with">
  <i>bec</i>
  <par n="bec/ome__vblex"/>
  <p>
    <l><b/>acquainted<b/>with</l>
    <r><g><b/>acquainted<b/>with</g></r>
  </p>
</e>

So become is conjugated as a normal verb and the rest is fixed (invariant). Note that <b/> is a space (blank) and that the fixed words are inside <g> </g>.
In Esperanto, "become" is "iĝi" (or "fariĝi"), "become acquainted" is "konatiĝi" and "become acquainted with" is "konatiĝi kun". The iĝi/konatiĝi should be conjugated according to become. Thus the bidix entries are

<e><p><l>iĝi<s n="vblex"/></l><r>become<s n="vblex"/></r></p></e>
<e><p><l>konatiĝi<s n="vblex"/></l><r>become<g><b/>acquainted</g><s n="vblex"/></r></p></e>
<e><p><l>konatiĝi<g><b/>kun</g><s n="vblex"/></l><r>become<g><b/>acquainted<b/>with</g><s n="vblex"/></r></p></e>

And the Esperanto (eo) monodix contains:

<e lm="iĝi"><i>iĝ</i><par n="verb__vblex"/></e>
<e lm="konatiĝi"><i>konatiĝ</i><par n="verb__vblex"/></e>
<e lm="konatiĝi kun">
  <i>konatiĝ</i>
  <par n="verb__vblex"/>
  <p>
    <l><b/>kun</l>
    <r><g><b/>kun</g></r>
  </p>
</e>

Note how the English fixed words <g><b/>acquainted<b/>with</g> become <g><b/>kun</g>.
Also note that you need at least one verbal transfer rule to ensure that the invariant part ("lemq") is put after the morphological tags (a_verb, temps):

    <rule comment="VBLEX">
      <pattern>
	<pattern-item n="vblex"/>
      </pattern>
      <action>
        <out>
          <lu>	    
            <clip pos="1" side="tl" part="lemh"/>
            <clip pos="1" side="tl" part="a_verb"/>
            <clip pos="1" side="tl" part="temps"/>
            <clip pos="1" side="tl" part="lemq"/>
          </lu>
        </out>
      </action>
    </rule>

The complicated cases

It's possible to have pretty complex multiword combinations.

    <e lm="zračna luka">
      <i>zračn</i>
      <par n="zračn/a__adj"/>
      <p>
        <l><b/>luk</l>
        <r><g><b/>luk</g></r>
      </p>
      <par n="stolic/a__n"/>
    </e>

$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin 
^zračna luka/zračna<adj><f><sg><nom># luka<n><f><gen><pl>/zračna<adj><f><sg><nom># luka<n><f><nom><sg>$

$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin  | apertium-tagger -g sh-mk.prob 
^zračna<adj><f><sg><nom># luka<n><f><gen><pl>$

$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin  | apertium-tagger -g sh-mk.prob  | apertium-pretransfer
^zračna# luka<adj><f><sg><nom><n><f><gen><pl>$
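
The effect of the last pipeline step can be modelled in a few lines of Python. This is a simplified sketch that reproduces the example above, not the actual apertium-pretransfer implementation; the function name move_queue is made up, and it assumes a single lexical unit with at most one "#".

```python
def move_queue(lu):
    """Move the lemma part of the '#' queue in front of the head's tags.

    Hypothetical model of what pretransfer does to one lexical unit.
    """
    body = lu.strip("^$")
    head, sep, queue = body.partition("#")
    if not sep:
        return lu  # no invariant part: nothing to move
    # Split the head and the queue into their lemma text and their tags.
    lemma, lt, head_tags = head.partition("<")
    queue_lemma, qlt, queue_tags = queue.partition("<")
    return "^" + lemma + "#" + queue_lemma + lt + head_tags + qlt + queue_tags + "$"

print(move_queue("^zračna<adj><f><sg><nom># luka<n><f><gen><pl>$"))
```

The sketch makes the problem visible: once the queue lemma is moved, the adjective tags and the noun tags end up in one undivided sequence.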

Need to consider

Analysis
Transfer (e.g. in the bidix)
Generation
Head initial, and head final multiwords (e.g. adj+noun and phrasal verbs)

Problems

How to resolve ^zračna# luka<adj><f><sg><nom><n><f><gen><pl>$ in the bidix?

Solutions

Have two paradigms for each adjective, one with tags, one without. (bad) This would leave us with ^zračna luka<n><f><gen><pl>$ (basically an orthographic paradigm).

Have more than one entry per multi-word — this is done in apertium-es-ca, see "dirección general", "direcciones generales". (bad)
Have a parameterised paradigm, that when called one way outputs a paradigm with symbols, and another way outputs a paradigm without symbols.

This would only work in one direction; the problem would come when we try to generate: how do we get the adjective to agree with the noun?

The Spanish hack

This is how it is handled in the current apertium-es-ca pair. It is just about tenable for Spanish, but has no chance of working for Slavic languages.

    <e lm="dirección general">
      <p>
        <l>dirección<b/>general</l>
        <r>dirección<b/>general<s n="n"/><s n="f"/><s n="sg"/></r>
      </p>
    </e>
    <e lm="dirección general">
      <p>
        <l>direcciones<b/>generales</l>
        <r>dirección<b/>general<s n="n"/><s n="f"/><s n="pl"/></r>
      </p>
    </e>

The Polish hack

The Polish analyser uses Metadix to solve the multiword problem, though this is less than desirable:

<pardef n="kamie/ń [nazębn]y__n">
  <e>
    <p>
      <l>ń<b/></l>
      <r>ń<b/></r>
    </p>
    <i><prm/></i>
    <p>
      <l>y</l>
      <r>y<s n="n"/><s n="mi"/><s n="sg"/><s n="nom"/></r>
    </p>
  </e>
  <e>
    <p>
      <l>nia<b/></l>
      <r>ń<b/></r>
    </p>
    <i><prm/></i>
    <p>
      <l>ego</l>
      <r>y<s n="n"/><s n="mi"/><s n="sg"/><s n="gen"/></r>
    </p>
  </e>
  [etc.]
</pardef>

with the following entries:

<e lm="kamień nazębny"><i>kamie</i><par n="kamie/ń [nazębn]y__n" prm="nazębn"/></e>
<e lm="kamień szlachetny"><i>kamie</i><par n="kamie/ń [nazębn]y__n" prm="szlachetn"/></e>
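
The way the parameter is spliced into the paradigm can be sketched in Python as follows. This is only an illustration of the substitution, not how Metadix is implemented: the PARADIGM table and the expand function are hypothetical, and only the two paradigm entries shown above are modelled.

```python
# Each row models one <e> of the pardef shown above:
# (surface before <prm/>, surface after <prm/>,
#  lemma before <prm/>,   lemma after <prm/>,   tags)
PARADIGM = [
    ("ń ", "y", "ń ", "y", ["n", "mi", "sg", "nom"]),
    ("nia ", "ego", "ń ", "y", ["n", "mi", "sg", "gen"]),
]

def expand(stem, prm, paradigm=PARADIGM):
    """Return {surface form: analysis} for one entry and its prm value."""
    forms = {}
    for s_pre, s_post, l_pre, l_post, tags in paradigm:
        # The prm value is inserted where <prm/> stands in the paradigm.
        surface = stem + s_pre + prm + s_post
        lemma = stem + l_pre + prm + l_post
        forms[surface] = lemma + "".join("<%s>" % t for t in tags)
    return forms

for surface, analysis in expand("kamie", "nazębn").items():
    print(surface, "->", analysis)
```

Calling expand("kamie", "szlachetn") reuses the same paradigm for the second entry, which is the point of the parameter.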

The Nynorsk hack

(See this mailing list discussion for alternative versions.)
What we want:

anbefale<vblex> => rå til
anbefale<vblex> ikke<adv> => rå ikkje til
publisere<vblex> => gje ut
publisere<vblex> helst<adv> daglig<adv> => gje helst dagleg ut

That is, we want a simple Bokmål verb translated into a particle verb, with any following string of adverbs placed between the (inflected) verb and the (uninflected/invariant) particle.
The hack:
For generation we don't actually need the multiwords in monodix (but it doesn't hurt). We have the regular multiword entry in bidix:

 <e>       <p><l>rå<g><b/>til</g></l><r>anbefale</r></p><par n="vblex"/></e>

and the transfer rule that matches "vblex adv" writes

      <out>
        <lu>
          <clip pos="1" side="tl" part="lemh"/>
          <clip pos="1" side="tl" part="a_verb"/>
          <clip pos="1" side="tl" part="temps"/>
        </lu>
        
        <lu><clip pos="2" side="tl" part="whole"/></lu>
        
        <lu><clip pos="1" side="tl" part="lemq"/></lu>
      </out>

So now transfer will give us the following result:

 echo '^anbefale<vblex><pret>$ ^ikke<adv>$' | apertium-transfer apertium-nn-nb.nb-nn.t1x nb-nn.t1x.bin nb-nn.autobil.bin
 ^rå<vblex><pret>$ ^ikkje<adv>$ ^# til$

Thus we have three "lemmas" which need dictionary entries in generation. The first two ("rå" and "ikkje") are in there already as regular simple entries; the last one is "# til", which we add in this manner:

   <e lm="# til" r="RL"><p><l>til</l><r># til</r></p></e>

Ugly, but it works. And since there are not very many such particles, the Nynorsk monodix doesn't need that many ugly entries.
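
The reordering performed by the transfer rule can be modelled with a short Python sketch. It is a simplified illustration, not the apertium-transfer implementation: the function name place_adverbs is made up, the target-language verb is assumed to arrive as lemma head, "#" queue and tags in one string, and accepting more than one adverb is an extension beyond the "vblex adv" pattern described above.

```python
def place_adverbs(tl_verb, tl_adverbs):
    """Emit verb head + tags, then the adverbs, then the invariant queue.

    tl_verb is assumed to look like 'rå# til<vblex><pret>'.
    """
    lemma, lt, tags = tl_verb.partition("<")
    # lemh is the part before '#', lemq is '#' plus what follows it.
    lemh, sep, rest = lemma.partition("#")
    lemq = sep + rest
    units = [lemh + lt + tags] + list(tl_adverbs)
    if lemq:
        units.append(lemq)
    return " ".join("^" + u + "$" for u in units)

print(place_adverbs("rå# til<vblex><pret>", ["ikkje<adv>"]))
```

The last unit, "# til", is exactly the "lemma" that needs its own monodix entry for generation.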


Of course, the Nynorsk monodix could also have "regular" entries for multiwords with inner inflection for catching "rå til" when there are no adverbs between the two, but we won't be able to analyse "rå ikkje/helst/dagleg til" with the above method.

See also

Separable verbs
Módulo_de_procesamiento_de_expresiones_separables

