Multiwords

From Apertium
Revision as of 13:09, 25 November 2016 by Rcrowther (talk | contribs)

The term multiword includes simple words that have spaces in them, words with separable parts, contractions and compounds of several lemmas. Apertium supports these to varying degrees.

Overview

lttoolbox currently has four mechanisms for creating multiwords, of varying complexity:

  1. <b/> simply inserts a blank; use it for a word that contains a space and inflects only at the end
    • entry: <e><i>record<b/>player</i><par n="house__n"/></e>
    • analysis: ^record player/record player<n><sg>$
    • analysis: ^record players/record player<n><pl>$
  2. <g/> is used (in combination with <b/>) when you have inflection in the middle of the word, and an invariant part at the end
    • entry: <e><i>coffee</i><par n="house__n"/><p><l><b/>with<b/>milk</l><r><g><b/>with<b/>milk</g></r></p></e>
    • analysis: ^coffee with milk/coffee<n><sg># with milk$
    • after disambiguation and pre-transfer: ^coffee# with milk<n><sg>$
    • analysis: ^coffees with milk/coffee<n><pl># with milk$
    • after disambiguation and pre-transfer: ^coffee# with milk<n><pl>$
  3. <j/> is used when you want the multiword to be split into two lexical units, each with its own analysis (set of tags), where both parts may vary independently
    • entry: <e><i>wr</i><par n="wr/ite__vblex"/><p><l><b/>about</l><r><j/>about<s n="pr"/></r></p></e>
    • analysis: ^write about/write<vblex><inf>+about<pr>/write<vblex><pres>+about<pr>$
    • after disambiguation and pre-transfer: ^write<vblex><inf>$ ^about<pr>$
    • analysis: ^writes about/write<vblex><pri><p3><sg>+about<pr>$
    • after disambiguation and pre-transfer: ^write<vblex><pri><p3><sg>$ ^about<pr>$
  4. <s n="compound-only-L"/> and <s n="compound-R"/> – an analysis with the compound-only-L tag in it can be the left part of a compound (many of these can chain), but can never stand alone as an analysis, while an analysis with the compound-R tag in it can be either a word on its own, or the final part of a compound.
    • entry: <e><p><l>kaffe</l><r>kaffe<s n="n"/><s n="m"/><s n="sg"/><s n="ind"/><s n="cmp"/><s n="compound-only-L"/></r></p></e>
    • entry: <e><p><l>bilet</l><r>bilete<s n="n"/><s n="nt"/><s n="sg"/><s n="ind"/><s n="cmp"/><s n="compound-only-L"/></r></p></e>
    • entry: <e><p><l>kostnaden</l><r>kostnad<s n="n"/><s n="m"/><s n="sg"/><s n="def"/><s n="compound-R"/></r></p></e>
    • analysis: ^kaffekostnaden/kaffe<n><m><sg><ind><cmp>+kostnad<n><m><sg><def>$
    • analysis: ^kaffebiletkostnaden/kaffe<n><m><sg><ind><cmp>+bilete<n><nt><sg><ind><cmp>+kostnad<n><m><sg><def>$
    • no analysis: ^bilet/*bilet$


More information on these is given below under Simple usage, in Compounds, and in the documentation (especially section 3.1.2.6).
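
As a rough illustration of the fourth mechanism, the following Python sketch mimics how left-only parts and right-capable parts combine. It is a toy model under stated assumptions, not how lttoolbox is implemented: the names LEFT_ONLY, RIGHT and analyse_compound are hypothetical, and the two tables are filled in by hand from the three example entries above (with the compound-only-L and compound-R tags left out of the output, as in the analyses shown).

```python
# Toy model of compound-only-L / compound-R; NOT lttoolbox's real algorithm.
# Surface form -> analysis, copied by hand from the three example entries.
LEFT_ONLY = {  # entries tagged compound-only-L: never valid on their own
    "kaffe": "kaffe<n><m><sg><ind><cmp>",
    "bilet": "bilete<n><nt><sg><ind><cmp>",
}
RIGHT = {  # entries tagged compound-R: valid alone or as the final part
    "kostnaden": "kostnad<n><m><sg><def>",
}


def analyse_compound(surface):
    """Return the analysis string for `surface`, or None if there is none."""
    # A compound-R entry can stand alone as a word.
    if surface in RIGHT:
        return RIGHT[surface]
    # Otherwise peel off one left-only part and analyse the remainder;
    # the recursion lets several left parts chain.
    for left, left_analysis in LEFT_ONLY.items():
        if surface.startswith(left):
            rest = analyse_compound(surface[len(left):])
            if rest is not None:
                return left_analysis + "+" + rest
    return None


print(analyse_compound("kaffekostnaden"))
print(analyse_compound("bilet"))  # None: a left-only part cannot stand alone
```

The key design point the sketch tries to capture is that a left-only part is accepted only when something analysable follows it, which is why "bilet" on its own gets no analysis.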


The following multiwords are not yet well supported:

  • Agreement multiwords: complex multiwords where two (or more) parts show some sort of agreement/dependence of tags (or, where certain tag combinations are illegal)
    • lt-mwpp takes a file which specifies which lemma combinations are multiwords, and what tags need to agree, and generates all the legal combinations in the lttoolbox dix format
  • Discontiguous multiwords: multiwords with an arbitrary number of unrelated words in between, e.g. the separable verbs in Germanic languages

(but see hacks below)

Why make a multiword entry?

The first thing to say about multiword translations is that they can sometimes be handled using only the Monodix basics and Bilingual dictionary basics (together with the superblank '<b/>').

The English verb 'roll' is often supplemented with a word for orientation/direction e.g. 'to roll over' or 'to roll down'. Within the Apertium system, we would like to treat these as 'many (two) words that make a single verb'. You might find this many-word verb in a sentence like 'the car rolled down the hill'. Perhaps a linguist may wish to study word-constructions like this further, but recognising 'roll down' as one word is a good step forward in machine translation.

We can do this by creating an English monodix paradigm. Please note that this example is a little contrived, as the same effect can be achieved with nothing more than a section entry. However, showing a paradigm means the example still works when the multiword verb is more difficult:

<pardef n="roll_down__vblex">
  <e>
    <p><l><b/>down</l><r><b/>down<s n="vblex"/><s n="inf"/></r></p>
  </e>
  <e>
    <p><l><b/>down</l><r><b/>down<s n="vblex"/><s n="imp"/></r></p>
  </e>
  <e>
    <p><l><b/>down</l><r><b/>down<s n="vblex"/><s n="pp"/></r></p>
  </e>
...
</pardef>

Note how the superblank '<b/>' is used to mark the limits of the words in the multiword verb.

We can use/trigger the multiword paradigm from an English monodix 'section' entry,

<e lm="roll down"><i>roll</i><par n="roll_down__vblex"/></e>

Now, if another language needs to identify 'roll down' as a special verb, the above definition can be triggered from a bidix,

<e><p><l>???lemma from another language???</l><r>roll<b/>down<s n="vblex"/></r></p></e>

Note the use of the superblank '<b/>' again, this time to construct the lemma.

This is a surprisingly easy and clear way to construct multiword recognition. At the time of writing, you can find examples of this method in dictionaries on Apertium, possibly because of its ease and clarity, or because the dictionary entries are old.


So you may ask, 'why not let Apertium treat these two words as separate words?'. Apertium is a flexible system :) If you need to get the effect, various bidix/monodix entries, or a rule in the first stage of the chunker module will work.

But the stream you will generate will be something like (simplified, 'chunker' stage),

{roll<vblex><imp>}{down<at_pr>}

and does not reflect the connection of the two words. We would prefer a stream that looked like,

{roll down<vblex><imp>}

The lack of connection between the words may limit us later. We will not be able to identify the two words as one unit when translating back from English. We may be using the chunker for simple connection rules, which is not what the chunker is for and makes our translation pairs confusing to read. If we want to do further manipulations on the text stream, in either direction, tracing the effects will become harder and harder. If we have used the chunker we may find it difficult to use the chunker for other purposes.

By all means patch to make some progress, but this is not a good end solution.


Simple usage

Simple usage of <g/> and <b/>

Here is an example from English to Esperanto with inner inflection followed by an invariant part with spaces.

In en.dix we have:

<e lm="become"><i>bec</i><par n="bec/ome__vblex"/></e>
<e lm="become acquainted">
  <i>bec</i>
  <par n="bec/ome__vblex"/>
  <p>
    <l><b/>acquainted</l>  
    <r><g><b/>acquainted</g></r>    
  </p>
</e>
<e lm="become acquainted with">
  <i>bec</i>
  <par n="bec/ome__vblex"/>
  <p>
    <l><b/>acquainted<b/>with</l>
    <r><g><b/>acquainted<b/>with</g></r>
  </p>
</e>

So become is conjugated as a normal verb and the rest is fixed (invariant). Note that <b/> is a space (blank) and that the fixed words are inside <g> </g>.

When "become acquainted" is read from the analyser, the output is

^become acquainted/become<vblex><inf># acquainted$

Before lexical transfer, the "lemma queue" (# acquainted) is put onto the lemma:

^become# acquainted<vblex><inf>$
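
The reordering above can be sketched in a few lines of Python. This is only a toy re-implementation written for illustration, not the actual code of the Apertium tools; the names LU and move_lemma_queue are hypothetical, and the sketch assumes the simplified stream format shown on this page (no superblanks or escaped characters).

```python
import re

# Toy version of the lemma-queue step; NOT the real Apertium implementation.
# Groups: 1 = lemma head, 2 = first run of <...> tags,
#         3 = lemma queue ("#" up to the next tag or the closing "$"),
#         4 = anything left over before the closing "$".
LU = re.compile(r"\^([^<#$]*)((?:<[^>]*>)*)(#[^<$]*)(.*?)\$")


def move_lemma_queue(stream):
    """Move each lemma queue so that it directly follows the lemma head."""
    def reorder(match):
        head, tags, queue, rest = match.groups()
        return "^" + head + queue + tags + rest + "$"
    return LU.sub(reorder, stream)


print(move_lemma_queue("^become<vblex><inf># acquainted$"))
# prints: ^become# acquainted<vblex><inf>$
```

Lexical units without a "#" queue do not match the pattern and pass through unchanged.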

In Esperanto, "become" is "iĝi" (or "fariĝi"), "become acquainted" is "konatiĝi" and "become acquainted with" is "konatiĝi kun". The iĝi/konatiĝi should be conjugated according to become. Thus the bidix entries are

<e><p><l>iĝi<s n="vblex"/></l><r>become<s n="vblex"/></r></p></e>
<e><p><l>konatiĝi<s n="vblex"/></l><r>become<g><b/>acquainted</g><s n="vblex"/></r></p></e>
<e><p><l>konatiĝi<g><b/>kun</g><s n="vblex"/></l><r>become<g><b/>acquainted<b/>with</g><s n="vblex"/></r></p></e>

And the eo monodix

<e lm="iĝi"><i>iĝ</i><par n="verb__vblex"/></e>
<e lm="konatiĝi"><i>konatiĝ</i><par n="verb__vblex"/></e>
<e lm="konatiĝi kun">
  <i>konatiĝ</i>
  <par n="verb__vblex"/>
  <p>
    <l><b/>kun</l>
    <r><g><b/>kun</g></r>
  </p>
</e>	

Note how the English fixed words <g><b/>acquainted<b/>with</g> become <g><b/>kun</g>.

Also note that you need at least one verbal transfer rule to ensure that the invariant part, the lemq (this is the <g> in dix), is put after the morphological tags (a_verb, temps):

    <rule comment="VBLEX">
      <pattern>
        <pattern-item n="vblex"/>
      </pattern>
      <action>
        <out>
          <lu>
            <clip pos="1" side="tl" part="lemh"/>
            <clip pos="1" side="tl" part="a_verb"/>
            <clip pos="1" side="tl" part="temps"/>
            <clip pos="1" side="tl" part="lemq"/>
          </lu>
        </out>
      </action>
    </rule>

Simple usage of <j/>

The documentation gives the following example from monodix:

<e lm="del" r="LR"> 
  <p> 
    <l>del</l> 
    <r>de<s n="pr"/><j/>el<s n="det"/><s n="def"/><s n="m"/><s n="sg"/></r> 
  </p> 
</e> 

(This is marked r="LR" and so will only be used in analysis.) When "del" is read, the output from the analyser is

^del/de<pr>+el<det><def><m><sg>$

This is passed as-is through the tagger, but apertium-pretransfer turns it into

^de<pr>$ ^el<det><def><m><sg>$

before bidix lookup.

(This also happens with compounds.)
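
The splitting step can be sketched as follows. This is a toy Python model for illustration only, not the code of apertium-pretransfer; the name split_joined is hypothetical, and the sketch assumes that "+" never occurs inside a lemma and that the unit carries no lemma queue.

```python
import re

# Toy version of the splitting step; NOT apertium-pretransfer's real code.
# Assumes "+" only ever separates joined analyses within one lexical unit.


def split_joined(stream):
    """Split every ^a+b$ lexical unit into ^a$ ^b$."""
    def split_unit(match):
        parts = match.group(1).split("+")
        return " ".join("^" + part + "$" for part in parts)
    return re.sub(r"\^([^$]*)\$", split_unit, stream)


print(split_joined("^de<pr>+el<det><def><m><sg>$"))
# prints: ^de<pr>$ ^el<det><def><m><sg>$
```

After this step each part is an ordinary lexical unit, so the bidix only needs entries for the parts, not for the contraction or compound as a whole.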

The complicated cases

It's possible to have pretty complex multiword combinations.

    <e lm="zračna luka">
      <i>zračn</i>
      <par n="zračn/a__adj"/>
      <p>
        <l><b/>luk</l>
        <r><g><b/>luk</g></r>
      </p>
      <par n="stolic/a__n"/>
    </e>
$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin 
^zračna luka/zračna<adj><f><sg><nom># luka<n><f><gen><pl>/zračna<adj><f><sg><nom># luka<n><f><nom><sg>$

$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin  | apertium-tagger -g sh-mk.prob 
^zračna<adj><f><sg><nom># luka<n><f><gen><pl>$

$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin  | apertium-tagger -g sh-mk.prob  | apertium-pretransfer
^zračna# luka<adj><f><sg><nom><n><f><gen><pl>$

Need to consider

  • Analysis
  • Transfer (e.g. in the bidix)
  • Generation
  • Head initial, and head final multiwords (e.g. adj+noun and phrasal verbs)

Problems

  • How to resolve ^zračna# luka<adj><f><sg><nom><n><f><gen><pl>$ in the bidix?

Solutions

  • Have two paradigms for each adjective, one with tags, one without. (bad)
    This would leave us with: ^zračna luka<n><f><gen><pl>$ (basically an orthographic paradigm).
  • Have more than one entry per multiword; this is done in apertium-es-ca, see "dirección general", "direcciones generales". (bad)
  • Have a parameterised paradigm, that when called one way outputs a paradigm with symbols, and another way outputs a paradigm without symbols.
    This would only work in one direction; the problem would come when we try to generate. How do we get the adjective to agree with the noun?

The Spanish hack

This is how it is taken care of in the current apertium-es-ca pair. The approach is just about tenable for Spanish, but it has no chance of working for Slavic languages.

    <e lm="dirección general">
      <p>
        <l>dirección<b/>general</l>
        <r>dirección<b/>general<s n="n"/><s n="f"/><s n="sg"/></r>
      </p>
    </e>
    <e lm="dirección general">
      <p>
        <l>direcciones<b/>generales</l>
        <r>dirección<b/>general<s n="n"/><s n="f"/><s n="pl"/></r>
      </p>
    </e>

The Polish hack

The Polish analyser uses Metadix to solve the multiword problem, though this is less than desirable:

<pardef n="kamie/ń [nazębn]y__n">
  <e>
    <p>
      <l>ń<b/></l>
      <r>ń<b/></r>
    </p>
    <i><prm/></i>
    <p>
      <l>y</l>
      <r>y<s n="n"/><s n="mi"/><s n="sg"/><s n="nom"/></r>
    </p>
  </e>
  <e>
    <p>
      <l>nia<b/></l>
      <r>ń<b/></r>
    </p>
    <i><prm/></i>
    <p>
      <l>ego</l>
      <r>y<s n="n"/><s n="mi"/><s n="sg"/><s n="gen"/></r>
    </p>
  </e>
  [etc.]
</pardef>

with the following entries:

<e lm="kamień nazębny"><i>kamie</i><par n="kamie/ń [nazębn]y__n" prm="nazębn"/></e>
<e lm="kamień szlachetny"><i>kamie</i><par n="kamie/ń [nazębn]y__n" prm="szlachetn"/></e>
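
To make the effect of the parameter concrete, here is a small Python sketch that expands the two paradigm rows shown above for a given stem and prm value. It is a hand-written toy, not how Metadix is implemented; PARADIGM and expand are hypothetical names, and only the nominative and genitive rows from the listing are included (the "[etc.]" rows are left out).

```python
# Toy expansion of the parameterised paradigm; NOT how Metadix is implemented.
# Each row: (left ending of part 1, right ending of part 1,
#            left ending of part 2, right ending of part 2, tags),
# copied by hand from the two <e> elements shown in the pardef.
PARADIGM = [
    ("ń", "ń", "y", "y", "<n><mi><sg><nom>"),
    ("nia", "ń", "ego", "y", "<n><mi><sg><gen>"),
]


def expand(stem, prm):
    """Return {surface form: analysis} for one entry using the paradigm."""
    forms = {}
    for l1, r1, l2, r2, tags in PARADIGM:
        # <prm/> is replaced by the entry's prm value on both sides.
        surface = stem + l1 + " " + prm + l2
        analysis = stem + r1 + " " + prm + r2 + tags
        forms[surface] = analysis
    return forms


for surface, analysis in expand("kamie", "nazębn").items():
    print(surface, "->", analysis)
```

Calling expand("kamie", "szlachetn") reuses the same rows for the second entry, which is the point of the parameter: one paradigm serves every noun+adjective pair with these inflection classes.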

The Nynorsk hack

(See this mailing list discussion for alternative versions.)

What we want:

anbefale<vblex> => rå til
anbefale<vblex> ikke<adv> => rå ikkje til
publisere<vblex> => gje ut
publisere<vblex> helst<adv> daglig<adv> => gje helst dagleg ut

i.e. we want a simple Bokmål verb translated into a particle verb, and any following string of adverbs should be placed between the (inflected) verb and the (uninflected/invariant) particle.

The hack:

For generation we don't actually need the multiwords in monodix (but it doesn't hurt). We have the regular multiword entry in bidix:

 <e>       <p><l>rå<g><b/>til</g></l><r>anbefale</r></p><par n="vblex"/></e>

and the transfer rule that matches "vblex adv" writes

      <out>
        <lu>
          <clip pos="1" side="tl" part="lemh"/>
          <clip pos="1" side="tl" part="a_verb"/>
          <clip pos="1" side="tl" part="temps"/>
        </lu>
        
        <lu><clip pos="2" side="tl" part="whole"/></lu>
        
        <lu><clip pos="1" side="tl" part="lemq"/></lu>
      </out>

So now transfer will give us the following result:

 echo '^anbefale<vblex><pret>$ ^ikke<adv>$' | apertium-transfer apertium-nn-nb.nb-nn.t1x nb-nn.t1x.bin nb-nn.autobil.bin
 ^rå<vblex><pret>$ ^ikkje<adv>$ ^# til$
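
The ordering produced by the <out> block can be sketched in Python as below. This is a toy model of the output step only, not apertium-transfer itself; particle_verb_out is a hypothetical name, and the sketch assumes the target-language verb has already been split into lemh, tags and lemq. Accepting a list of adverbs is a generalisation for illustration; the rule described above matches a single adverb.

```python
# Toy model of the <out> block above; NOT apertium-transfer itself.
# Assumes the verb is already split into lemma head, tags and lemma queue,
# and that the adverbs are already-translated lexical units.


def particle_verb_out(lemh, tags, lemq, adverbs):
    """Emit the verb head, then each adverb, then the lemma queue as its own unit."""
    units = ["^" + lemh + tags + "$"]
    units.extend("^" + adv + "$" for adv in adverbs)
    units.append("^" + lemq + "$")
    return " ".join(units)


print(particle_verb_out("rå", "<vblex><pret>", "# til", ["ikkje<adv>"]))
# prints: ^rå<vblex><pret>$ ^ikkje<adv>$ ^# til$
```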

Thus we have three "lemmas" which need dictionary entries in generation. The first two ("rå" and "ikkje") are in there already as regular simple entries; the last one is "# til", which we add in this manner:

   <e lm="# til" r="RL"><p><l>til</l><r># til</r></p></e>

Ugly, but it works. And since there are not very many such particles, the Nynorsk monodix doesn't need that many ugly entries.


Of course, the Nynorsk monodix could also have "regular" entries for multiwords with inner inflection for catching "rå til" when there are no adverbs between the two, but we won't be able to analyse "rå ikkje/helst/dagleg til" with the above method.

See also