Multiwords

From Apertium
Revision as of 09:07, 4 June 2012 by Bech (Start of translation)

Overview

lttoolbox currently offers three mechanisms for creating multiwords, of varying complexity:

  1. simple insertion of blanks; use this if you want a word with an internal space but inflection only at the end
    • entry: <e><i>record<b/>player</i><par n="house__n"/></e>
    • analysis: ^record player/record player<n><sg>$
    • analysis: ^record players/record player<n><pl>$
  2. <g/> is used (in combination with <b/>) when you have inflection in the middle of the word and an invariant part at the end
    • entry: <e><i>café</i><par n="house__n"/><p><l><b/>au<b/>lait</l><r><g><b/>au<b/>lait</g></r></p></e>
    • analysis: ^café au lait/café<n><sg># au lait$
    • analysis: ^cafés au lait/café<n><pl># au lait$
  3. <j/> is used when you want the multiword to be split into two lexical units, each with its own analysis (set of tags), where each part can vary independently
    • entry: <e><i>wr</i><par n="wr/ite__vblex"/><p><l><b/>about</l><r><j/>about<s n="pr"/></r></p></e>
    • analysis: ^write about/write<vblex><inf>+about<pr>/write<vblex><pres>+about<pr>$
    • analysis: ^writes about/write<vblex><pri><p3><sg>+about<pr>$
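
The three output shapes in the list above can be told apart mechanically. The following Python sketch splits an analyser output line into its surface form and analyses, and each analysis into lexical units joined by "+", each with a lemma, its tags and an optional invariant part introduced by "#". The helper names are invented for illustration and are not part of lttoolbox; the parsing is deliberately naive and assumes no escaped characters.

```python
import re

TAG = re.compile(r"<([^<>]+)>")

def parse_unit(unit):
    """Split one lexical unit into (lemma, tags, queue).

    The queue is the invariant part introduced by '#', e.g. '# au lait';
    it is the empty string when the unit has no invariant part.
    """
    head, sep, rest = unit.partition("#")
    queue = sep + rest                # keep the '#' so the queue can be re-attached
    lemma = head.split("<", 1)[0]     # text before the first tag
    tags = TAG.findall(head)          # tags that precede the queue
    return lemma, tags, queue

def parse_analysis(line):
    """Parse '^surface/analysis1/analysis2$' into (surface, analyses)."""
    surface, *analyses = line.strip()[1:-1].split("/")
    return surface, [[parse_unit(u) for u in a.split("+")] for a in analyses]

for example in (
    "^record players/record player<n><pl>$",
    "^café au lait/café<n><sg># au lait$",
    "^writes about/write<vblex><pri><p3><sg>+about<pr>$",
):
    print(parse_analysis(example))
```

A blank inside the lemma (mechanism 1) shows up as a space in the lemma, <g/> (mechanism 2) as a non-empty queue, and <j/> (mechanism 3) as more than one unit per analysis.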

More information on this can be found under Simple usage below and in the documentation (esp. sec. 3.1.2.6).


The following kinds of multiwords are not very well supported at the moment:

  • Agreement multiwords: complex multiwords where two (or more) parts show some kind of agreement or tag dependency (or where certain tag combinations are illegal)
    • lt-mwpp takes a file which specifies which lemma combinations are multiwords and which tags need to agree, and generates all the legal combinations in the lttoolbox dix format
  • Discontiguous multiwords: multiwords with an arbitrary number of unrelated words in between, e.g. the separable verbs in Germanic languages

(but see hacks below)

Simple usage

Simple usage of <g/> and <b/>

Here is an example from English to Esperanto with inner inflection followed by an invariant part containing spaces.

The English monodix (en.dix) contains:

<e lm="become"><i>bec</i><par n="bec/ome__vblex"/></e>
<e lm="become acquainted">
  <i>bec</i>
  <par n="bec/ome__vblex"/>
  <p>
    <l><b/>acquainted</l>  
    <r><g><b/>acquainted</g></r>    
  </p>
</e>
<e lm="become acquainted with">
  <i>bec</i>
  <par n="bec/ome__vblex"/>
  <p>
    <l><b/>acquainted<b/>with</l>
    <r><g><b/>acquainted<b/>with</g></r>
  </p>
</e>

So become is conjugated as a normal verb and the rest is fixed (invariant). Note that <b/> is a space (blank) and that the fixed words are inside <g> </g>.

In Esperanto, "become" is "iĝi" (or "fariĝi"), "become acquainted" is "konatiĝi" and "become acquainted with" is "konatiĝi kun". The iĝi/konatiĝi should be conjugated according to become. Thus the bidix entries are

<e><p><l>iĝi<s n="vblex"/></l><r>become<s n="vblex"/></r></p></e>
<e><p><l>konatiĝi<s n="vblex"/></l><r>become<g><b/>acquainted</g><s n="vblex"/></r></p></e>
<e><p><l>konatiĝi<g><b/>kun</g><s n="vblex"/></l><r>become<g><b/>acquainted<b/>with</g><s n="vblex"/></r></p></e>

And the Esperanto (eo) monodix contains:

<e lm="iĝi"><i>iĝ</i><par n="verb__vblex"/></e>
<e lm="konatiĝi"><i>konatiĝ</i><par n="verb__vblex"/></e>
<e lm="konatiĝi kun">
  <i>konatiĝ</i>
  <par n="verb__vblex"/>
  <p>
    <l><b/>kun</l>
    <r><g><b/>kun</g></r>
  </p>
</e>	

Note how the English fixed words <g><b/>acquainted<b/>with</g> become <g><b/>kun</g>.

Also note that you need at least one verbal transfer rule to ensure that the invariant part ("lemq") is put after the morphological tags (a_verb, temps):

    <rule comment="VBLEX">
      <pattern>
        <pattern-item n="vblex"/>
      </pattern>
      <action>
        <out>
          <lu>	    
            <clip pos="1" side="tl" part="lemh"/>
            <clip pos="1" side="tl" part="a_verb"/>
            <clip pos="1" side="tl" part="temps"/>
            <clip pos="1" side="tl" part="lemq"/>
          </lu>
        </out>
      </action>
    </rule>
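
The effect of this rule can be pictured with a short Python sketch. It assumes that the lexical unit reaches the rule with the invariant part still attached to the lemma (for example konatiĝi# kun<vblex><inf>), and it treats everything before "#" as lemh and everything from "#" onwards as lemq. The function name is hypothetical and the <inf> tag is only an illustrative choice; this is not how apertium-transfer is implemented.

```python
def move_queue_after_tags(lu):
    """Turn 'lemh# queue<tags>' into '^lemh<tags># queue$'."""
    lemma, lt, tags = lu.partition("<")
    tags = lt + tags                      # all tags, e.g. '<vblex><inf>'
    lemh, hash_, rest = lemma.partition("#")
    lemq = hash_ + rest                   # '' when there is no invariant part
    return "^" + lemh + tags + lemq + "$"

print(move_queue_after_tags("konatiĝi# kun<vblex><inf>"))
print(move_queue_after_tags("iĝi<vblex><inf>"))
```

The first call yields a unit with the tags between "konatiĝi" and "# kun", which is the shape of the monodix analyses shown in the overview; the second shows that a verb without an invariant part passes through unchanged apart from the added delimiters.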

Simple usage of <j/>

The documentation gives the following example from monodix:

<e lm="del" r="LR"> 
  <p> 
    <l>del</l> 
    <r>de<s n="pr"/><j/>el<s n="det"/><s n="def"/><s n="m"/><s n="sg"/></r> 
  </p> 
</e> 

(This is marked r="LR" and so will only be used in analysis.) When "del" is read, the output from the analyser is

^del/de<pr>+el<det><def><m><sg>$

This is passed as-is through the tagger, but apertium-pretransfer turns it into

^de<pr>$ ^el<det><def><m><sg>$

before bidix lookup.
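
As a rough illustration of this splitting step, the Python sketch below cuts a tagged lexical unit at each "+" and wraps every part in its own ^...$ delimiters. It only mimics the behaviour described above for the "del" example; the function name is made up, and escaped "+" characters are not handled.

```python
def split_joined(lu):
    """'^a<x>+b<y>$' -> '^a<x>$ ^b<y>$' (naive: assumes no escaped '+')."""
    body = lu.strip()[1:-1]               # drop the outer '^' and '$'
    return " ".join("^" + part + "$" for part in body.split("+"))

print(split_joined("^de<pr>+el<det><def><m><sg>$"))
```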

Complicated cases

It's possible to have fairly complex multiword combinations.

    <e lm="zračna luka">
      <i>zračn</i>
      <par n="zračn/a__adj"/>
      <p>
        <l><b/>luk</l>
        <r><g><b/>luk</g></r>
      </p>
      <par n="stolic/a__n"/>
    </e>
$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin 
^zračna luka/zračna<adj><f><sg><nom># luka<n><f><gen><pl>/zračna<adj><f><sg><nom># luka<n><f><nom><sg>$

$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin  | apertium-tagger -g sh-mk.prob 
^zračna<adj><f><sg><nom># luka<n><f><gen><pl>$

$ echo "zračna luka" |  lt-proc sh-mk.automorf.bin  | apertium-tagger -g sh-mk.prob  | apertium-pretransfer
^zračna# luka<adj><f><sg><nom><n><f><gen><pl>$
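
The last step of the pipeline above is where the trouble starts, so it is worth spelling out. The following Python sketch reproduces only what the output shows: the text introduced by "#" is moved up next to the lemma, and all tags, from both the adjective and the noun, end up in a single run after it. It is a guess at the observable behaviour, not the actual code of apertium-pretransfer, and the function name is invented.

```python
import re

def move_queue_before_tags(lu):
    """Move the '# ...' text next to the lemma and collect all tags after it."""
    body = lu.strip()[1:-1]                          # drop '^' and '$'
    lemma = body.split("<", 1)[0].split("#", 1)[0]   # text before first tag or '#'
    tags = "".join(re.findall(r"<[^<>]+>", body))    # every tag, in original order
    match = re.search(r"#[^<]*", body)               # '#' plus the text up to the next tag
    queue = match.group(0) if match else ""
    return "^" + lemma + queue + tags + "$"

print(move_queue_before_tags("^zračna<adj><f><sg><nom># luka<n><f><gen><pl>$"))
```

Once the two tag runs are concatenated there is no longer any marker of where the adjective's tags end and the noun's begin, which is the problem discussed next.
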
Need to consider:
  • Analysis
  • Transfer (e.g. in the bidix)
  • Generation
  • Head-initial and head-final multiwords (e.g. adj+noun and phrasal verbs)
Problems:
  • How to resolve ^zračna# luka<adj><f><sg><nom><n><f><gen><pl>$ in the bidix?
Solutions:
  • Have two paradigms for each adjective, one with tags, one without. (bad)
    This would leave us with ^zračna luka<n><f><gen><pl>$ (basically an orthographic paradigm).
  • Have more than one entry per multiword: this is done in apertium-es-ca, see "dirección general", "direcciones generales". (bad)
  • Have a parameterised paradigm that outputs a paradigm with symbols when called one way and a paradigm without symbols when called another way.
    This would only work in one direction: the problem comes when we try to generate. How do we get the adjective to agree with the noun?

The Spanish hack

This is how the current apertium-es-ca pair handles it: one full-form entry per inflected form of the multiword. That is just about tenable for Spanish, but it is not workable for Slavic languages, with their much richer inflection.

    <e lm="dirección general">
      <p>
        <l>dirección<b/>general</l>
        <r>dirección<b/>general<s n="n"/><s n="f"/><s n="sg"/></r>
      </p>
    </e>
    <e lm="dirección general">
      <p>
        <l>direcciones<b/>generales</l>
        <r>dirección<b/>general<s n="n"/><s n="f"/><s n="pl"/></r>
      </p>
    </e>
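
Entries of this kind are repetitive enough to be generated by a script. The Python sketch below builds one such full-form entry from a lemma, a surface form and a list of tags. It is only an illustration of the idea; the function name and its interface are hypothetical and are not taken from apertium-es-ca or from lt-mwpp.

```python
from xml.sax.saxutils import escape

def form_entry(lemma, surface, tags):
    """Build one <e> element mapping a full surface form to lemma + tags."""
    def blanks(text):
        # Escape XML metacharacters, then write each space as <b/>.
        return escape(text).replace(" ", "<b/>")
    symbols = "".join('<s n="%s"/>' % tag for tag in tags)
    return '<e lm="%s"><p><l>%s</l><r>%s%s</r></p></e>' % (
        escape(lemma), blanks(surface), blanks(lemma), symbols)

for surface, tags in (
    ("dirección general", ["n", "f", "sg"]),
    ("direcciones generales", ["n", "f", "pl"]),
):
    print(form_entry("dirección general", surface, tags))
```

The number of generated entries grows with the number of inflected forms, which is why the approach scales poorly to languages with case inflection.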

The Polish hack

The Polish analyser uses Metadix to solve the multiword problem, though this is less than desirable:

<pardef n="kamie/ń [nazębn]y__n">
  <e>
    <p>
      <l>ń<b/></l>
      <r>ń<b/></r>
    </p>
    <i><prm/></i>
    <p>
      <l>y</l>
      <r>y<s n="n"/><s n="mi"/><s n="sg"/><s n="nom"/></r>
    </p>
  </e>
  <e>
    <p>
      <l>nia<b/></l>
      <r>ń<b/></r>
    </p>
    <i><prm/></i>
    <p>
      <l>ego</l>
      <r>y<s n="n"/><s n="mi"/><s n="sg"/><s n="gen"/></r>
    </p>
  </e>
  [etc.]
</pardef>

with the following entries:

<e lm="kamień nazębny"><i>kamie</i><par n="kamie/ń [nazębn]y__n" prm="nazębn"/></e>
<e lm="kamień szlachetny"><i>kamie</i><par n="kamie/ń [nazębn]y__n" prm="szlachetn"/></e>
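
To see what the parameter buys, the following Python sketch expands the two paradigm entries shown above for a given stem and parameter value, producing surface form and analysis pairs. The data structure and function name are assumptions made for illustration, only the two cases visible in the pardef are included, and this is not how Metadix itself performs the expansion.

```python
# (left before prm, left after prm, right before prm, right after prm, tags),
# copied from the two <e> elements of the pardef; other cases are omitted.
PARADIGM = [
    ("ń ", "y", "ń ", "y", ["n", "mi", "sg", "nom"]),
    ("nia ", "ego", "ń ", "y", ["n", "mi", "sg", "gen"]),
]

def expand(stem, prm):
    """Return (surface form, analysis) pairs for one parameterised entry."""
    pairs = []
    for l_pre, l_post, r_pre, r_post, tags in PARADIGM:
        surface = stem + l_pre + prm + l_post
        analysis = stem + r_pre + prm + r_post + "".join("<%s>" % t for t in tags)
        pairs.append((surface, analysis))
    return pairs

for prm in ("nazębn", "szlachetn"):
    print(expand("kamie", prm))
```

One paradigm serves every adjective that inflects the same way, but a separate paradigm is still needed for each noun and adjective inflection class combination.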

The Nynorsk hack

(See this mailing list discussion for alternative versions.)

What we want:

anbefale<vblex> => rå til
anbefale<vblex> ikke<adv> => rå ikkje til
publisere<vblex> => gje ut
publisere<vblex> helst<adv> daglig<adv> => gje helst dagleg ut

That is, we want a simple Bokmål verb translated into a particle verb, and any following string of adverbs should be placed between the (inflected) verb and the (uninflected, invariant) particle.

The hack:

For generation we don't actually need the multiwords in the monodix (but having them doesn't hurt). We have the regular multiword entry in the bidix:

 <e><p><l>rå<g><b/>til</g></l><r>anbefale</r></p><par n="vblex"/></e>

and the transfer rule that matches "vblex adv" writes

      <out>
        <lu>
          <clip pos="1" side="tl" part="lemh"/>
          <clip pos="1" side="tl" part="a_verb"/>
          <clip pos="1" side="tl" part="temps"/>
        </lu>
        
        <lu><clip pos="2" side="tl" part="whole"/></lu>
        
        <lu><clip pos="1" side="tl" part="lemq"/></lu>
      </out>

So now transfer will give us the following result:

 echo '^anbefale<vblex><pret>$ ^ikke<adv>$' | apertium-transfer apertium-nn-nb.nb-nn.t1x nb-nn.t1x.bin nb-nn.autobil.bin
 ^rå<vblex><pret>$ ^ikkje<adv>$ ^# til$
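
The reordering done by this rule can be sketched in a few lines of Python. The sketch assumes the verb has already been translated by the bidix and still carries its particle as a "#" part on the lemma (rå# til<vblex><pret>), and that the adverbs have been translated as well. The function name and argument format are invented for illustration; the real work is done by the transfer rule.

```python
def particle_verb_with_adverbs(verb, adverbs):
    """Place the adverbs between the inflected verb and its invariant particle.

    verb    -- 'lemh# particle<tags>', e.g. 'rå# til<vblex><pret>'
    adverbs -- list of units such as 'ikkje<adv>'
    """
    lemma, lt, tags = verb.partition("<")
    lemh, hash_, rest = lemma.partition("#")
    units = [lemh + lt + tags] + list(adverbs)
    if hash_:                              # only emit a particle unit if there is one
        units.append(hash_ + rest)
    return " ".join("^" + unit + "$" for unit in units)

print(particle_verb_with_adverbs("rå# til<vblex><pret>", ["ikkje<adv>"]))
print(particle_verb_with_adverbs("gje# ut<vblex><pret>", ["helst<adv>", "dagleg<adv>"]))
```

The particle comes out as a lexical unit of its own whose "lemma" starts with "#", which is why the generator needs a matching entry.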

Thus we have three "lemmas" which need dictionary entries in generation. The first two ("rå" and "ikkje") are already there as regular simple entries; the last one is "# til", which we add in this manner:

   <e lm="# til" r="RL"><p><l>til</l><r># til</r></p></e>

Ugly, but it works. And since there are not very many such particles, the Nynorsk monodix doesn't need that many ugly entries.


Of course, the Nynorsk monodix could also have "regular" entries for multiwords with inner inflection, to catch "rå til" when there are no adverbs between the two parts, but we won't be able to analyse "rå ikkje/helst/dagleg til" with the above method.

See also