Difference between revisions of "Talk:Multiwords"

From Apertium


Latest revision as of 08:35, 21 March 2014

Yet another

<mw n="dirección general">
  <lu lemma="dirección" tags="n.*" head/>
  <lu lemma="general" tags="adj.mf.*"/>
</mw>

<mw n="zračna luka">
  <lu lemma="zračna" tags="adj.*"/>
  <lu lemma="luka" tags="n.*" head/>
</mw>

Tags from the lu marked "head" are preserved, whereas the tags of the other lu's are removed. So the output would be:

^dirección<n><f><sg>$ ^general<adj><mf><sg>$ → ^dirección general<n><f><sg>$

While generation would look like:

^dirección general<n><f><pl>$ → ^dirección<n><f><pl>$ ^general<adj><mf><pl>$

Note how the tags given in each lu's "tags" attribute (here adj.mf) are preserved, whereas the rest are copied from the multiword.

How could this be coded?
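
One possible answer for the analysis direction, as a minimal Python sketch rather than an existing Apertium module. The names MW_ENTRIES, parse_lu and merge_mw are hypothetical, and the "tags" patterns are simplified to plain tag prefixes (so "adj.mf.*" becomes the prefix adj, mf):

```python
import re

# Hypothetical in-memory form of the <mw> entries above:
# multiword lemma -> list of (lemma, required tag prefix, is_head)
MW_ENTRIES = {
    "dirección general": [("dirección", ["n"], True),
                          ("general", ["adj", "mf"], False)],
    "zračna luka": [("zračna", ["adj"], False),
                    ("luka", ["n"], True)],
}

LU_RE = re.compile(r"\^([^<$]+)((?:<[^>]+>)*)\$")


def parse_lu(text):
    """Split '^lemma<t1><t2>$' into (lemma, [t1, t2])."""
    m = LU_RE.fullmatch(text)
    if m is None:
        raise ValueError("not a lexical unit: " + text)
    return m.group(1), re.findall(r"<([^>]+)>", m.group(2))


def merge_mw(units):
    """Merge a sequence of lexical units into one multiword unit.

    Returns the merged unit carrying only the head's tags,
    or None if no entry matches.
    """
    parsed = [parse_lu(u) for u in units]
    for name, parts in MW_ENTRIES.items():
        if len(parts) != len(parsed):
            continue
        head_tags = None
        for (lemma, tags), (want, prefix, is_head) in zip(parsed, parts):
            if lemma != want or tags[:len(prefix)] != prefix:
                break  # this entry does not match, try the next one
            if is_head:
                head_tags = tags
        else:
            return "^" + name + "".join("<%s>" % t for t in head_tags) + "$"
    return None
```

Calling merge_mw on the two units of the "dirección general" example keeps the noun's tags and drops the adjective's. Generation would need the reverse mapping plus a decision about which tags to copy onto the non-head units, which this sketch does not attempt.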

Another option

<spectie> jimregan, you might be able to just do it with a dictionary
<jimregan> I'm listening
<spectie> ok
<spectie> so imagine:
<jimregan> (err... well, reading :)
<spectie> ah no
<spectie> because you'd need to enumerate the tags
<spectie> although, that might not be so difficult if we have lt-expand
<spectie> ok
<spectie> here:
<spectie> <e>
<spectie>   <p>
<spectie>     <l>strajk<s n="n"/><s n="m"/><s n="sg"/><s n="nom"/><b/>włoski<s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></l>
<spectie>     <r>strajk<b/>włoski<s n="n"/><s n="m"/><s n="sg"/><s n="nom"/></r>
<spectie>   </p>
<spectie> </e>
<spectie>  
<spectie> then you just run it through the lt-proc again with a special mode set
<spectie> you'd run that before the transfer
<spectie> and it would work for both analysis and generation

And another


<jimregan> something like this
<jimregan> <multiword n="noun-adj_np.top_f">
<jimregan>  <replacements>
<jimregan>   <replace><l><s n="adj"/></l><r><s n="np"/><s n="top"/></r></replace>
<jimregan>  </replacements>
<jimregan>  <join>
<jimregan>   <i><s n="f"/></i>
<jimregan>   <i><s n="nom"/></i>
<jimregan>   <i><s n="gen"/></i>
<jimregan>   <i><s n="acc"/></i>
<jimregan>   <i><s n="dat"/></i>
<jimregan>   <i><s n="loc"/></i>
<jimregan>   <i><s n="ins"/></i>
<jimregan>   <i><s n="voc"/></i>
<jimregan>  </join>
<jimregan>  <restrict>
<jimregan>   <i><s n="f"/></i>
<jimregan>   <i><s n="sg"/></i>
<jimregan>  </restrict>
<jimregan> </multiword>
<jimregan> <multiword n="noun-adj_noun">
<jimregan>  <replacements>
<jimregan>   <replace><l><s n="adj"/></l><r><s n="n"/></r></replace>
<jimregan>   <replace><l><s n="m"/></l><r><s n="m3"/></r></replace>
<jimregan>  </replacements>
<jimregan>  <join>
<jimregan>   <i><s n="nom"/></i>
<jimregan>   <i><s n="gen"/></i>
<jimregan>   <i><s n="acc"/></i>
<jimregan>   <i><s n="dat"/></i>
<jimregan>   <i><s n="loc"/></i>
<jimregan>   <i><s n="ins"/></i>
<jimregan>   <i><s n="voc"/></i>
<jimregan>  </join>
<jimregan>  <restrict>
<jimregan>   <i><s n="sg"/></i>
<jimregan>  </restrict>
<jimregan> </multiword>
<jimregan> <mw lm="Wielka Brytania" type="noun-adj_np.top_f">
<jimregan>  <i>Wiel</i><par n="wiel/ki__adj"/>
<jimregan>  <i><b/></i>
<jimregan>  <i>Brytani</i><par n="Francj/a__np"/>
<jimregan> </mw>
<jimregan> <mw lm="strajk włoski" type="noun-adj_noun">
<jimregan>  <i>strajk</i><par n="maluch/__n"/>
<jimregan>  <i><b/></i>
<jimregan>  <i>włos</i><par n="pols/ki__adj"/>
<jimregan> </mw>
<spectie> hmm
<spectie> whats the "join" thing ?
<jimregan> oops. wasn't meant to have '<i><s n="f"/></i>' in the '<join>' of the first, just in <restrict>
<jimregan> where that tag exists in each parameter, use that as output
<spectie> where would this be called ?
<spectie> after analysis ?
<jimregan> possibly, but for the moment I'm thinking of adding it as a generated subsection of the analyser
<spectie> what do you reckon to my idea ?
<jimregan> each 'mw' would be expanded to an '<e>'
<jimregan> the problem is that I don't want to keep the adjective pardefs as simple as possible
<spectie> you don't ?
<jimregan> 'strajk wloski' would have to be 'm3', not 'm'
<spectie> aha
<jimregan> but in most cases it doesn't make sense to have the adjectives consider masculine gender subtypes separately
<spectie> ah ok
<spectie> i was thinking of putting in mine after tagging
<jimregan> so I want to have a stylesheet replace 'adj.m' with 'n.m3' in the strajk wloski case
<spectie> hmm
<spectie> it would work
<spectie> you could make the "<join>" thing a paradigm
<spectie> e.g. <pardef n="cases"><e><i><s n="nom"/></i></e> ... </pardef>     <join><par n="cases"/></join>
<jimregan> aha
<jimregan> yes

agreement multiwords (complex multiwords) in bidix/transfer

The assumption in lt-mwpp is that we can treat an adj+noun multiword as a single noun, similarly to creating a single entry for a compound noun. We can then have a bidix entry like

<e><p><l>mátkedihtor<s n="N"/></l><r>bærbar<b/>datamaskin<s n="n"/><s n="m"/></r></p><par n="__n"/></e>

even though "bærbar" is an adjective. However, there is a problem when generating Bokmål definite nouns from Northern Sámi. If they are preceded by an adjective, they need a determiner inserted, while bare definite nouns should not have a determiner inserted. If treated like a single noun, we get:

$ echo Dihtor lei doppe|apertium -d . sme-nob
Datamaskinen var der borte

$ echo Ođđa dihtor lei doppe|apertium -d . sme-nob
Den nye datamaskinen var der borte

$ echo Mátkedihtor lei doppe|apertium -d . sme-nob
Bærbare datamaskinen var der borte # should have "den" before it

The simplest solution I can think of is to just add a certain tag to these mwe's (in the mwe dictionary); the transfer rules can then insert a determiner if we have an adj, or if the noun has that tag.

This is not a problem now that we have lt-proc -b

Possible discontiguous + agreement mwe module

The module for processing discontiguous mwe's could also check for agreement before chunking these. In that case, the two types of mwe's could be merged into one module, which would let hfst users also merge agreement mwe's.

The module runs after pretransfer, so with

$ echo Hij lijkt sterk op mij |apertium -d . nl-de-pretransfer
^Prpers<prn><subj><p3><m><sg>$ ^lijken<vblex><pri><p2><sg>$ ^sterk<adv>$ ^op<pr>$ ^prpers<prn><obj><p1><mf><sg>$^.<sent>$

we move the particle to get:

^Prpers<prn><subj><p3><m><sg>$ ^lijken# op<vblex><pri><p2><sg>$ ^sterk<adv>$ ^prpers<prn><obj><p1><mf><sg>$^.<sent>$
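
The particle movement above can be sketched in Python as follows. This is only an illustration of the idea, not the proposed module: parse_stream and join_particle are hypothetical names, the verb and preposition categories are hard-coded to vblex and pr, and the allowed intervening categories are passed as plain first tags.

```python
import re

LU_RE = re.compile(r"\^([^<$]+)((?:<[^>]+>)*)\$")


def parse_stream(line):
    """Return a list of (lemma, [tags]) for every ^...$ unit in the line."""
    return [(m.group(1), re.findall(r"<([^>]+)>", m.group(2)))
            for m in LU_RE.finditer(line)]


def join_particle(units, verb, particle, merged, allowed=("adv",)):
    """Attach `particle` to an earlier `verb` if only allowed units intervene.

    units:   list of (lemma, [tags]) as returned by parse_stream
    merged:  lemma of the resulting multiword, e.g. "lijken# op"
    allowed: first tags of the units that may stand between the two parts
    """
    out = list(units)
    for i, (lemma, tags) in enumerate(out):
        if lemma != verb or not tags or tags[0] != "vblex":
            continue
        for j in range(i + 1, len(out)):
            lemma2, tags2 = out[j]
            if lemma2 == particle and tags2 == ["pr"]:
                out[i] = (merged, tags)  # the verb keeps its own tags
                del out[j]               # the particle is absorbed
                return out
            if not tags2 or tags2[0] not in allowed:
                break  # a unit that is not allowed blocks the match
    return out
```

With allowed=("adv",) this behaves like a Dutch-style definition that only lets adverbs intervene; a longer tuple would correspond to a more permissive definition.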

The __dutch mwedef below could be used to check this. However, since we already check for the existence of certain tags on both parts, the module could be expanded to allow transfer-style <test> elements; see the __agreement mwedef below.

<mwdictionary>
  <section-def-cats>
    <def-cat n="det">
      <cat-item tags="det.*"/>
      <cat-item lemma="foo" tags="prn"/> <!-- treat foo<prn> as a determiner wrt. mwe's -->
    </def-cat>
    <def-cat n="adv">
      <cat-item tags="adv"/>
    </def-cat>
    <def-cat n="noun">
      <cat-item tags="n.*"/>
    </def-cat>
    <def-cat n="adj">
      <cat-item tags="adj.*"/>
    </def-cat>
    <def-cat n="verb">
      <cat-item tags="vblex.*"/>
    </def-cat>
    <def-cat n="prep">
      <cat-item tags="pr"/>
    </def-cat>
    <def-cat n="anything">
      <cat-item tags="*"/>
    </def-cat>
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="nbr">
      <attr-item tags="sg"/>
      <attr-item tags="pl"/>
    </def-attr>
    <def-attr n="gen">
      <attr-item tags="m"/>
      <attr-item tags="f"/>
    </def-attr>
    <def-attr n="art">
      <attr-item tags="ind"/>
      <attr-item tags="def"/>
    </def-attr>
  </section-def-attrs>

  <mwedefs>
    <mwedef n="__dutch">
      <pattern>
        <pattern-item n="verb" />
        <pattern-item n="prep" />
      </pattern>
      <allow>
        <pattern-item n="adv"/> <!-- adv refers to the def-cat above -->
      </allow>
      <tags><clip pos="1" part="tags"/></tags>
    </mwedef>

    <mwedef n="__afrikaans">
      <pattern>
        <pattern-item n="verb" />
        <pattern-item n="prep" />
      </pattern>
      <allow>
        <pattern-item n="adj"/>
        <pattern-item n="noun"/>
        <pattern-item n="det"/>
        <pattern-item n="adv"/>
      </allow>
      <tags><clip pos="1" part="tags"/></tags>
    </mwedef>

    <mwedef n="__norsk">
      <pattern>
        <pattern-item n="verb" />
        <pattern-item n="prep" />
      </pattern>
      <allow>
        <pattern-item n="anything"/>
      </allow>
      <tags><clip pos="1" part="tags"/></tags>
    </mwedef>
    
    <mwedef n="__agreement">
      <pattern>
        <pattern-item n="adj" />
        <pattern-item n="noun" />
      </pattern>
      <allow>
        <!-- allow nothing in between -->
      </allow>
      <test><and>
        <equal><clip pos="1" part="nbr"/><clip pos="2" part="nbr"/></equal>
        <equal><clip pos="1" part="gen"/><clip pos="2" part="gen"/></equal>
      </and></test>
      <tags>
        <lit-tag n="n"/>
        <clip pos="1" part="gen"/>
        <clip pos="1" part="nbr"/>
        <clip pos="1" part="art"/>
      </tags>
    </mwedef>
  </mwedefs>

  <mwes>
    <e><p>
      <l lemma="lijken" />
      <l lemma="op" />
      <r lemma="lijken# op" />
    </p><par n="__dutch"/></e>

    <e><p>
      <l lemma="kondig" />
      <l lemma="aan" />
      <r lemma="aankondig" />
    </p><par n="__afrikaans"/></e>

    <e><p>
      <l lemma="rå" />
      <l lemma="til" />
      <r lemma="rå# til" />
    </p><par n="__norsk"/></e>

    <e><p>
      <l lemma="bærbar" />
      <l lemma="datamaskin" />
      <r lemma="bærbar datamaskin" />
    </p><par n="__agreement"/></e>
  </mwes>
</mwdictionary>
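
To make the intended semantics of the <test> and <tags> parts of __agreement concrete, here is a small Python sketch. It is a guess at how the module could evaluate them, not a description of existing code: clip and agreement_mwe are hypothetical names, and a missing attribute is assumed to clip to the empty string, as in transfer rules.

```python
# Attribute sets copied from <section-def-attrs> in the dictionary above.
ATTRS = {
    "nbr": {"sg", "pl"},
    "gen": {"m", "f"},
    "art": {"ind", "def"},
}


def clip(tags, part):
    """Return the first tag belonging to attribute `part`, or "" if none."""
    for tag in tags:
        if tag in ATTRS[part]:
            return tag
    return ""


def agreement_mwe(adj_tags, noun_tags):
    """Mimic __agreement: test number and gender agreement, then build tags.

    Returns the tag list of the merged unit, or None if the test fails.
    """
    if clip(adj_tags, "nbr") != clip(noun_tags, "nbr"):
        return None
    if clip(adj_tags, "gen") != clip(noun_tags, "gen"):
        return None
    # <lit-tag n="n"/> followed by gen, nbr and art clipped from position 1
    result = ["n"] + [clip(adj_tags, part) for part in ("gen", "nbr", "art")]
    return [tag for tag in result if tag]  # drop attributes that were absent
```

Note that, as in the mwedef, the output tags are all clipped from position 1 (the adjective), so an attribute that only the noun carries would be lost.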

Examples

  • have a cold -> estar agripado