Talk:Multiwords
Yet another
<mw n="dirección general">
  <lu lemma="dirección" tags="n.*" head/>
  <lu lemma="general" tags="adj.mf.*"/>
</mw>
<mw n="zračna luka">
  <lu lemma="zračna" tags="adj.*"/>
  <lu lemma="luka" tags="n.*" head/>
</mw>
Tags from the lu marked "head" are preserved, whereas the tags of the other lus are removed. So the output of analysis would be:
^dirección<n><f><sg>$ ^general<adj><mf><sg>$ → ^dirección general<n><f><sg>$
While generation would look like:
^dirección general<n><f><pl>$ → ^dirección<n><f><pl>$ ^general<adj><mf><pl>$
Note how the tags given in the tags attribute (here adj.mf) are preserved, whereas the rest (here pl) are copied from the head.
- How could this be coded?
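One possible answer, as a minimal sketch rather than a worked-out design: the Python below models the two directions on plain data structures. The names `LU`, `Part`, `merge` and `split` are hypothetical and not part of any Apertium tool. It also assumes that "the rest are copied" means copying the head's tags positionally, starting after the tags fixed by the non-head pattern; that assumption reproduces the dirección general example above but is not confirmed anywhere on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LU:
    """One lexical unit from the stream, e.g. ^dirección<n><f><sg>$."""
    lemma: str
    tags: List[str]


@dataclass
class Part:
    """One <lu> of a (hypothetical) <mw> entry."""
    lemma: str
    pattern: str          # e.g. "n.*" or "adj.mf.*"
    head: bool = False


def fixed_tags(pattern: str) -> List[str]:
    """Tags spelled out in the pattern, without the trailing wildcard."""
    return [t for t in pattern.split(".") if t != "*"]


def matches(part: Part, lu: LU) -> bool:
    """Lemma must be equal; tags must match the pattern (prefix if it ends in *)."""
    if part.lemma != lu.lemma:
        return False
    fixed = fixed_tags(part.pattern)
    if part.pattern.endswith("*"):
        return lu.tags[:len(fixed)] == fixed
    return lu.tags == fixed


def merge(name: str, parts: List[Part], lus: List[LU]) -> Optional[LU]:
    """Analysis direction: join the units and keep only the head's tags."""
    if len(parts) != len(lus):
        return None
    if not all(matches(p, u) for p, u in zip(parts, lus)):
        return None
    head = next(u for p, u in zip(parts, lus) if p.head)
    return LU(name, list(head.tags))


def split(parts: List[Part], mw: LU) -> List[LU]:
    """Generation direction: the head gets all tags; the other parts keep
    their fixed tags and copy the remaining ones positionally (assumption)."""
    out = []
    for p in parts:
        if p.head:
            out.append(LU(p.lemma, list(mw.tags)))
        else:
            fixed = fixed_tags(p.pattern)
            out.append(LU(p.lemma, fixed + mw.tags[len(fixed):]))
    return out


def fmt(lu: LU) -> str:
    """Serialise a unit back to stream notation."""
    return "^" + lu.lemma + "".join(f"<{t}>" for t in lu.tags) + "$"


DIRECCION_GENERAL = [
    Part("dirección", "n.*", head=True),
    Part("general", "adj.mf.*"),
]
```

With these definitions, `merge("dirección general", DIRECCION_GENERAL, ...)` on the two analysed units gives the merged noun, and `split(DIRECCION_GENERAL, ...)` on the plural multiword gives back the two units shown in the generation example.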
Another option
<spectie> jimregan, you might be able to just do it with a dictionary
<jimregan> I'm listening
<spectie> ok
<spectie> so imagine:
<jimregan> (err... well, reading :)
<spectie> ah no
<spectie> because you'd need to enumerate the tags
<spectie> although, that might not be so difficult if we have lt-expand
<spectie> ok
<spectie> here:
<spectie> <e>
<spectie> <p>
<spectie> <l>strajk<s n="n"/><s n="m"/><s n="sg"/><s n="nom"/><b/>włoski<s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></l>
<spectie> <r>strajk<b/>włoski<s n="n"/><s n="m"/><s n="sg"/><s n="nom"/></r>
<spectie> </p>
<spectie> </e>
<spectie>
<spectie> then you just run it through the lt-proc again with a special mode set
<spectie> you'd run that before the transfer
<spectie> and it would work for both analysis and generation
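To make the "enumerate the tags" step concrete, here is a small sketch that builds one such `<e>` entry per tag combination. The helper names `s` and `mw_entry` and the `CASES` list are assumptions made for illustration; in practice the combinations would more likely come from lt-expand output than from a hand-written list.

```python
from typing import List, Tuple

Word = Tuple[str, List[str]]  # (lemma, tags)


def s(tags: List[str]) -> str:
    """Render a tag list as <s n="..."/> elements."""
    return "".join(f'<s n="{t}"/>' for t in tags)


def mw_entry(words: List[Word], head: int = 0) -> str:
    """Left side: every word with its own tags, joined by <b/>.
    Right side: the joined lemmas followed by the head word's tags only."""
    left = "<b/>".join(lemma + s(tags) for lemma, tags in words)
    right = "<b/>".join(lemma for lemma, _ in words) + s(words[head][1])
    return f"<e><p><l>{left}</l><r>{right}</r></p></e>"


# Illustrative case list (assumption); one entry per case in the singular.
CASES = ["nom", "gen", "acc", "dat", "loc", "ins", "voc"]

entries = [
    mw_entry([("strajk", ["n", "m", "sg", case]),
              ("włoski", ["adj", "m", "sg", case])])
    for case in CASES
]
```

The first element of `entries` is the entry from the log above, written on one line.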
And another
<jimregan> something like this
<jimregan> <multiword n="noun-adj_np.top_f">
<jimregan> <replacements>
<jimregan> <replace><l><s n="adj"/></l><r><s n="np"/><s n="top"/></r></replace>
<jimregan> </replacements>
<jimregan> <join>
<jimregan> <i><s n="f"/></i>
<jimregan> <i><s n="nom"/></i>
<jimregan> <i><s n="gen"/></i>
<jimregan> <i><s n="acc"/></i>
<jimregan> <i><s n="dat"/></i>
<jimregan> <i><s n="loc"/></i>
<jimregan> <i><s n="ins"/></i>
<jimregan> <i><s n="voc"/></i>
<jimregan> </join>
<jimregan> <restrict>
<jimregan> <i><s n="f"/></i>
<jimregan> <i><s n="sg"/></i>
<jimregan> </restrict>
<jimregan> </multiword>
<jimregan> <multiword n="noun-adj_noun">
<jimregan> <replacements>
<jimregan> <replace><l><s n="adj"/></l><r><s n="n"/></r></replace>
<jimregan> <replace><l><s n="m"/></l><r><s n="m3"/></r></replace>
<jimregan> </replacements>
<jimregan> <join>
<jimregan> <i><s n="nom"/></i>
<jimregan> <i><s n="gen"/></i>
<jimregan> <i><s n="acc"/></i>
<jimregan> <i><s n="dat"/></i>
<jimregan> <i><s n="loc"/></i>
<jimregan> <i><s n="ins"/></i>
<jimregan> <i><s n="voc"/></i>
<jimregan> </join>
<jimregan> <restrict>
<jimregan> <i><s n="sg"/></i>
<jimregan> </restrict>
<jimregan> </multiword>
<jimregan> <mw lm="Wielka Brytania" type="noun-adj_np.top_f">
<jimregan> <i>Wiel</i><par n="wiel/ki__adj"/>
<jimregan> <i><b/></i>
<jimregan> <i>Brytani</i><par n="Francj/a__np"/>
<jimregan> </mw>
<jimregan> <mw lm="strajk włoski" type="noun-adj_noun">
<jimregan> <i>strajk</i><par n="maluch/__n"/>
<jimregan> <i><b/></i>
<jimregan> <i>włos</i><par n="pols/ki__adj"/>
<jimregan> </mw>
<spectie> hmm
<spectie> whats the "join" thing ?
<jimregan> oops. wasn't meant to have '<i><s n="f"/></i>' in the '<join>' of the first, just in <restrict>
<jimregan> where that tag exists in each parameter, use that as output
<spectie> where would this be called ?
<spectie> after analysis ?
<jimregan> possibly, but for the moment I'm thinking of adding it as a generated subsection of the analyser
<spectie> what do you reckon to my idea ?
<jimregan> each 'mw' would be expanded to an '<e>'
<jimregan> the problem is that I don't want to keep the adjective pardefs as simple as possible
<spectie> you don't ?
<jimregan> 'strajk wloski' would have to be 'm3', not 'm'
<spectie> aha
<jimregan> but in most cases it doesn't make sense to have the adjectives consider masculine gender subtypes separately
<spectie> ah ok
<spectie> i was thinking of putting in mine after tagging
<jimregan> so I want to have a stylesheet replace 'adj.m' with 'n.m3' in the strajk wloski case
<spectie> hmm
<spectie> it would work
<spectie> you could make the "<join>" thing a paradigm
<spectie> e.g. <pardef n="cases"><e><i><s n="nom"/></i></e> ... </pardef> <join><par n="cases"/></join>
<jimregan> aha
<jimregan> yes
Agreement multiwords (complex multiwords) in bidix/transfer
The assumption in lt-mwpp is that we can treat an adj+noun multiword as a single noun, similarly to creating a single entry for a compound noun. We can then have a bidix entry like
<e><p><l>mátkedihtor<s n="N"/></l><r>bærbar<b/>datamaskin<s n="n"/><s n="m"/></r></p><par n="__n"/></e>
even though "bærbar" is an adjective. However, there is a problem when generating Bokmål definite nouns from Northern Sámi. If they are preceded by an adjective, they need a determiner inserted, while bare definite nouns should not have a determiner inserted. If treated like a single noun, we get:
$ echo Dihtor lei doppe|apertium -d . sme-nob
Datamaskinen var der borte
$ echo Ođđa dihtor lei doppe|apertium -d . sme-nob
Den nye datamaskinen var der borte
$ echo Mátkedihtor lei doppe|apertium -d . sme-nob
Bærbare datamaskinen var der borte # should have "den" before it
The simplest solution I can think of is to add a certain tag to these mwe's (in the mwe dictionary). The transfer rules can then insert a determiner if the noun is preceded by an adj or has that tag.
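The condition the transfer rule would test can be sketched as below. This is plain Python over tag lists, not a real transfer rule, and the tag name `mwadj` is purely hypothetical since no tag name has been chosen; the sketch also assumes definiteness is marked by a `def` tag on the noun.

```python
from typing import List

# Hypothetical tag added in the mwe dictionary to adj+noun multiwords.
MWE_ADJ_TAG = "mwadj"


def needs_determiner(noun_tags: List[str], preceded_by_adj: bool) -> bool:
    """A definite noun gets a determiner if an adjective precedes it,
    or if the noun is itself an adj+noun multiword (carries the tag)."""
    if "def" not in noun_tags:
        return False
    return preceded_by_adj or MWE_ADJ_TAG in noun_tags


# "datamaskinen": bare definite noun, no determiner.
bare = needs_determiner(["n", "m", "sg", "def"], preceded_by_adj=False)
# "nye datamaskinen": adjective in front, determiner needed.
with_adj = needs_determiner(["n", "m", "sg", "def"], preceded_by_adj=True)
# "bærbare datamaskinen" as a single tagged multiword noun, determiner needed.
tagged_mwe = needs_determiner(["n", "m", "sg", "def", MWE_ADJ_TAG], preceded_by_adj=False)
```

Under these assumptions the third case, which currently comes out without "den", would be treated like the second one.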
Possible discontiguous + agreement mwe module
The module for processing discontiguous mwe's could also check for agreement before chunking these. In that case, the two types of mwe's could be merged into one module, which would let hfst users also merge agreement mwe's.
The module runs after pretransfer, so with
$ echo Hij lijkt sterk op mij |apertium -d . nl-de-pretransfer
^Prpers<prn><subj><p3><m><sg>$ ^lijken<vblex><pri><p2><sg>$ ^sterk<adv>$ ^op<pr>$ ^prpers<prn><obj><p1><mf><sg>$^.<sent>$
we move the particle to get:
^Prpers<prn><subj><p3><m><sg>$ ^lijken# op<vblex><pri><p2><sg>$ ^sterk<adv>$ ^prpers<prn><obj><p1><mf><sg>$^.<sent>$
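The particle move above can be sketched in a few lines of Python. This is only an illustration of the matching logic under stated assumptions: the function names `parse` and `merge_particle` are hypothetical, the verb is identified by a leading `vblex` tag and the particle by the single tag `pr`, and only units whose first tag is in `allow` may stand between the two parts.

```python
import re
from typing import List, Set, Tuple

Unit = Tuple[str, List[str]]  # (lemma, tags)

# One lexical unit in stream notation: ^lemma<tag><tag>$
UNIT_RE = re.compile(r"\^([^<$]+)((?:<[^>]+>)*)\$")


def parse(stream: str) -> List[Unit]:
    """Split a pretransfer stream into (lemma, tags) pairs."""
    return [(m.group(1), re.findall(r"<([^>]+)>", m.group(2)))
            for m in UNIT_RE.finditer(stream)]


def merge_particle(units: List[Unit], verb: str, particle: str,
                   merged: str, allow: Set[str]) -> List[Unit]:
    """Find verb ... particle with only allowed categories in between,
    rename the verb to the merged lemma and drop the particle unit."""
    for i, (lemma, tags) in enumerate(units):
        if lemma != verb or tags[:1] != ["vblex"]:
            continue
        for j in range(i + 1, len(units)):
            other, other_tags = units[j]
            if other == particle and other_tags == ["pr"]:
                return (units[:i] + [(merged, tags)]
                        + units[i + 1:j] + units[j + 1:])
            if not other_tags or other_tags[0] not in allow:
                break  # something not allowed intervenes; give up on this verb
    return units


STREAM = ("^Prpers<prn><subj><p3><m><sg>$ ^lijken<vblex><pri><p2><sg>$ "
          "^sterk<adv>$ ^op<pr>$ ^prpers<prn><obj><p1><mf><sg>$^.<sent>$")

result = merge_particle(parse(STREAM), "lijken", "op", "lijken# op", {"adv"})
```

Here `result` holds five units, with `lijken# op` carrying the verb's original tags and the `op` unit removed; with an empty `allow` set the adverb blocks the match and the input comes back unchanged.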
The __dutch mwedef below could be used to check this. However, since we already check for the existence of certain tags on both parts, the module could be expanded to allow transfer-type <test>s; see the __agreement mwedef below.
<mwdictionary>
  <section-def-cats>
    <def-cat n="det">
      <cat-item tags="det.*"/>
      <cat-item lemma="foo" tags="prn"/> <!-- treat foo<prn> as a determiner wrt. mwe's -->
    </def-cat>
    <def-cat n="adv">
      <cat-item tags="adv"/>
    </def-cat>
    <def-cat n="noun">
      <cat-item tags="n.*"/>
    </def-cat>
    <def-cat n="adj">
      <cat-item tags="adj.*"/>
    </def-cat>
    <def-cat n="verb">
      <cat-item tags="vblex.*"/>
    </def-cat>
    <def-cat n="prep">
      <cat-item tags="pr"/>
    </def-cat>
    <def-cat n="anything">
      <cat-item tags="*"/>
    </def-cat>
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="nbr">
      <attr-item tags="sg"/>
      <attr-item tags="pl"/>
    </def-attr>
    <def-attr n="gen">
      <attr-item tags="m"/>
      <attr-item tags="f"/>
    </def-attr>
    <def-attr n="art">
      <attr-item tags="ind"/>
      <attr-item tags="def"/>
    </def-attr>
  </section-def-attrs>
  <mwedefs>
    <mwedef n="__dutch">
      <pattern>
        <pattern-item n="verb" />
        <pattern-item n="prep" />
      </pattern>
      <allow>
        <pattern-item n="adv"/> <!-- adv refers to the def-cat above -->
      </allow>
      <tags><clip pos="1" part="tags"/></tags>
    </mwedef>
    <mwedef n="__afrikaans">
      <pattern>
        <pattern-item n="verb" />
        <pattern-item n="prep" />
      </pattern>
      <allow>
        <pattern-item n="adj"/>
        <pattern-item n="noun"/>
        <pattern-item n="det"/>
        <pattern-item n="adv"/>
      </allow>
      <tags><clip pos="1" part="tags"/></tags>
    </mwedef>
    <mwedef n="__norsk">
      <pattern>
        <pattern-item n="verb" />
        <pattern-item n="prep" />
      </pattern>
      <allow>
        <pattern-item n="anything"/>
      </allow>
      <tags><clip pos="1" part="tags"/></tags>
    </mwedef>
    <mwedef n="__agreement">
      <pattern>
        <pattern-item n="adj" />
        <pattern-item n="noun" />
      </pattern>
      <allow>
        <!-- allow nothing in between -->
      </allow>
      <test><and>
        <equal><clip pos="1" part="nbr"/><clip pos="2" part="nbr"/></equal>
        <equal><clip pos="1" part="gen"/><clip pos="2" part="gen"/></equal>
      </and></test>
      <tags>
        <lit-tag n="n"/>
        <clip pos="1" part="gen"/>
        <clip pos="1" part="nbr"/>
        <clip pos="1" part="art"/>
      </tags>
    </mwedef>
  </mwedefs>
  <mwes>
    <e><p>
      <l lemma="lijken" />
      <l lemma="op" />
      <r lemma="lijken# op" />
    </p><par n="__dutch"/></e>
    <e><p>
      <l lemma="kondig" />
      <l lemma="aan" />
      <r lemma="aankondig" />
    </p><par n="__afrikaans"/></e>
    <e><p>
      <l lemma="rå" />
      <l lemma="til" />
      <r lemma="rå# til" />
    </p><par n="__norsk"/></e>
    <e><p>
      <l lemma="bærbar" />
      <l lemma="datamaskin" />
      <r lemma="bærbar datamaskin" />
    </p><par n="__agreement"/></e>
  </mwes>
</mwdictionary>
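How the `__agreement` mwedef might be evaluated can be sketched as follows. This is an illustration only: `clip` and `apply_agreement` are hypothetical names, the attribute sets are copied from the def-attrs above, and a missing attribute is assumed to clip to an empty string (so two units that both lack a gender tag count as agreeing). The tag sequences in the example are made up to fit those def-attrs and are not real analyser output.

```python
from typing import Dict, List, Optional, Set, Tuple

Unit = Tuple[str, List[str]]  # (lemma, tags)

# Mirrors <section-def-attrs> in the dictionary above.
ATTRS: Dict[str, Set[str]] = {
    "nbr": {"sg", "pl"},
    "gen": {"m", "f"},
    "art": {"ind", "def"},
}


def clip(tags: List[str], part: str) -> str:
    """First tag belonging to the attribute, or "" if none is present."""
    return next((t for t in tags if t in ATTRS[part]), "")


def apply_agreement(adj: Unit, noun: Unit, merged: str) -> Optional[Unit]:
    """Pattern adj + noun with nothing in between; merge only if number and
    gender agree; output tags are n plus gen, nbr and art clipped from
    position 1 (the adjective), as in the <tags> element."""
    (_, adj_tags), (_, noun_tags) = adj, noun
    if adj_tags[:1] != ["adj"] or noun_tags[:1] != ["n"]:
        return None
    if any(clip(adj_tags, p) != clip(noun_tags, p) for p in ("nbr", "gen")):
        return None
    out = ["n"] + [clip(adj_tags, p) for p in ("gen", "nbr", "art")]
    return (merged, [t for t in out if t])


agreeing = apply_agreement(("bærbar", ["adj", "m", "sg", "ind"]),
                           ("datamaskin", ["n", "m", "sg", "ind"]),
                           "bærbar datamaskin")
clashing = apply_agreement(("bærbar", ["adj", "m", "pl", "ind"]),
                           ("datamaskin", ["n", "m", "sg", "ind"]),
                           "bærbar datamaskin")
```

In this sketch `agreeing` is the merged noun unit and `clashing` is `None`, because the number tags differ and the two units are left unmerged.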