Northern Sámi and Norwegian/bidix

From Apertium

Latest revision as of 09:44, 24 August 2012

The apertium-sme-nob bidix makes heavy use of bidix pardefs. The main uses for these are:

  • To change the tag format from the Giellatekno standard to the apertium standard
  • To mark certain sme verbs as inherently passive/causative/reflexive
    • these markings in turn trigger certain transfer rules, most of them in the chunker (t1x)
  • To transfer from one part of speech to another

Verb pardefs

The most complex part of the bidix is probably the verb section. A typical one looks like:

<e><p><l>vurket<s n="V"/><s n="TV"/></l><r>oppbevare<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>

where "pers" marks that the agent is typically animate, and __verb handles the changes in tags for person, number and tense. When translating vurken, the tags <V><TV><Ind><Prs><Sg1> are turned into <vblex><pers><pres><sg><p1> by bidix; the transfer rules then distribute the tags <vblex><pres> onto the verb lemma, creating oppbevarer (and perhaps insert a pronoun using the other tags, creating jeg oppbevarer). Additionally, the pardef handles certain derivations, so when translating vurkejuvvot, the tags <V><TV><Der3><Der_PassL><V><Inf> turn into <vblex><pers><inf><pass>, and transfer rules add <vblex><inf><pass> to the verb lemma, creating oppbevares.
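
As a rough illustration of the lookup described above, here is a minimal Python sketch. It is not how lttoolbox works internally: the dictionaries VERB_PARDEF and ENTRIES and the function bidix_lookup are hypothetical names made up for this example, and the tag table covers only the combinations mentioned on this page. The entry translates the lemma and the first tags (using the TV tag from the vurket entry), and the pardef maps whatever tags remain.

```python
# Hypothetical tag table: the real __verb pardef covers many more tag
# combinations; these names exist only for this sketch.
VERB_PARDEF = {
    ("Ind", "Prs", "Sg1"): ("pres", "sg", "p1"),
    ("Ind", "Prt", "Sg1"): ("pret", "sg", "p1"),
    ("Der3", "Der_PassL", "V", "Inf"): ("inf", "pass"),
}

# Hypothetical bidix entries: (sme lemma, sme tags) -> (nob lemma, nob tags).
ENTRIES = {
    ("vurket", ("V", "TV")): ("oppbevare", ("vblex", "pers")),
}


def bidix_lookup(lemma, tags):
    """Translate the lemma and entry tags, then let the pardef map the rest."""
    for (src_lemma, src_tags), (trg_lemma, trg_tags) in ENTRIES.items():
        n = len(src_tags)
        if lemma == src_lemma and tuple(tags[:n]) == src_tags:
            rest = VERB_PARDEF[tuple(tags[n:])]
            return trg_lemma, trg_tags + rest
    raise KeyError((lemma, tags))


print(bidix_lookup("vurket", ["V", "TV", "Ind", "Prs", "Sg1"]))
print(bidix_lookup("vurket", ["V", "TV", "Der3", "Der_PassL", "V", "Inf"]))
```

The point of the split is that each entry only has to state the lemma pair and the part of speech, while the tag conversion is written once in the pardef.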


However, we can also have another pardef which, in addition to the above, adds a causative tag <caus> which is picked up by transfer:

<e><p><l>divuhit<s n="V"/><s n="TV"/></l><r>reparere<s n="vblex"/><s n="pers"/></r></p><par n="caus__verb"/></e>

Here transfer will try to make a causative construction with this verb, by prepending "la" and moving the finite tense tag there, while turning the verb into an infinitive. Thus given divuhin, tagged <V><TV><Ind><Prt><Sg1>, bidix will output ^reparere<vblex><pers><caus><pret><sg><p1>$, and when transfer sees the verb is tagged <caus>, it creates ^la<vblex><pret>$ ^reparere<vblex><inf>$ (perhaps also inserting a pronoun as above).
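
The causative step can be sketched in Python as follows. This only mimics the effect of the transfer rule; the actual rule lives in the t1x file and is written in Apertium's transfer XML. The function names and the TENSES set are assumptions made for this example.

```python
def format_lu(lemma, tags):
    """Render a lexical unit in Apertium stream format."""
    return "^" + lemma + "".join(f"<{t}>" for t in tags) + "$"


# Assumption: only the tense tags that appear in the examples on this page.
TENSES = {"pres", "pret"}


def causative(lemma, tags):
    """Move the tense onto "la" and leave the main verb as an infinitive."""
    if "caus" not in tags:
        return [format_lu(lemma, tags)]
    tense = next(t for t in tags if t in TENSES)
    return [
        format_lu("la", ["vblex", tense]),
        format_lu(lemma, ["vblex", "inf"]),
    ]


out = causative("reparere", ["vblex", "pers", "caus", "pret", "sg", "p1"])
print(" ".join(out))  # ^la<vblex><pret>$ ^reparere<vblex><inf>$
```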

Similarly, with

<e><p><l>viidánit<s n="V"/><s n="IV"/></l><r>spre<s n="vblex"/><s n="pers"/></r></p><par n="refl__verb"/></e>

we get a reflexive (seg/meg/...) appended by transfer on seeing the <refl> tag added by refl__verb.
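
A small Python sketch of the reflexive choice, under simplifying assumptions: it works on an already generated surface form rather than on lexical units, it ignores the sme dual, and the function name add_reflexive is invented for this example. The pronoun table itself is the standard Bokmål reflexive paradigm.

```python
# Norwegian Bokmål reflexive pronouns by (number, person).
REFLEXIVE = {
    ("sg", "p1"): "meg", ("sg", "p2"): "deg", ("sg", "p3"): "seg",
    ("pl", "p1"): "oss", ("pl", "p2"): "dere", ("pl", "p3"): "seg",
}


def add_reflexive(verb_form, tags):
    """Append the reflexive pronoun that agrees with the subject, if <refl> is set."""
    if "refl" not in tags:
        return verb_form
    number = next(t for t in tags if t in ("sg", "pl"))
    person = next(t for t in tags if t in ("p1", "p2", "p3"))
    return f"{verb_form} {REFLEXIVE[(number, person)]}"


print(add_reflexive("sprer", ["vblex", "pers", "refl", "pres", "sg", "p3"]))
print(add_reflexive("sprer", ["vblex", "pers", "refl", "pres", "pl", "p1"]))
```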

With

<e><p><l>suovganit<s n="V"/><s n="IV"/></l><r>slite<s n="vblex"/><s n="pers"/></r></p><par n="pass__verb"/></e>

we get a <pass> tag and a passive construction, with a participle (here: bli slitt). However, with the passive, the predicate can also be an adjective, which we mark like this:

<e><p><l>viessat<s n="V"/><s n="IV"/></l><r>trøtt<s n="adj"/><s n="pers"/></r></p><par n="pass__verb"/></e>

(other parts of speech for the passive predicates are currently TODO-marked in bidix)
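
The difference between the verbal and the adjectival predicate can be sketched like this in Python. The function name passive, the tag order, and the use of a <pp> tag for the participle are assumptions made for this example, not taken from the actual transfer rules.

```python
def passive(lemma, tags):
    """Return (lemma, tags) pairs for a predicate tagged <pass>."""
    if "pass" not in tags:
        return [(lemma, tags)]
    tense = next(t for t in tags if t in ("pres", "pret", "inf"))
    aux = ("bli", ["vblex", tense])  # the auxiliary carries the tense
    if tags[0] == "adj":
        # Adjectival predicate, e.g. bli trøtt.
        return [aux, (lemma, ["adj"])]
    # Verbal predicate becomes a participle, e.g. bli slitt.
    return [aux, (lemma, ["vblex", "pp"])]


print(passive("slite", ["vblex", "pers", "pass", "pret", "sg", "p3"]))
print(passive("trøtt", ["adj", "pers", "pass", "pres", "sg", "p3"]))
```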


It's up to transfer (mainly the chunker, t1x) to make sense of and clean up these tag combinations.

Pardef | Description | Example | Usage notes
__verb | Regular verb transfer | vurken → (jeg) oppbevarer |
pass__verb | sme verb to nob dynamic passive | áibat → bli forsinket | use lemma forsinke in the <r>; this pardef also works with adjectives (e.g. čuččodit translates to <r>stående<s n="adj"/><s n="pers"/></r></p><par n="pass__verb"/>, bli stående)
pstv__verb | sme verb to nob lexicalised passive | čoggot → samles | use lemma samles in the <r>
refl__verb | sme verb to nob reflexive construction | ceagganit → reise seg | use lemma reise in the <r>



PlcSur__np

For all Plc-tagged proper noun lemmas in bidix, we have to have a Sur-tagged entry too. Even though "Hammerfeasta" is never used as a Sur, sme-dis.rle (and thus apertium-sme-nob.sme-nob.rlx) has a rule that can change arbitrary Plc-tagged proper nouns to Sur. So bidix has to be able to handle that.

If the translation is identical no matter whether it's Plc or Sur, we use a pardef:

<e><p><l>Isuzu<s n="N"/><s n="Prop"/></l><r>Isuzu<s n="np"/><s n="top"/></r></p><par n="PlcSur__np"/></e>

If it is not, we use two separate entries:

<e>       <p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Emmenesveten<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e r="LR"><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Sur"/></l><r>Ádjáčohkka<s n="np"/><s n="top"/></r></p><par n="__np"/></e>

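(since we should never change surnames in translation).

  • Idea: run a huge corpus through CG with --trace, grep for the rule that changes Plc to Sur, add any such lemmas into lexc as Sur, and get rid of the whole PlcSur mess.
    • or just add Sur from http://www.census.gov/genealogy/names/ http://www.ssb.no/navn/alf/main.html http://no.wikipedia.org/wiki/Kategori:Etternavn and remove the rule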
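
The two ways of handling Plc and Sur in this section (one shared entry through PlcSur__np, or two explicit entries) can be sketched in Python as below. The function expand_plcsur and the dictionary layout are hypothetical; the sketch only shows that the pardef saves writing the same translation twice, and that a lookup succeeds whichever of the two tags the disambiguator leaves on the proper noun.

```python
def expand_plcsur(sme_lemma, nob_lemma):
    """Expand one PlcSur__np-style entry into a Plc reading and a Sur reading."""
    return {
        (sme_lemma, ("N", "Prop", sub)): (nob_lemma, ("np", "top"))
        for sub in ("Plc", "Sur")
    }


# Same translation for both readings: one entry is enough.
bidix = expand_plcsur("Isuzu", "Isuzu")

# Different translations: two explicit entries, the surname left unchanged.
bidix[("Ádjáčohkka", ("N", "Prop", "Plc"))] = ("Emmenesveten", ("np", "top"))
bidix[("Ádjáčohkka", ("N", "Prop", "Sur"))] = ("Ádjáčohkka", ("np", "top"))

print(bidix[("Isuzu", ("N", "Prop", "Sur"))])
print(bidix[("Ádjáčohkka", ("N", "Prop", "Plc"))])
```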