Difference between revisions of "Northern Sámi and Norwegian/bidix"
Latest revision as of 09:44, 24 August 2012
The apertium-sme-nob bidix makes heavy use of bidix pardefs. The main uses for these are:
- To change the tag format from the Giellatekno standard to the apertium standard
- To mark certain sme verbs as inherently passive/causative/reflexive
  - These markings in turn trigger certain transfer rules, most of them in the chunker (t1x)
- To transfer from one part of speech to another
Verb pardefs
The most complex part of the bidix is probably the verb section. A typical verb entry looks like:
<e><p><l>vurket<s n="V"/><s n="TV"/></l><r>oppbevare<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>
where "pers" marks that the agent is typically animate, and __verb handles the tag changes for person, number and tense. When translating vurken, bidix turns the tags <V><TV><Ind><Prs><Sg1> into <vblex><pers><pres><sg><p1>; the transfer rules then distribute <vblex><pres> onto the verb lemma, creating oppbevarer (and perhaps insert a pronoun based on the remaining tags, creating jeg oppbevarer). The pardef also handles certain derivations: when translating vurkejuvvot, the tags <V><TV><Der3><Der_PassL><V><Inf> become <vblex><pers><inf><pass>, and the transfer rules put <vblex><inf><pass> on the verb lemma, creating oppbevares.
However, we can also use another pardef which does all of the above and additionally adds a causative tag <caus>, which is picked up by transfer:
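The two-step lookup described above (lemma-specific entry first, then the pardef for the remaining tags) can be sketched in Python. This is only a toy model, not how Apertium's lttoolbox is implemented: the names `ENTRIES`, `VERB_PARDEF`, `bidix_lookup` and `fmt` are made up for illustration, and the pardef table contains only the two tag sequences quoted in the text.

```python
# Toy model of one bidix entry plus its pardef; NOT the real Apertium data structures.

# Lemma-specific part of the entry:
# (source lemma, leading source tags) -> (target lemma, leading target tags)
ENTRIES = {
    ("vurket", ("V", "TV")): ("oppbevare", ("vblex", "pers")),
}

# Hypothetical contents of the "__verb" pardef: remaining source tags -> target tags.
# Only the two sequences mentioned in the text are listed here.
VERB_PARDEF = {
    ("Ind", "Prs", "Sg1"): ("pres", "sg", "p1"),
    ("Der3", "Der_PassL", "V", "Inf"): ("inf", "pass"),
}


def bidix_lookup(lemma, tags):
    """Translate a lemma and its tag sequence; return None if nothing matches."""
    for (src_lemma, src_tags), (trg_lemma, trg_tags) in ENTRIES.items():
        n = len(src_tags)
        if lemma == src_lemma and tuple(tags[:n]) == src_tags:
            rest = VERB_PARDEF.get(tuple(tags[n:]))
            if rest is not None:
                return trg_lemma, trg_tags + rest
    return None


def fmt(lemma, tags):
    """Render a lexical unit in the ^lemma<tag>...$ stream notation."""
    return "^" + lemma + "".join("<%s>" % t for t in tags) + "$"
```

For example, `fmt(*bidix_lookup("vurket", ["V", "TV", "Ind", "Prs", "Sg1"]))` gives the unit that transfer would then receive for vurken.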
<e><p><l>divuhit<s n="V"/><s n="TV"/></l><r>reparere<s n="vblex"/><s n="pers"/></r></p><par n="caus__verb"/></e>
Here transfer will try to make a causative construction with this verb by prepending "la" and moving the finite tense tag onto it, while making the main verb an infinitive. Thus, given divuhin, tagged <V><TV><Ind><Prt><Sg1>, bidix will output ^reparere<vblex><pers><caus><pret><sg><p1>$, and when transfer sees that the verb is tagged <caus>, it creates ^la<vblex><pret>$ ^reparere<vblex><inf>$ (perhaps also inserting a pronoun as above).
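As a rough illustration of that causative step, the Python sketch below splits a <caus>-tagged unit into "la" plus an infinitive. The real logic lives in the t1x transfer rules; the function names and the `FINITE_TENSES` set are assumptions made for this example, and pronoun insertion is left out.

```python
# Toy model of the causative transfer step; the real rules are written in t1x.

# Assumed set of finite tense tags; only "pres" and "pret" appear in the text.
FINITE_TENSES = {"pres", "pret"}


def causative_transfer(lemma, tags):
    """Split a <caus>-tagged verb into 'la' + infinitive; pass other verbs through."""
    if "caus" not in tags:
        return [(lemma, list(tags))]
    # The finite tense moves to the auxiliary "la" ...
    tense = [t for t in tags if t in FINITE_TENSES]
    # ... and the main verb becomes an infinitive.
    return [("la", ["vblex"] + tense), (lemma, ["vblex", "inf"])]


def fmt_units(units):
    """Render a list of (lemma, tags) pairs in ^lemma<tag>...$ notation."""
    return " ".join(
        "^" + lemma + "".join("<%s>" % t for t in tags) + "$"
        for lemma, tags in units
    )
```

Calling `fmt_units(causative_transfer("reparere", ["vblex", "pers", "caus", "pret", "sg", "p1"]))` reproduces the two units shown in the text for divuhin.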
Similarly, with
<e><p><l>viidánit<s n="V"/><s n="IV"/></l><r>spre<s n="vblex"/><s n="pers"/></r></p><par n="refl__verb"/></e>
we get a reflexive pronoun (seg/meg/...) appended by transfer on seeing the <refl> tag added by refl__verb.
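The choice between seg, meg and the other reflexive forms depends on the person and number tags. A minimal Python sketch of that choice follows; it works on plain word forms rather than lexical units because the text does not say how the pronoun is tagged, and the function name and the fallback to "seg" are assumptions.

```python
# Toy model of reflexive insertion; the real rule is part of the t1x transfer file.

# Norwegian Bokmål reflexive pronouns by (number, person).
REFLEXIVES = {
    ("sg", "p1"): "meg", ("sg", "p2"): "deg", ("sg", "p3"): "seg",
    ("pl", "p1"): "oss", ("pl", "p2"): "dere", ("pl", "p3"): "seg",
}


def add_reflexive(verb_form, tags):
    """Append a reflexive pronoun to verb_form if the tags include 'refl'."""
    if "refl" not in tags:
        return verb_form
    number = next((t for t in tags if t in ("sg", "pl")), None)
    person = next((t for t in tags if t in ("p1", "p2", "p3")), None)
    # Assumption: fall back to "seg" when person/number are absent (e.g. infinitives).
    pronoun = REFLEXIVES.get((number, person), "seg")
    return verb_form + " " + pronoun
```

With the tags from a first person singular form, `add_reflexive("spre", ["vblex", "pers", "refl", "pres", "sg", "p1"])` yields "spre meg"; generating the inflected verb form itself is left to the later stages.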
With
<e><p><l>suovganit<s n="V"/><s n="IV"/></l><r>slite<s n="vblex"/><s n="pers"/></r></p><par n="pass__verb"/></e>
we get a "pass" tag and a passive construction with a participle (here: bli slitt). With the passive, however, the predicate can also be an adjective, which we mark like this:
<e><p><l>viessat<s n="V"/><s n="IV"/></l><r>trøtt<s n="adj"/><s n="pers"/></r></p><par n="pass__verb"/></e>
(Other parts of speech for the passive predicates are currently marked TODO in bidix.)
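The difference between the participle case and the adjective case can be sketched as follows. This Python toy model covers only the pass__verb construction with "bli"; it does not model how transfer tells it apart from the derived passive handled by __verb. The function name, the tense list and the use of a <pp> tag for the participle are assumptions, not taken from the actual transfer rules.

```python
# Toy model of the "bli" passive built for pass__verb entries.

# Assumed tense/mood tags that move onto the auxiliary "bli".
AUX_TAGS = ("pres", "pret", "inf")


def passive_transfer(lemma, tags):
    """Build 'bli' + predicate for a <pass>-tagged unit; pass other units through."""
    if "pass" not in tags:
        return [(lemma, list(tags))]
    aux = ("bli", ["vblex"] + [t for t in tags if t in AUX_TAGS])
    if tags[0] == "adj":
        # Adjective predicate, e.g. trøtt: keep it as an adjective.
        predicate = (lemma, ["adj"])
    else:
        # Verb predicate, e.g. slite: use a past participle (assumed tag "pp").
        predicate = (lemma, ["vblex", "pp"])
    return [aux, predicate]
```

For suovganit this gives "bli" followed by a participle of slite (bli slitt), and for viessat it gives "bli" followed by the adjective trøtt.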
It's up to transfer (mainly the chunker, t1x) to make sense of and clean up these tag combinations.
Pardef | Description | Example | Usage notes |
---|---|---|---|
__verb | Regular verb transfer | vurken → (jeg) oppbevarer | |
pass__verb | sme verb to nob dynamic passive | áibat → bli forsinket | use lemma forsinke in the <r>; this pardef also works with adjectives (e.g. čuččodit translates to <r>stående<s n="adj"/><s n="pers"/></r></p><par n="pass__verb"/>, bli stående) |
pstv__verb | sme verb to nob lexicalised passive | čoggot → samles | use lemma samles in the <r> |
refl__verb | sme verb to nob reflexive construction | ceagganit → reise seg | use lemma reise in the <r> |
PlcSur__np
Every Plc-tagged proper noun lemma in bidix must also have a Sur-tagged entry. Even though "Hammerfeasta" is never used as a surname, sme-dis.rle (and thus apertium-sme-nob.sme-nob.rlx) has a rule that can change arbitrary Plc-tagged proper nouns to Sur, so bidix has to be able to handle that.
If the translation is the same for both Plc and Sur, we use a pardef:
<e><p><l>Isuzu<s n="N"/><s n="Prop"/></l><r>Isuzu<s n="np"/><s n="top"/></r></p><par n="PlcSur__np"/></e>
If it is not, we use two separate entries:
<e><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Emmenesveten<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e r="LR"><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Sur"/></l><r>Ádjáčohkka<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
(since we should never change surnames in translation).
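Since the Plc/Sur requirement is easy to violate when adding entries by hand, a small check can help. The sketch below uses Python's standard `xml.etree.ElementTree` to list Plc lemmas that have neither a Sur entry nor the PlcSur__np pardef. It is not part of the apertium-sme-nob tooling; the function name and the sample data are hypothetical (the Hammerfeasta entry in particular is made up to show a failing case), and lemmas containing inline elements such as <b/> are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature bidix; the Hammerfeasta entry is invented for the example.
SAMPLE = """<dictionary><section id="main" type="standard">
<e><p><l>Isuzu<s n="N"/><s n="Prop"/></l><r>Isuzu<s n="np"/><s n="top"/></r></p><par n="PlcSur__np"/></e>
<e><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Emmenesveten<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e r="LR"><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Sur"/></l><r>Ádjáčohkka<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e><p><l>Hammerfeasta<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Hammerfest<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
</section></dictionary>"""


def lemmas_missing_sur(dix_xml):
    """Return sorted Plc lemmas lacking both a Sur entry and the PlcSur__np pardef."""
    root = ET.fromstring(dix_xml)
    plc, sur = set(), set()
    # Only look at <section> entries, so that pardef definitions are skipped.
    for section in root.iter("section"):
        for entry in section.iter("e"):
            left = entry.find("p/l")
            if left is None:
                continue
            lemma = (left.text or "").strip()
            tags = [s.get("n") for s in left.findall("s")]
            pars = [p.get("n") for p in entry.findall("par")]
            if "PlcSur__np" in pars:
                # The pardef covers both readings at once.
                plc.add(lemma)
                sur.add(lemma)
            if "Plc" in tags:
                plc.add(lemma)
            if "Sur" in tags:
                sur.add(lemma)
    return sorted(plc - sur)
```

On the sample above, `lemmas_missing_sur(SAMPLE)` reports only Hammerfeasta, because Isuzu is covered by the pardef and Ádjáčohkka has an explicit Sur entry.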
- Idea: run a huge corpus through CG with --trace, grep for the rule that changes Plc to Sur, add any such lemmas into lexc as Sur, and get rid of the whole PlcSur mess.
  - Or just add Sur entries from http://www.census.gov/genealogy/names/, http://www.ssb.no/navn/alf/main.html and http://no.wikipedia.org/wiki/Kategori:Etternavn, and remove the rule.