Difference between revisions of "Northern Sámi and Norwegian/bidix"
Revision as of 11:26, 13 April 2012
The apertium-sme-nob bidix makes heavy use of bidix pardefs. There are two main uses for these:
- To change from sme PoS tags to nob PoS tags
- To mark certain sme verbs as inherently passive/causative/reflexive
  - these markings in turn trigger certain transfer rules, most of them in the chunker (t1x)
The most complex part of the bidix is probably the verb section. A typical entry looks like:
<e><p><l>vurket<s n="V"/><s n="TV"/></l><r>oppbevare<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>
where "pers" marks that the agent is typically animate, and __verb handles the tag changes for person, number and tense. A different pardef can do the same thing while also adding a causative tag, "caus", which is picked up by transfer:
<e><p><l>divuhit<s n="V"/><s n="TV"/></l><r>reparere<s n="vblex"/><s n="pers"/></r></p><par n="caus__verb"/></e>
Here transfer will try to make a causative construction with this verb, by prepending "la" and putting the finite tense there, leaving the main verb in the infinitive.
Similarly, with
<e><p><l>viidánit<s n="V"/><s n="IV"/></l><r>spre<s n="vblex"/><s n="pers"/></r></p><par n="refl__verb"/></e>
we get a reflexive pronoun (seg/meg/...) appended by transfer when it sees the "refl" tag added by refl__verb.
With
<e><p><l>suovganit<s n="V"/><s n="IV"/></l><r>slite<s n="vblex"/><s n="pers"/></r></p><par n="pass__verb"/></e>
we get a "pass" tag and a passive construction, with a participle (here: bli slitt). However, with the passive, the predicate might also be an adjective, which we mark like this:
<e><p><l>viessat<s n="V"/><s n="IV"/></l><r>trøtt<s n="adj"/><s n="pers"/></r></p><par n="pass__verb"/></e>
(other parts of speech for the passive predicates are currently TODO-marked in bidix)
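To get an overview of which verbs carry which marking, the entries can be grouped by the pardef they reference. The following is a minimal sketch in Python, not part of the apertium-sme-nob tooling: the `<section>` wrapper, the helper names `lemma` and `group_by_pardef`, and the inline fragment are assumptions made for illustration, and the entries are the ones shown above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Entries copied from the examples above. The <section> wrapper is only
# there so that the fragment parses as one XML document.
BIDIX_FRAGMENT = """<section>
<e><p><l>vurket<s n="V"/><s n="TV"/></l><r>oppbevare<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>
<e><p><l>divuhit<s n="V"/><s n="TV"/></l><r>reparere<s n="vblex"/><s n="pers"/></r></p><par n="caus__verb"/></e>
<e><p><l>viidánit<s n="V"/><s n="IV"/></l><r>spre<s n="vblex"/><s n="pers"/></r></p><par n="refl__verb"/></e>
<e><p><l>suovganit<s n="V"/><s n="IV"/></l><r>slite<s n="vblex"/><s n="pers"/></r></p><par n="pass__verb"/></e>
<e><p><l>viessat<s n="V"/><s n="IV"/></l><r>trøtt<s n="adj"/><s n="pers"/></r></p><par n="pass__verb"/></e>
</section>"""


def lemma(side):
    """Return the text before the first <s> tag of an <l> or <r> element.

    Simplification: multiword lemmas that use <b/> are not reassembled.
    """
    return (side.text or "").strip()


def group_by_pardef(xml_text):
    """Map each pardef name to the (sme lemma, nob lemma) pairs that use it."""
    groups = defaultdict(list)
    for entry in ET.fromstring(xml_text).iter("e"):
        par = entry.find("par")
        pair = entry.find("p")
        if par is None or pair is None:
            continue  # skip entries without a pardef reference or a <p> pair
        groups[par.get("n")].append((lemma(pair.find("l")), lemma(pair.find("r"))))
    return dict(groups)


groups = group_by_pardef(BIDIX_FRAGMENT)
for name, pairs in sorted(groups.items()):
    print(name, pairs)
```

Grouping on the `<par>` name rather than on the tags of the `<l>` side reflects the point made above: the passive/causative/reflexive marking lives in the pardef, not in the sme analysis.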
The deverbal__n pardef is used to give lemma-specific overrides for the derivations (Der2.Actor, Der3.Der_n) which turn verbs into nouns:
<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>
<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruker<s n="n"/><s n="m"/></r></p><par n="deverbal__n"/></e>
(see Northern_Sámi_and_Norwegian/Derivations)
It's up to transfer (mainly the chunker, t1x) to make sense of and clean up these tag combinations.
PlcSur__np
For all Plc-tagged proper noun lemmas in bidix, we have to have a Sur-tagged entry too. Even though "Hammerfeasta" is never used as a Sur, sme-dis.rle (and thus apertium-sme-nob.sme-nob.rlx) has a rule that can change arbitrary Plc-tagged proper nouns to Sur. So bidix has to be able to handle that.
If the translation is identical no matter whether it's Plc or Sur, we use a pardef:
<e><p><l>Isuzu<s n="N"/><s n="Prop"/></l><r>Isuzu<s n="np"/><s n="top"/></r></p><par n="PlcSur__np"/></e>
If it's not, we write two separate entries, one for Plc and one for Sur:
<e> <p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Emmenesveten<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e r="LR"><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Sur"/></l><r>Ádjáčohkka<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
(since we should never change surnames in translation).
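The requirement that every Plc-tagged lemma also has a Sur-tagged entry can be checked mechanically. The following is a rough sketch in Python under stated assumptions: it treats a PlcSur__np entry as covering both readings, as described above, and the function name `plc_without_sur`, the `<section>` wrapper and the last entry (Hammerfeasta with only a Plc reading, translated as Hammerfest) are hypothetical, added only so that the check has something to report.

```python
import xml.etree.ElementTree as ET

# The first three entries are the examples above. The last one is a
# hypothetical entry with a Plc reading but no Sur counterpart.
BIDIX_FRAGMENT = """<section>
<e><p><l>Isuzu<s n="N"/><s n="Prop"/></l><r>Isuzu<s n="np"/><s n="top"/></r></p><par n="PlcSur__np"/></e>
<e><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Emmenesveten<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e r="LR"><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Sur"/></l><r>Ádjáčohkka<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e><p><l>Hammerfeasta<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Hammerfest<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
</section>"""


def plc_without_sur(xml_text):
    """Return the sme lemmas that have a Plc reading but no Sur reading."""
    plc, sur = set(), set()
    for entry in ET.fromstring(xml_text).iter("e"):
        left = entry.find("p/l")
        if left is None:
            continue
        lemma = (left.text or "").strip()
        tags = {s.get("n") for s in left.findall("s")}
        par = entry.find("par")
        # Assumption: the PlcSur__np pardef supplies both the Plc and the Sur tag.
        if par is not None and par.get("n") == "PlcSur__np":
            plc.add(lemma)
            sur.add(lemma)
            continue
        if "Plc" in tags:
            plc.add(lemma)
        if "Sur" in tags:
            sur.add(lemma)
    return sorted(plc - sur)


missing = plc_without_sur(BIDIX_FRAGMENT)
print(missing)
```

Any lemma reported here would fail to translate if the disambiguator rewrote its Plc tag to Sur.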
- Idea: run a huge corpus through CG with --trace, grep for the rule that changes Plc to Sur, add any such lemmas into lexc as Sur, and get rid of the whole PlcSur mess.