Northern Sámi and Norwegian/bidix

The apertium-sme-nob bidix makes heavy use of bidix pardefs. The main uses for these are:

To change the tag format from the Giellatekno standard to the apertium standard
To mark certain sme verbs as inherently passive/causative/reflexive
- these markings in turn trigger certain transfer rules, most of them in the chunker (t1x)
To transfer from one part of speech to another

The most complex part of the bidix is probably the verb section. A typical entry looks like:

<e><p><l>vurket<s n="V"/><s n="TV"/></l><r>oppbevare<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>

where "pers" marks that the agent is typically animate, and __verb handles the changes in tags for person, number and tense. When translating vurken, the tags <V><TV><Ind><Prs><Sg1> are turned into <vblex><pers><pres><sg><p1> by bidix; the transfer rules then distribute the tags <vblex><pres> onto the verb lemma, creating oppbevarer (and perhaps insert a pronoun using the other tags, creating jeg oppbevarer). Additionally, the pardef handles certain derivations, so when translating vurkejuvvot, the tags <V><TV><Der3><Der_PassL><V><Inf> turn into <vblex><pers><inf><pass>, and the transfer rules add <vblex><inf><pass> to the verb lemma, creating oppbevares.
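As a rough illustration of the tag rewriting just described, the Python sketch below mimics what the entry together with __verb does to the tags of vurken. The two mapping tables are hypothetical and cover only tags mentioned on this page; the real pardef in the bidix is much larger and may be organised differently.

```python
# Hypothetical, partial tables: only tags mentioned on this page are covered.
ENTRY_TAGS = {("V", "TV"): ["vblex", "pers"]}  # taken from <l>/<r> of the entry
PARDEF_TAGS = {                                 # assumed subset of __verb
    ("Ind", "Prs"): ["pres"],
    ("Ind", "Prt"): ["pret"],
    ("Sg1",): ["sg", "p1"],
}


def bidix_tags(sme_tags):
    """Rewrite Giellatekno-style tags into Apertium-style tags, left to right."""
    table = {**ENTRY_TAGS, **PARDEF_TAGS}
    out, i = [], 0
    while i < len(sme_tags):
        # Try a two-tag key first, then fall back to a single tag.
        for width in (2, 1):
            key = tuple(sme_tags[i:i + width])
            if len(key) == width and key in table:
                out.extend(table[key])
                i += width
                break
        else:
            raise KeyError(f"no mapping for {sme_tags[i]!r}")
    return out


print(bidix_tags(["V", "TV", "Ind", "Prs", "Sg1"]))
# ['vblex', 'pers', 'pres', 'sg', 'p1']
```

Splitting the table in two mirrors the division of labour in the bidix: the entry itself supplies the part-of-speech tags, while the pardef supplies the inflectional ones.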

However, we can also have another pardef which, in addition to the above, also adds a causative tag <caus> which is picked up by transfer:

<e><p><l>divuhit<s n="V"/><s n="TV"/></l><r>reparere<s n="vblex"/><s n="pers"/></r></p><par n="caus__verb"/></e>

Here transfer will try to make a causative construction with this verb, by prepending "la" and moving the finite tense tag there, while making the main verb an infinitive. Thus given divuhin, tagged <V><TV><Ind><Prt><Sg1>, bidix will output ^reparere<vblex><pers><caus><pret><sg><p1>$, and when transfer sees that the verb is tagged <caus>, it creates ^la<vblex><pret>$ ^reparere<vblex><inf>$ (perhaps also inserting a pronoun as above).
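The causative step can be sketched in Python as follows. This is not the actual t1x rule, only a minimal model of its effect on the divuhin example; the function names and the TENSES set are assumptions made for the illustration.

```python
TENSES = {"pres", "pret"}  # assumed: the finite tense tags relevant here


def causative_transfer(lemma, tags):
    """Split a <caus>-tagged verb into finite 'la' plus an infinitive main verb."""
    tense = next((t for t in tags if t in TENSES), None)
    if "caus" not in tags or tense is None:
        return [(lemma, tags)]  # nothing to do, pass the unit through
    return [("la", ["vblex", tense]), (lemma, ["vblex", "inf"])]


def render(units):
    """Format (lemma, tags) pairs in Apertium stream format."""
    return " ".join(
        "^" + lemma + "".join(f"<{t}>" for t in tags) + "$"
        for lemma, tags in units
    )


bidix_out = ("reparere", ["vblex", "pers", "caus", "pret", "sg", "p1"])
print(render(causative_transfer(*bidix_out)))
# ^la<vblex><pret>$ ^reparere<vblex><inf>$
```

Pronoun insertion from the remaining person and number tags is left out of the sketch.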

Similarly, with

<e><p><l>viidánit<s n="V"/><s n="IV"/></l><r>spre<s n="vblex"/><s n="pers"/></r></p><par n="refl__verb"/></e>

we get a reflexive (seg/meg/...) appended by transfer on seeing the "refl" tag added by refl__verb.

With

<e><p><l>suovganit<s n="V"/><s n="IV"/></l><r>slite<s n="vblex"/><s n="pers"/></r></p><par n="pass__verb"/></e>

we get a "pass" tag and a passive construction, with a participle (here: bli slitt). However, with the passive, the predicate might also be an adjective, which we mark like this:

<e><p><l>viessat<s n="V"/><s n="IV"/></l><r>trøtt<s n="adj"/><s n="pers"/></r></p><par n="pass__verb"/></e>

(other parts of speech for the passive predicates are currently TODO-marked in bidix)

The deverbal__n pardef is used to give lemma-specific overrides for the derivations (Der2.Actor, Der3.Der_n) which turn verbs into nouns:

<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>
<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruker<s n="n"/><s n="m"/></r></p><par n="deverbal__n"/></e>

(see Northern_Sámi_and_Norwegian/Derivations)

It's up to transfer (mainly the chunker, t1x) to make sense of and clean up these tag combinations.

PlcSur__np

For all Plc-tagged proper noun lemmas in bidix, we have to have a Sur-tagged entry too. Even though "Hammerfeasta" is never used as a Sur, sme-dis.rle (and thus apertium-sme-nob.sme-nob.rlx) has a rule that can change arbitrary Plc-tagged proper nouns to Sur. So bidix has to be able to handle that.

If the translation is identical no matter whether it's Plc or Sur, we use a pardef:

<e><p><l>Isuzu<s n="N"/><s n="Prop"/></l><r>Isuzu<s n="np"/><s n="top"/></r></p><par n="PlcSur__np"/></e>

If it's not, we use two separate entries:

<e>       <p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Emmenesveten<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e r="LR"><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Sur"/></l><r>Ádjáčohkka<s n="np"/><s n="top"/></r></p><par n="__np"/></e>

(since we should never change surnames in translation).
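Since every Plc lemma needs a Sur counterpart, a small consistency check can help find entries that were missed. The following Python sketch is one possible way to do it, assuming entries shaped like the ones above. The Hammerfeasta entry in the sample is made up for the demonstration, and the function name is hypothetical. Lemmas containing <b/> or entries using <i> instead of <p> are not handled.

```python
import xml.etree.ElementTree as ET

# Sample data: the Hammerfeasta entry is invented to show a missing Sur entry.
SAMPLE = """<section>
<e><p><l>Isuzu<s n="N"/><s n="Prop"/></l><r>Isuzu<s n="np"/><s n="top"/></r></p><par n="PlcSur__np"/></e>
<e><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Emmenesveten<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e r="LR"><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Sur"/></l><r>Ádjáčohkka<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e><p><l>Hammerfeasta<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Hammerfest<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
</section>"""


def plc_without_sur(root):
    """Return lemmas that have a Plc entry but no matching Sur entry."""
    plc, sur = set(), set()
    for entry in root.iter("e"):
        left = entry.find("p/l")
        if left is None:
            continue
        if any(par.get("n") == "PlcSur__np" for par in entry.findall("par")):
            continue  # the pardef already covers both Plc and Sur
        lemma = left.text or ""
        tags = {s.get("n") for s in left.findall("s")}
        if "Plc" in tags:
            plc.add(lemma)
        if "Sur" in tags:
            sur.add(lemma)
    return sorted(plc - sur)


print(plc_without_sur(ET.fromstring(SAMPLE)))
# ['Hammerfeasta']
```

To run it on a real bidix, replace ET.fromstring(SAMPLE) with ET.parse(path).getroot(), where path points to the dictionary file.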

Idea: run a huge corpus through CG with --trace, grep for the rule that changes Plc to Sur, add any such lemmas into lexc as Sur, and get rid of the whole PlcSur mess.
- or just add Sur from http://www.census.gov/genealogy/names/ http://www.ssb.no/navn/alf/main.html http://no.wikipedia.org/wiki/Kategori:Etternavn and remove the rule
