Northern Sámi and Norwegian/bidix

The apertium-sme-nob bidix makes heavy use of bidix pardefs. The main uses for these are:

To change the tag format from the Giellatekno standard to the apertium standard
To mark certain sme verbs as inherently passive/causative/reflexive
- these markings in turn trigger certain transfer rules, most of them in the chunker (t1x)
To transfer from one part of speech to another

Verb pardefs

The most complex part of the bidix is probably the verb section. A typical entry looks like:

<e><p><l>vurket<s n="V"/><s n="TV"/></l><r>oppbevare<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>

where "pers" marks that the agent is typically animate, and __verb handles the changes in tags for person, number and tense. When translating vurken, the tags <V><TV><Ind><Prs><Sg1> are turned into <vblex><pers><pres><sg><p1> by bidix; the transfer rules then distribute the tags <vblex><pres> onto the verb lemma, creating oppbevarer (and perhaps insert a pronoun using the other tags, creating jeg oppbevarer). Additionally, the pardef handles certain derivations: when translating vurkejuvvot, the tags <V><TV><Der3><Der_PassL><V><Inf> are turned into <vblex><pers><inf><pass>, and the transfer rules add <vblex><inf><pass> to the verb lemma, creating oppbevares.
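The tag rewriting described above can be modelled with a small Python sketch. This is a toy illustration only, not how lttoolbox actually compiles or applies pardefs: the names ENTRY, VERB_PARDEF, bidix_lookup and fmt are made up for the example, and the table contains just the two tag sequences quoted in the text (the real __verb pardef covers far more).

```python
# Toy model of one bidix entry plus the __verb pardef.
# Assumption: only the two tag mappings quoted in the text are listed here.
ENTRY = {
    "left": ("vurket", ("V", "TV")),
    "right": ("oppbevare", ("vblex", "pers")),
}

# Remaining sme tags (after the entry's own tags) -> nob tags to append.
VERB_PARDEF = {
    ("Ind", "Prs", "Sg1"): ("pres", "sg", "p1"),
    ("Der3", "Der_PassL", "V", "Inf"): ("inf", "pass"),
}


def bidix_lookup(lemma, tags):
    """Return (target lemma, target tags), or None if the entry does not match."""
    left_lemma, left_tags = ENTRY["left"]
    if lemma != left_lemma or tuple(tags[:len(left_tags)]) != left_tags:
        return None
    rest = tuple(tags[len(left_tags):])
    if rest not in VERB_PARDEF:
        return None
    right_lemma, right_tags = ENTRY["right"]
    return right_lemma, right_tags + VERB_PARDEF[rest]


def fmt(lemma, tags):
    """Format a lexical unit in Apertium stream style."""
    return "^" + lemma + "".join(f"<{t}>" for t in tags) + "$"


print(fmt(*bidix_lookup("vurket", ["V", "TV", "Ind", "Prs", "Sg1"])))
# prints ^oppbevare<vblex><pers><pres><sg><p1>$
```

The point of the split is that the entry itself only states the lemma pair and the lexical tags, while everything inflectional lives in the shared pardef.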

However, we can also have another pardef which, in addition to the above, also adds a causative tag <caus> which is picked up by transfer:

<e><p><l>divuhit<s n="V"/><s n="TV"/></l><r>reparere<s n="vblex"/><s n="pers"/></r></p><par n="caus__verb"/></e>

Here transfer will try to make a causative construction with this verb, by prepending "la" and moving the finite tense tag onto it, while putting the main verb in the infinitive. Thus given divuhin, tagged <V><TV><Ind><Prt><Sg1>, bidix will output ^reparere<vblex><pers><caus><pret><sg><p1>$, and when transfer sees that the verb is tagged <caus>, it creates ^la<vblex><pret>$ ^reparere<vblex><inf>$ (perhaps also inserting a pronoun as above).
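As a rough illustration of the causative step, here is a minimal Python sketch. The real rules are written in the t1x transfer file, not in Python; transfer_caus, fmt_units and FINITE_TENSES are hypothetical names, and the tense set lists only the tags that appear in the examples on this page.

```python
# Assumption: only the finite tense tags that occur in the examples above.
FINITE_TENSES = {"pres", "pret"}


def transfer_caus(lemma, tags):
    """Split a <caus>-tagged verb into 'la' + infinitive.

    Returns a list of (lemma, tags) lexical units. Verbs without <caus>
    are passed through unchanged in this sketch.
    """
    if "caus" not in tags:
        return [(lemma, list(tags))]
    tense = [t for t in tags if t in FINITE_TENSES]
    return [("la", ["vblex"] + tense), (lemma, ["vblex", "inf"])]


def fmt_units(units):
    """Format lexical units in Apertium stream style."""
    return " ".join(
        "^" + lemma + "".join(f"<{t}>" for t in tags) + "$"
        for lemma, tags in units
    )


print(fmt_units(transfer_caus(
    "reparere", ["vblex", "pers", "caus", "pret", "sg", "p1"])))
# prints ^la<vblex><pret>$ ^reparere<vblex><inf>$
```

Pronoun insertion from the person and number tags is left out here to keep the sketch short.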

Similarly, with

<e><p><l>viidánit<s n="V"/><s n="IV"/></l><r>spre<s n="vblex"/><s n="pers"/></r></p><par n="refl__verb"/></e>

we get a reflexive (seg/meg/...) appended by transfer on seeing the <refl> tag added by refl__verb.

With

<e><p><l>suovganit<s n="V"/><s n="IV"/></l><r>slite<s n="vblex"/><s n="pers"/></r></p><par n="pass__verb"/></e>

we get a "pass" tag and a passive construction, with a participle (here: bli slitt). However, with the passive, the predicate can also be an adjective, which we mark like this:

<e><p><l>viessat<s n="V"/><s n="IV"/></l><r>trøtt<s n="adj"/><s n="pers"/></r></p><par n="pass__verb"/></e>

(Other parts of speech for the passive predicate are currently TODO-marked in bidix.)

It's up to transfer (mainly the chunker, t1x) to make sense of and clean up these tag combinations.

Pardef	Description	Example	Usage notes
__verb	Regular verb transfer	vurken → (jeg) oppbevarer
pass__verb	sme verb to nob dynamic passive	áibat → bli forsinket	use lemma forsinke in the <r>; this pardef also works with adjectives (e.g. čuččodit → bli stående, with <r>stående<s n="adj"/><s n="pers"/></r> and <par n="pass__verb"/>)
pstv__verb	sme verb to nob lexicalised passive	čoggot → samles	use lemma samles in the <r>
refl__verb	sme verb to nob reflexive construction	ceagganit → reise seg	use lemma reise in the <r>

PlcSur__np

For all Plc-tagged proper noun lemmas in bidix, we have to have a Sur-tagged entry too. Even though "Hammerfeasta" is never used as a Sur, sme-dis.rle (and thus apertium-sme-nob.sme-nob.rlx) has a rule that can change arbitrary Plc-tagged proper nouns to Sur. So bidix has to be able to handle that.

If the translation is identical no matter whether it's Plc or Sur, we use a pardef:

<e><p><l>Isuzu<s n="N"/><s n="Prop"/></l><r>Isuzu<s n="np"/><s n="top"/></r></p><par n="PlcSur__np"/></e>

If it's not, we write two separate entries:

<e>       <p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Emmenesveten<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e r="LR"><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Sur"/></l><r>Ádjáčohkka<s n="np"/><s n="top"/></r></p><par n="__np"/></e>

(since we should never change surnames in translation).
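The requirement that every Plc-tagged proper noun also has a Sur counterpart can be checked mechanically. The following is a minimal sketch, assuming entries shaped like the ones shown above; missing_sur and SAMPLE are names invented for this example, the Hammerfeasta entry in the sample is made up to trigger the check, and multiword lemmas (which contain extra elements inside <l>) are not handled.

```python
import xml.etree.ElementTree as ET

# Sample fragment: the first three entries follow the page's examples,
# the Hammerfeasta entry is hypothetical and deliberately lacks a Sur twin.
SAMPLE = """<section>
<e><p><l>Isuzu<s n="N"/><s n="Prop"/></l><r>Isuzu<s n="np"/><s n="top"/></r></p><par n="PlcSur__np"/></e>
<e><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Emmenesveten<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e r="LR"><p><l>Ádjáčohkka<s n="N"/><s n="Prop"/><s n="Sur"/></l><r>Ádjáčohkka<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
<e><p><l>Hammerfeasta<s n="N"/><s n="Prop"/><s n="Plc"/></l><r>Hammerfest<s n="np"/><s n="top"/></r></p><par n="__np"/></e>
</section>"""


def missing_sur(dix_xml):
    """Return Prop lemmas that have a Plc entry but no Sur entry."""
    root = ET.fromstring(dix_xml)
    plc, sur = set(), set()
    for e in root.iter("e"):
        left = e.find("p/l")
        if left is None:
            continue
        lemma = (left.text or "").strip()
        tags = [s.get("n") for s in left.findall("s")]
        if "Prop" not in tags:
            continue
        pars = {p.get("n") for p in e.findall("par")}
        if "PlcSur__np" in pars:
            # The pardef covers both readings at once.
            plc.add(lemma)
            sur.add(lemma)
        if "Plc" in tags:
            plc.add(lemma)
        if "Sur" in tags:
            sur.add(lemma)
    return sorted(plc - sur)


print(missing_sur(SAMPLE))
# prints ['Hammerfeasta']
```

Run against the real bidix, a non-empty result would point to entries that the Plc-to-Sur rule could leave untranslated.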

Idea: run a huge corpus through CG with --trace, grep for the rule that changes Plc to Sur, add any such lemmas into lexc as Sur, and get rid of the whole PlcSur mess.
- or just add Sur from http://www.census.gov/genealogy/names/ http://www.ssb.no/navn/alf/main.html http://no.wikipedia.org/wiki/Kategori:Etternavn and remove the rule
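The first idea could be prototyped with a short script over the trace output. This is a sketch under assumptions: it expects the usual CG stream layout (a wordform line, then tab-indented reading lines with any trace tags appended at the end, and removed readings prefixed with ";"), the rule marker SUBSTITUTE:9999 is invented (the real rule type and line number have to be looked up in the grammar), and the sample lines are made up rather than real output.

```python
import re

# A reading line: leading whitespace, quoted lemma, then tags and trace info.
READING = re.compile(r'^\s+"([^"]+)"\s+(.*)$')


def lemmas_hit_by_rule(trace_lines, rule_marker):
    """Collect lemmas of surviving readings that carry the given trace tag."""
    lemmas = set()
    for line in trace_lines:
        if line.startswith(";"):
            continue  # reading removed by some rule; ignore it
        m = READING.match(line)
        if m and rule_marker in m.group(2).split():
            lemmas.add(m.group(1))
    return sorted(lemmas)


# Made-up trace lines for illustration only.
SAMPLE_TRACE = [
    '"<Hammerfeasta>"',
    '\t"Hammerfeasta" N Prop Plc Sg Nom',
    '"<Isuzu>"',
    '\t"Isuzu" N Prop Sur Sg Nom SUBSTITUTE:9999',
]

print(lemmas_hit_by_rule(SAMPLE_TRACE, "SUBSTITUTE:9999"))
# prints ['Isuzu']
```

The resulting lemma list would be the candidates to add to lexc as Sur. Note that the marker is compared as a whole token, so a trace tag that also carries a rule name would need the full string.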
