Talk:Northern Sámi and Norwegian

Transfer strategy

So far I've been thinking this:

t1x: chunking
- Turn adjectives and nouns into SN chunks, give them the right gender and number
- Derivations into phrases?
t2x: movement
- Put adpositions in front of SN chunks
- In general move SN chunks around verbs, adverbs etc. to get right word order
- Guess definiteness from word order, case, syntactic function
t3x: cleanup
- Eg. if definiteness changed, make sure adj tags are consistent

We could also do:

t1x: light chunking (SN, ...)
t2x: more chunking (Relatives, subordinate clauses)
t3x: moving around and stuff
t4x: cleanup.

- Francis Tyers 18:32, 18 January 2010 (UTC)

So t1x to t4x are separate files, is that it? Phrases raise both easy and hard issues, which speaks in favour of the four-stage setup. But what is the clear-cut criterion for light vs. heavy chunking? Trondtr 12:26, 19 January 2010 (UTC)

We'll need rules to cover both compounding and derivation, which speaks for the 4-stage setup (eg. each noun could be a compound, multiplying each noun rule by two, or more if we have longer compounds?). We still need to figure out which phenomena go in which stage, though. unhammer 13:09, 19 January 2010 (UTC)

t1x
- (de-)compounding,
- derivation,
- simple noun phrases (heads and their simple modifiers/specifiers: adj nom, adj adj nom, det adj adj nom, num adj nom),
- simple periphrastic verb combinations (verb, vaux pp, vaux inf)
t2x
- relatives (SN "who" SV -> SN)
- co-ordination (SN "and" SN -> SN)
- genitive modifiers (SN SN-Gen): "[University of Reykjavik] [big old library]-GEN"
t3x
- move postpositions (SN ADPOS -> ADPOS SN) "[1 big house which is on the hill] [2 in]"
- V2? --unhammer 13:04, 20 January 2010 (UTC) +1 Francis Tyers
- Insert dropped pronouns? (Or tags for them?)--unhammer 14:25, 20 January 2010 (UTC) +1 Francis Tyers
t4x
- Insert prepositions.
- Insert articles? --unhammer 13:32, 20 January 2010 (UTC)
- Cleanup

- Francis Tyers 14:37, 19 January 2010 (UTC)
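
As a rough illustration of the t3x postposition step listed above (SN ADPOS -> ADPOS SN), here is a minimal Python sketch. It is not transfer-rule code: the `(chunk_type, text)` tuples and the function name `move_postpositions` are assumptions made up for this example, and real rules would be written in the t3x XML format.

```python
def move_postpositions(chunks):
    """Reorder every adjacent (SN, ADPOS) pair into (ADPOS, SN).

    `chunks` is a list of (chunk_type, text) tuples; this representation
    is hypothetical and only stands in for the chunk stream.
    """
    out = []
    i = 0
    while i < len(chunks):
        is_pair = (
            i + 1 < len(chunks)
            and chunks[i][0] == "SN"
            and chunks[i + 1][0] == "ADPOS"
        )
        if is_pair:
            # Emit the adposition first, then the noun phrase chunk.
            out.extend([chunks[i + 1], chunks[i]])
            i += 2
        else:
            out.append(chunks[i])
            i += 1
    return out


example = [("SN", "big house which is on the hill"), ("ADPOS", "in")]
reordered = move_postpositions(example)
print(reordered)
```

The sketch only handles directly adjacent pairs, which is why the heavier chunking (relatives, co-ordination) has to happen in t2x first: by t3x the whole "big house which is on the hill" is already one SN.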

Level	Description	Test case
t1x	(de-)compounding	Politiijastašuvnna

Wishlist / Difficulties with the architecture / Ugly hacks

Clipping a substring in transfer (or any better solution)

For inserting prepositions, we first tried just adding them to the chunk name in t1x (adj_nom => til_adj_nom) and reading them off in t4x. However, since apertium-interchunk has no function to remove the first n letters of a string, we couldn't write a general method in t2x or t3x to eg. switch or remove the preposition based on a larger context.

Having the preposition in a tag is rather ugly.

We ended up just adding it as a chunk -- however, this means that all t2x/t3x rules working on eg. SN now have to be duplicated for the possibility of PR SN too.
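
To make the wish concrete, this is roughly the operation we were missing, sketched in Python. Nothing here exists in apertium-interchunk: `split_preposition` and the `KNOWN_PREPOSITIONS` set are hypothetical names for illustration, and the sketch assumes the preposition is joined to the chunk name with an underscore, as in til_adj_nom.

```python
# Hypothetical list of prepositions that t1x might prepend to a chunk name.
KNOWN_PREPOSITIONS = {"til", "i", "på"}


def split_preposition(chunk_name):
    """Split 'til_adj_nom' into ('til', 'adj_nom').

    Returns (None, chunk_name) when the name has no known preposition prefix,
    so plain names such as 'adj_nom' pass through unchanged.
    """
    head, sep, rest = chunk_name.partition("_")
    if sep and head in KNOWN_PREPOSITIONS:
        return head, rest
    return None, chunk_name


with_prep = split_preposition("til_adj_nom")
without_prep = split_preposition("adj_nom")
print(with_prep, without_prep)
```

With something like this available in t2x/t3x, a rule could swap or drop the preposition and keep matching on the bare chunk name, instead of duplicating every SN rule for PR SN.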

No UTF in sdefs

@←SPRED is not a valid sdef. Is this just because of it being an XML ID?

I think you can have UTF-8 in sdefs, but they cannot start with a non-alphabetic character. - Francis Tyers 13:34, 10 February 2010 (UTC)

Automatic numbering for chunk tag variables

We use stuff like <lit-tag v="4"/> in the <lu>'s in t1x to say that this tag should be assigned the fourth chunk tag after postchunk, eg. <ind>. This is a handy feature which lightens the load on postchunk, but it hasn't been utilised to the fullest…

We currently have to use a variable to keep that number instead of just inserting the lit-tag directly in the lexical unit, since we don't know if it'll be the fourth or the fifth chunk tag or what; some tags may be empty (the Qst chunk tag is only there if the word has the question particle on it; superlative adjectives have no number/gender, etc.). The variable has to be set in each rule, and sometimes we have several chunk tags which may be empty or not at once (an adjective may be superlative or not, and have a question particle or not).

Idea: <chunk-pos part="art"/> would insert <4> if a member of the def-attr "art" appeared at place 4, etc., no tag if no member of "art" appears in the chunk tags. This number would have to be computed at run-time, but that's the way it is anyway with our variable; and this tag would make the transfer files a lot more explicit and cleaner (the first time I saw <lit-tag v="4"/> a little part of me died).
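
A small Python sketch of what the proposed lookup would compute at run-time. `<chunk-pos>` is only an idea, so the function name `chunk_pos`, the tag lists and the contents of the "art" set below are all assumptions for illustration.

```python
def chunk_pos(chunk_tags, attr_members):
    """Return the 1-based position of the first chunk tag that belongs to
    `attr_members`, or None if no member appears in the chunk tags."""
    for position, tag in enumerate(chunk_tags, start=1):
        if tag in attr_members:
            return position
    return None


# Hypothetical def-attr "art" and two hypothetical chunk tag sequences.
ART = {"def", "ind"}
plain = ["SN", "m", "sg", "ind"]
with_question = ["SN", "Qst", "m", "sg", "ind"]
superlative = ["SA", "sup"]

print(chunk_pos(plain, ART), chunk_pos(with_question, ART), chunk_pos(superlative, ART))
```

The same lexical unit would then get <4> in the first case and <5> in the second without any per-rule variable, and no tag at all when the chunk has no article tag.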

Multiple mapping tags

This is a problem:

$ echo 'guovvamánu
17
.' |hfst-lookup -f apertium sme-nob.automorf.hfst |cg-proc sme-nob.rlx.bin 
^guovvamánu/guovvamánnu<N><Sg><Acc><@←OBJ>$
^17/17<Num><Sg><Nom><@HNOUN><@APP-N←><@SUBJ><@SPRED>$
^./.<CLB>$

since we run cg-proc a second time after apertium-tagger. That second run bails out on seeing several mapping tags here:

$ echo 'guovvamánu
17
.' |hfst-lookup -f apertium sme-nob.automorf.hfst |cg-proc sme-nob.rlx.bin |apertium-tagger -p -g sme-nob.prob |cg-proc -w -n sme-nob.lex.bin
Error: addTagToReading() cannot add a mapping tag to a reading which already is mapped!

Possible solutions:

Manually ensure we always end up with just one mapping tag in CG (eg. with an AFTER-SECTIONS rule to pick an arbitrary tag)
- Possibly simple, but bad practice
Make cg-proc discard all but the first one
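
A sketch of the "discard all but the first one" option, written as a standalone Python filter rather than as a change to cg-proc. It assumes that mapping tags are exactly the tags starting with "@" and that it is given a single reading string; `keep_first_mapping_tag` is a hypothetical helper, not part of any existing tool.

```python
import re

# Matches one <tag> in an Apertium stream reading.
TAG = re.compile(r"<([^<>]+)>")


def keep_first_mapping_tag(reading):
    """Drop every '@' tag after the first one; leave all other tags alone."""
    seen_mapping = False

    def replace(match):
        nonlocal seen_mapping
        if match.group(1).startswith("@"):
            if seen_mapping:
                return ""  # a later mapping tag: discard it
            seen_mapping = True
        return match.group(0)

    return TAG.sub(replace, reading)


cleaned = keep_first_mapping_tag("17<Num><Sg><Nom><@HNOUN><@APP-N←><@SUBJ><@SPRED>")
print(cleaned)
```

Like the AFTER-SECTIONS workaround, this picks an arbitrary tag (whichever comes first), so it only avoids the error and does not choose the right syntactic function.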
