Talk:Northern Sámi and Norwegian

From Apertium

Transfer strategy

So far I've been thinking this:

  • t1x: chunking
    • Turn adjectives and nouns into SN chunks, give them the right gender and number
    • Derivations into phrases?
  • t2x: movement
    • Put adpositions in front of SN chunks
    • In general move SN chunks around verbs, adverbs etc. to get right word order
    • Guess definiteness from word order, case, syntactic function
  • t3x: cleanup
    • Eg. if definiteness changed, make sure adj tags are consistent


We could also do:
  • t1x: light chunking (SN, ...)
  • t2x: more chunking (Relatives, subordinate clauses)
  • t3x: moving around and stuff
  • t4x: cleanup.

- Francis Tyers 18:32, 18 January 2010 (UTC)

The 1-4 are different files, is that it? There are both easy and hard issues when it comes to phrases, which speaks in favour of 4. But what is the clear-cut criterion for light vs. heavy? Trondtr 12:26, 19 January 2010 (UTC)

We'll need rules to cover both compounding and derivation, which speaks for the 4-stage setup (eg. each noun could be a compound, multiplying each noun rule by two, or more if we have longer compounds?). We need to figure out what phenomena go in what stage though. unhammer 13:09, 19 January 2010 (UTC)
  • t1x
    • (de-)compounding,
    • derivation,
    • simple noun phrases (heads and their simple modifiers/specifiers: adj nom, adj adj nom, det adj adj nom, num adj nom),
    • simple periphrastic verb combinations (verb, vaux pp, vaux inf)
  • t2x
    • relatives (SN "who" SV -> SN)
    • co-ordination (SN "and" SN -> SN)
    • genitive modifiers (SN SN-Gen) "[University of Reykjavik] [big old library]-GEN"
  • t3x
    • move postpositions (SN ADPOS -> ADPOS SN) "[1 big house which is on the hill] [2 in]"
    • V2? --unhammer 13:04, 20 January 2010 (UTC) +1 Francis Tyers
    • Insert dropped pronouns? (Or tags for them?)--unhammer 14:25, 20 January 2010 (UTC) +1 Francis Tyers
  • t4x
    • Insert prepositions.
    • Insert articles? --unhammer 13:32, 20 January 2010 (UTC)
    • Cleanup
- Francis Tyers 14:37, 19 January 2010 (UTC)
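
To make the t3x postposition step above concrete, here is a minimal sketch in Python, not Apertium transfer code. It assumes chunks are represented as hypothetical (type, text) pairs and only illustrates the reordering SN ADPOS -> ADPOS SN; a real t3x rule would of course match on chunk patterns in the transfer file instead.

```python
def move_postpositions(chunks):
    """Swap every SN chunk that is immediately followed by an ADPOS chunk.

    `chunks` is a list of (chunk_type, text) pairs; this representation is
    an assumption made for the sketch, not the Apertium stream format.
    """
    result = []
    i = 0
    while i < len(chunks):
        is_pair = (
            i + 1 < len(chunks)
            and chunks[i][0] == "SN"
            and chunks[i + 1][0] == "ADPOS"
        )
        if is_pair:
            # Put the adposition in front of the noun phrase.
            result.extend([chunks[i + 1], chunks[i]])
            i += 2
        else:
            result.append(chunks[i])
            i += 1
    return result


example = [("SN", "big house which is on the hill"), ("ADPOS", "in")]
reordered = move_postpositions(example)
```

Because relatives are already folded into the SN chunk in t2x, the swap moves the whole "[big house which is on the hill]" in one go, which is the reason for doing the movement only in t3x.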


Level | Description | Test case
t1x | (de-)compounding | Politiijastašuvnna

Wishlist / Difficulties with the architecture / Ugly hacks

Clipping a substring in transfer (or any better solution)

For inserting prepositions, we first tried just adding them to the chunk name in t1x (adj_nom => til_adj_nom) and reading them off in t4x. However, since apertium-interchunk has no function to remove the first n letters of a string, we couldn't have a general method in t2x or t3x to eg. switch the preposition or remove it based on a larger context.

Having the preposition in a tag is rather ugly.

We ended up just adding it as a chunk -- however, this means that all t2x/t3x rules working on eg. SN now have to be duplicated for the possibility of PR SN too.
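
For reference, the string operation we were missing is tiny. The sketch below is plain Python, not something apertium-interchunk offers; the function names and the sample preposition list are made up for illustration.

```python
# Hypothetical sample list; a real one would come from the language pair.
KNOWN_PREPOSITIONS = {"til", "i", "på"}


def split_preposition(chunk_name):
    """Return (preposition, bare_name); preposition is None if no known prefix."""
    prefix, separator, rest = chunk_name.partition("_")
    if separator and prefix in KNOWN_PREPOSITIONS:
        return prefix, rest
    return None, chunk_name


def swap_preposition(chunk_name, new_preposition):
    """Replace the preposition prefix, or drop it if new_preposition is None."""
    _, bare_name = split_preposition(chunk_name)
    if new_preposition is None:
        return bare_name
    return f"{new_preposition}_{bare_name}"
```

With something like this available in t2x/t3x, til_adj_nom could be turned into i_adj_nom or back into adj_nom depending on context, without duplicating the SN rules for PR SN.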

No UTF in sdefs

@←SPRED is not a valid sdef. Is this just because of it being an XML ID?

I think you can have UTF-8 in sdefs, but they cannot start with a non-alphabetic character. - Francis Tyers 13:34, 10 February 2010 (UTC)

Automatic numbering for chunk tag variables

We use stuff like <lit-tag v="4"/> in the <lu>'s in t1x to say that this tag should be assigned the fourth chunk tag after postchunk, eg. <ind>. This is a handy feature which lightens the load on postchunk, but it hasn't been utilised to the fullest…

We currently have to use a variable to keep that number instead of just inserting the lit-tag directly in the lexical unit, since we don't know if it'll be the fourth or the fifth chunk tag or what; some tags may be empty (the Qst chunk tag is only there if the word has the question particle on it; superlative adjectives have no number/gender, etc.). The variable has to be set in each rule, and sometimes we have several chunk tags which may be empty or not at once (an adjective may be superlative or not, and have a question particle or not).

Idea: <chunk-pos part="art"/> would insert <4> if a member of the def-attr "art" appeared at place 4, etc., no tag if no member of "art" appears in the chunk tags. This number would have to be computed at run-time, but that's the way it is anyway with our variable; and this tag would make the transfer files a lot more explicit and cleaner (the first time I saw <lit-tag v="4"/> a little part of me died).
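
A rough sketch of what the proposed chunk-pos lookup would compute, written in Python only to pin down the intended behaviour. The chunk-pos element does not exist in Apertium; the function name, the tag strings other than <ind>, and the members of "art" are assumptions.

```python
import re


def chunk_pos(chunk_tags, attr_members):
    """Return '<n>' for the first chunk tag that is a member of attr_members.

    Positions are counted from 1. Returns '' if no member appears, matching
    the "no tag if no member appears" part of the proposal.
    """
    tags = re.findall(r"<([^<>]+)>", chunk_tags)
    for position, tag in enumerate(tags, start=1):
        if tag in attr_members:
            return f"<{position}>"
    return ""


ART = {"def", "ind"}  # assumed members of the def-attr "art"
plain = chunk_pos("<SN><m><sg><ind>", ART)
with_question_particle = chunk_pos("<SN><Qst><m><sg><ind>", ART)
no_article = chunk_pos("<SA><sup>", ART)
```

The point is that the same rule body would work whether or not the optional Qst tag is present, instead of setting a counter variable by hand in every rule.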

Multiple mapping tags

This is a problem:

$ echo 'guovvamánu
17
.' |hfst-lookup -f apertium sme-nob.automorf.hfst |cg-proc sme-nob.rlx.bin 
^guovvamánu/guovvamánnu<N><Sg><Acc><@←OBJ>$
^17/17<Num><Sg><Nom><@HNOUN><@APP-N←><@SUBJ><@SPRED>$
^./.<CLB>$

since we run cg-proc again after apertium-tagger. The second run of cg-proc bails out on seeing several mapping tags on one reading:

$ echo 'guovvamánu
17
.' |hfst-lookup -f apertium sme-nob.automorf.hfst |cg-proc sme-nob.rlx.bin |apertium-tagger -p -g sme-nob.prob |cg-proc -w -n sme-nob.lex.bin
Error: addTagToReading() cannot add a mapping tag to a reading which already is mapped!

Possible solutions:

  • Manually ensure we always end up with just one mapping tag in CG (eg. with an AFTER-SECTIONS rule to pick an arbitrary tag)
    • Possibly simple, but bad practice
  • Make cg-proc discard all but the first one
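
As a stopgap for the second option, the same effect could be had with a small filter between the two cg-proc runs. The following is only a sketch in Python, not a patch to cg-proc; it assumes mapping tags are exactly the tags starting with @, and it ignores escaped characters in the stream.

```python
import re

MAPPING_TAG = re.compile(r"<@[^<>]+>")
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def keep_first_mapping_tag(reading):
    """Drop every mapping tag in one reading except the first."""
    seen = False

    def replace(match):
        nonlocal seen
        if seen:
            return ""
        seen = True
        return match.group(0)

    return MAPPING_TAG.sub(replace, reading)


def filter_stream(line):
    """Apply keep_first_mapping_tag to each reading of each ^...$ unit."""

    def fix_unit(match):
        surface, *readings = match.group(1).split("/")
        cleaned = [keep_first_mapping_tag(reading) for reading in readings]
        return "^" + "/".join([surface] + cleaned) + "$"

    return LEXICAL_UNIT.sub(fix_unit, line)


fixed = filter_stream("^17/17<Num><Sg><Nom><@HNOUN><@APP-N←><@SUBJ><@SPRED>$")
```

Like the AFTER-SECTIONS rule, this picks an arbitrary tag (whichever happens to come first), so it has the same "bad practice" caveat.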