Talk:Northern Sámi and Norwegian
So far I've been thinking this:
- t1x: chunking
  - Turn adjectives and nouns into SN chunks, give them the right gender and number
  - Derivations into phrases?
- t2x: movement
  - Put adpositions in front of SN chunks
  - In general move SN chunks around verbs, adverbs etc. to get right word order
  - Guess definiteness from word order, case, syntactic function
- t3x: cleanup
  - Eg. if definiteness changed, make sure adj tags are consistent
- We could also do:
  - t1x: light chunking (SN, ...)
  - t2x: more chunking (Relatives, subordinate clauses)
  - t3x: moving around and stuff
  - t4x: cleanup.
- Francis Tyers 18:32, 18 January 2010 (UTC)
The 1-4 are different files, is that it? There are both easy and hard issues when it comes to phrases, which speaks in favour of 4. But what is the clear-cut criterion for light vs. heavy? Trondtr 12:26, 19 January 2010 (UTC)
- We'll need rules to cover both compounding and derivation, which speaks for the 4-stage option (eg. each noun could be a compound, multiplying each noun rule by two, or more if we have longer compounds?). We need to figure out what phenomena go in what stage though. unhammer 13:09, 19 January 2010 (UTC)
- simple noun phrases (heads and their simple modifiers/specifiers: adj nom, adj adj nom, det adj adj nom, num adj nom),
- simple periphrastic verb combinations (verb, vaux pp, vaux inf)
- relatives (SN "who" SV -> SN)
- co-ordination (SN "and" SN -> SN)
- genitive modifiers (SN SN-Gen -> SN, eg. "[University of Reykjavik] [big old library]-GEN")
- Insert prepositions.
- Insert articles? --unhammer 13:32, 20 January 2010 (UTC)
- Francis Tyers 14:37, 19 January 2010 (UTC)
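To make the light vs. heavy split a bit more concrete, here is a minimal Python sketch of two stages working on a plain list of part-of-speech tags. It is not Apertium transfer code: the one-letter codes, the function names and the choice of which phenomenon goes in which stage are all assumptions made for illustration. Stage 1 only groups simple noun phrases (adj nom, det adj adj nom, num adj nom); stage 2 then handles co-ordination on top of the finished SN chunks.

```python
import re

# Hypothetical one-letter codes for part-of-speech tags, so that a tag
# sequence can be matched with an ordinary regular expression.
CODES = {"det": "d", "num": "n", "adj": "a", "nom": "N", "cnjcoo": "&"}

# Stage 1 pattern: optional det or num, any number of adj, then a noun.
SIMPLE_SN = re.compile(r"[dn]?a*N")


def light_chunk(tags):
    """Stage 1: group simple noun phrases into SN chunks."""
    # Every tag maps to exactly one character, so regex offsets are
    # also valid indices into the original tag list.
    coded = "".join(CODES.get(t, "?") for t in tags)
    out, pos = [], 0
    for m in SIMPLE_SN.finditer(coded):
        out.extend(tags[pos:m.start()])  # copy unmatched tags through
        out.append("SN")
        pos = m.end()
    out.extend(tags[pos:])
    return out


def coordinate(chunks):
    """Stage 2: merge SN cnjcoo SN into a single SN, left to right."""
    out = []
    for c in chunks:
        out.append(c)
        while out[-3:] == ["SN", "cnjcoo", "SN"]:
            out[-3:] = ["SN"]
    return out


tags = ["det", "adj", "adj", "nom", "cnjcoo", "num", "adj", "nom", "vblex"]
print(coordinate(light_chunk(tags)))
```

The point of the split is that the stage 2 rule only has to mention SN, not every possible shape of noun phrase; compounds and derivations would multiply the stage 1 patterns but leave stage 2 untouched.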
Wishlist / Difficulties with the architecture / Ugly hacks
Clipping a substring in transfer (or any better solution)
For inserting prepositions, we first tried just adding them to the chunk name in t1x (adj_nom => til_adj_nom) and reading them off in t4x. However, since there is no function in apertium-interchunk to remove the first n letters of a string, we couldn't have a general method in t2x or t3x to eg. switch the preposition or remove it based on a larger context.
Having the preposition in a tag is rather ugly.
We ended up just adding it as a chunk -- however, this means that all t2x/t3x rules working on eg. SN now have to be duplicated for the possibility of PR SN too.
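For reference, this is the kind of operation we were missing, written as a small Python sketch rather than as transfer code. Nothing here exists in Apertium; the function names and the preposition set are made up for illustration. It just shows that with a way to cut the prefix off the chunk name, switching or removing the preposition would be a couple of lines.

```python
# Hypothetical set of prepositions that t1x might have prefixed to chunk names.
PREPOSITIONS = {"til", "i", "på"}


def split_preposition(chunk_name):
    """Return (preposition, bare_name); preposition is None if absent."""
    head, sep, rest = chunk_name.partition("_")
    if sep and head in PREPOSITIONS:
        return head, rest
    return None, chunk_name


def switch_preposition(chunk_name, new_prep):
    """Replace the prefixed preposition, or drop it if new_prep is None."""
    _, bare = split_preposition(chunk_name)
    return f"{new_prep}_{bare}" if new_prep else bare


print(split_preposition("til_adj_nom"))         # ('til', 'adj_nom')
print(switch_preposition("til_adj_nom", "i"))   # i_adj_nom
print(switch_preposition("til_adj_nom", None))  # adj_nom
```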
No UTF in sdefs
@←SPRED is not a valid sdef. Is this just because of it being an XML ID?
- I think you can have UTF-8 in sdefs, but they cannot start with a non alphabetic character. - Francis Tyers 13:34, 10 February 2010 (UTC)
Automatic numbering for chunk tag variables
We use stuff like <lit-tag v="4"/> in the <lu>'s in t1x to say that this tag should be assigned the fourth chunk tag after postchunk, eg. <ind>. This is a handy feature which lightens the load on postchunk, but it hasn't been utilised to the fullest…
We currently have to use a variable to keep that number instead of just inserting the lit-tag directly in the lexical unit, since we don't know if it'll be the fourth or the fifth chunk tag or what; some tags may be empty (the Qst chunk tag is only there if the word has the question particle on it; superlative adjectives have no number/gender, etc.). The variable has to be set in each rule, and sometimes we have several chunk tags which may be empty or not at once (an adjective may be superlative or not, and have a question particle or not).
<chunk-pos part="art"/> would insert <4> if a member of the def-attr "art" appeared at place 4, etc., and no tag if no member of "art" appears in the chunk tags. This number would have to be computed at run-time, but that's the way it is anyway with our variable; and this tag would make the transfer files a lot more explicit and cleaner (the first time I saw <lit-tag v="4"/> a little part of me died).
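The lookup that the proposed <chunk-pos/> would do at run-time is small. Here is a Python sketch of it, assuming chunk tags are counted from 1 and that the members of the def-attr are available as a set; the function name and the example tag sets are invented for illustration and are not part of any Apertium API.

```python
def chunk_pos(chunk_tags, attr_members):
    """Return '<n>' for the first chunk tag that belongs to attr_members.

    Positions are 1-based (an assumption matching the <4> in the text).
    Returns an empty string if no chunk tag is a member, ie. no tag is
    inserted.
    """
    for position, tag in enumerate(chunk_tags, start=1):
        if tag in attr_members:
            return f"<{position}>"
    return ""


ART = {"def", "ind"}  # hypothetical members of the def-attr "art"

print(chunk_pos(["SN", "m", "sg", "ind"], ART))         # <4>
print(chunk_pos(["SN", "Qst", "m", "sg", "ind"], ART))  # <5>
print(chunk_pos(["SA", "sup"], ART))                    # empty string
```

With something like this, the optional Qst tag shifting everything one place to the right would no longer need a per-rule variable.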
This was a problem:
$ echo 'guovvamánu 17 .' |hfst-lookup -f apertium sme-nob.automorf.hfst |cg-proc sme-nob.rlx.bin
^guovvamánu/guovvamánnu<N><Sg><Acc><@←OBJ>$ ^17/17<Num><Sg><Nom><@HNOUN><@APP-N←><@SUBJ><@SPRED>$ ^./.<CLB>$
since we run cg-proc a second time after apertium-tagger. That second run of cg-proc bails out on seeing several mapping tags on one reading:
$ echo 'guovvamánu 17 .' |hfst-lookup -f apertium sme-nob.automorf.hfst |cg-proc sme-nob.rlx.bin |apertium-tagger -p -g sme-nob.prob |cg-proc -w -n sme-nob.lex.bin
Error: addTagToReading() cannot add a mapping tag to a reading which already is mapped!
Solution: keep readings with different mapping tags separate in the first cg-proc run
$ echo 'guovvamánu 17 .' |hfst-lookup -f apertium sme-nob.automorf.hfst |cg-proc sme-nob.rlx.bin
^guovvamánu/guovvamánnu<N><Sg><Acc><@←OBJ>$ ^17/17<Num><Sg><Nom><@HNOUN>/17<Num><Sg><Nom><@APP-N←>/17<Num><Sg><Nom><@SUBJ>/17<Num><Sg><Nom><@SPRED>$ ^./.<CLB>$
(this is actually how things work internally in vislcg3, but before output, regular vislcg3 merges mapping tags -- we override that in cg-proc)
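As an illustration of what "keeping readings separate" means on the output side, here is a Python sketch that takes one merged reading string and expands it into one reading per mapping tag. This is not how cg-proc does it (cg-proc simply skips the merge), and the function name is made up. It assumes mapping tags are exactly the tags starting with @, and that no tag contains a literal < or > (which holds here since the arrows are written as ←).

```python
import re

TAG = re.compile(r"<([^<>]+)>")


def split_mapping_tags(reading):
    """Expand 'lemma<t1><t2><@A><@B>' into one reading per mapping tag."""
    lemma = reading[:reading.index("<")] if "<" in reading else reading
    tags = TAG.findall(reading)
    plain = [t for t in tags if not t.startswith("@")]
    mapping = [t for t in tags if t.startswith("@")]
    # Note: plain tags are emitted first, then a single mapping tag.
    base = lemma + "".join(f"<{t}>" for t in plain)
    if not mapping:
        return [base]
    return [f"{base}<{m}>" for m in mapping]


merged = "17<Num><Sg><Nom><@HNOUN><@APP-N←><@SUBJ><@SPRED>"
print("/".join(split_mapping_tags(merged)))
```

For the merged reading of 17 above, this prints the same four-way ambiguous form as in the fixed cg-proc output.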
Headlines vs apertium-destxt
apertium-destxt adds an extra period if we have an empty line below:
$ echo 'foo
>
> bar.
> fie.
>
> foe'|apertium-destxt
foo.[ ]bar.[ ]fie..[ ]foe.[ ]
Could we make the formatter add something else instead? Then we could tag it as a headline. As it is, we get double periods at the end of lines ending with a period and followed by empty lines, which messes up CG since the rules think this means an ellipsis. If we instead got eg.
foo¶[ ]bar.[ ]fie.¶[ ]foe¶[ ]
we could tag ¶ as something like <sent><headline>, and CG would have a chance to expect headline language.
Whether the dot is added in any particular place depends on the format handler; apertium-deshtml adds a dot if we see a <br> followed by empty lines (but curiously not if we see the correct XHTML <br/>). However, the fact that it's a dot that's added, rather than something else, is hardcoded in deformat.xsl, so all formats that say "this marks an end of sentence" will use a dot as the eos-marker.
Ideally, we should be able to give an argument to apertium-desfoo (for all values of foo) that specifies the eos-marker, since different language pairs (or modes) might want different eos-markers.
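A rough Python model of the behaviour asked for, just to pin down what the argument would change. This is not the code generated from deformat.xsl; it only imitates the plain-text case described above (a marker is appended to a line that is followed by an empty line or by the end of input, and line breaks go into superblanks). The function name and the eos_marker parameter are hypothetical.

```python
def deformat(text, eos_marker="."):
    """Toy model of a text deformatter with a configurable eos-marker."""
    lines = text.split("\n")
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line == "":
            i += 1  # skip leading empty lines
            continue
        # Find the next non-empty line (or the end of input).
        j = i + 1
        while j < len(lines) and lines[j] == "":
            j += 1
        blank_after = j > i + 1 or j == len(lines)
        out.append(line + (eos_marker if blank_after else ""))
        # The line breaks that were consumed go into a superblank.
        out.append("[" + "\n" * (j - i) + "]")
        i = j
    return "".join(out)


sample = "foo\n\nbar.\nfie.\n\nfoe"
print(deformat(sample))        # default marker: fie. becomes fie..
print(deformat(sample, "¶"))   # headline-friendly marker
```

With the default marker this reproduces the double period after "fie."; with "¶" the inserted marker can be told apart from a real period, so it could be tagged separately downstream.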