Difference between revisions of "Talk:Northern Sámi and Norwegian"



Latest revision as of 08:51, 8 June 2010

Transfer strategy

So far I've been thinking this:

  • t1x: chunking
    • Turn adjectives and nouns into SN chunks, give them the right gender and number
    • Derivations into phrases?
  • t2x: movement
    • Put adpositions in front of SN chunks
    • In general move SN chunks around verbs, adverbs etc. to get right word order
    • Guess definiteness from word order, case, syntactic function
  • t3x: cleanup
    • Eg. if definiteness changed, make sure adj tags are consistent


We could also do:
  • t1x: light chunking (SN, ...)
  • t2x: more chunking (Relatives, subordinate clauses)
  • t3x: moving around and stuff
  • t4x: cleanup.

- Francis Tyers 18:32, 18 January 2010 (UTC)

The 1-4 are different files, is that it? There are both easy and hard issues when it comes to phrases, which speaks in favour of 4. But what is the clear-cut criterion for light vs. heavy? - Trondtr 12:26, 19 January 2010 (UTC)

We'll need rules to cover both compounding and derivation, which speaks for 4-stage (e.g. each noun could be a compound, multiplying each noun rule by two, or more if we have longer compounds?). We need to figure out what phenomena go in what stage though. - unhammer 13:09, 19 January 2010 (UTC)
  • t1x
    • (de-)compounding,
    • derivation,
    • simple noun phrases (heads and their simple modifiers/specifiers: adj nom, adj adj nom, det adj adj nom, num adj nom),
    • simple periphrastic verb combinations (verb, vaux pp, vaux inf)
  • t2x
    • relatives (SN "who" SV -> SN)
    • co-ordination (SN "and" SN -> SN)
    • genitive modifiers (SN SN-Gen "[University of Reykjavik] [big old library]-GEN")
  • t3x
    • move postpositions (SN ADPOS -> ADPOS SN) "[1 big house which is on the hill] [2 in]"
    • V2? --unhammer 13:04, 20 January 2010 (UTC) +1 Francis Tyers
    • Insert dropped pronouns? (Or tags for them?)--unhammer 14:25, 20 January 2010 (UTC) +1 Francis Tyers
  • t4x
    • Insert prepositions.
    • Insert articles? --unhammer 13:32, 20 January 2010 (UTC)
    • Cleanup
- Francis Tyers 14:37, 19 January 2010 (UTC)


Level   Description        Test case
t1x     (de-)compounding   Politiijastašuvnna

Wishlist / Difficulties with the architecture / Ugly hacks

Clipping a substring in transfer (or any better solution)

For inserting prepositions, we first tried simply adding them to the chunk name in t1x (adj_nom => til_adj_nom) and reading them off in t4x. However, since apertium-interchunk has no function for removing the first n letters of a string, we had no general method in t2x or t3x to e.g. switch the preposition or remove it based on a larger context.

Having the preposition in a tag is rather ugly.

We ended up just adding it as a chunk; however, this means that all t2x/t3x rules working on e.g. SN now have to be duplicated to allow for PR SN too.
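To make the missing operation concrete, here is a minimal Python sketch of the kind of "clip a prefix off the chunk name" step we would have wanted in t2x/t3x. Nothing like this exists in apertium-interchunk as far as this page says; the function name `split_prep` and the preposition list are hypothetical (only "til" comes from the example above).

```python
from typing import Optional, Sequence, Tuple


def split_prep(
    chunk_name: str,
    preps: Sequence[str] = ("til", "på", "i"),  # hypothetical list; only "til" is from the page
) -> Tuple[Optional[str], str]:
    """Split a chunk name like 'til_adj_nom' into ('til', 'adj_nom').

    Returns (None, chunk_name) when no known preposition prefix is present.
    """
    for prep in preps:
        prefix = prep + "_"
        if chunk_name.startswith(prefix):
            # Clip the first len(prefix) letters: the operation the page says is missing.
            return prep, chunk_name[len(prefix):]
    return None, chunk_name


def with_prep(chunk_name: str, new_prep: Optional[str]) -> str:
    """Replace (or remove, if new_prep is None) the preposition prefix."""
    _, bare = split_prep(chunk_name)
    return bare if new_prep is None else f"{new_prep}_{bare}"
```

With something like this, `with_prep("til_adj_nom", "i")` would give `i_adj_nom` and `with_prep("til_adj_nom", None)` would give back plain `adj_nom`, so the SN rules would not need PR SN duplicates.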

No UTF in sdefs

@←SPRED is not a valid sdef. Is this just because of it being an XML ID?

I think you can have UTF-8 in sdefs, but they cannot start with a non-alphabetic character. - Francis Tyers 13:34, 10 February 2010 (UTC)

Automatic numbering for chunk tag variables

We use stuff like <lit-tag v="4"/> in the <lu> elements in t1x to say that this tag should be assigned the fourth chunk tag after postchunk, e.g. <ind>. This is a handy feature which lightens the load on postchunk, but it hasn't been utilised to the fullest…

We currently have to use a variable to keep that number instead of just inserting the lit-tag directly in the lexical unit, since we don't know if it'll be the fourth or the fifth chunk tag or what; some tags may be empty (the Qst chunk tag is only there if the word has the question particle on it; superlative adjectives have no number/gender, etc.). The variable has to be set in each rule, and sometimes we have several chunk tags which may be empty or not at once (an adjective may be superlative or not, and have a question particle or not).

Idea: <chunk-pos part="art"/> would insert <4> if a member of the def-attr "art" appeared at place 4, etc., no tag if no member of "art" appears in the chunk tags. This number would have to be computed at run-time, but that's the way it is anyway with our variable; and this tag would make the transfer files a lot more explicit and cleaner (the first time I saw <lit-tag v="4"/> a little part of me died).
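As a rough illustration of what the proposed `<chunk-pos part="art"/>` would compute at run-time, here is a small Python sketch. The element itself is only the idea above and does not exist; the function name `chunk_pos` and the example tag sets are assumptions made up for illustration.

```python
from typing import Collection, Sequence


def chunk_pos(chunk_tags: Sequence[str], attr_members: Collection[str]) -> str:
    """Return '<n>' where n is the 1-based position of the first chunk tag
    that is a member of the given def-attr, or '' if no member is present."""
    for position, tag in enumerate(chunk_tags, start=1):
        if tag in attr_members:
            return f"<{position}>"
    return ""


# Hypothetical def-attr "art" and chunk tag lists, for illustration only.
ART = {"def", "ind"}

print(chunk_pos(["SN", "m", "sg", "ind"], ART))          # article tag is the 4th chunk tag
print(chunk_pos(["SN", "qst", "m", "sg", "ind"], ART))   # an extra Qst tag shifts it to 5th
print(chunk_pos(["SN", "sup"], ART))                     # no member of "art": empty string
```

The point of the sketch is that the position depends only on which chunk tags actually ended up in the chunk, so the rule author would never have to track the number in a variable.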

[fixed] Multiple mapping tags

This was a problem:

$ echo 'guovvamánu
17
.' |hfst-lookup -f apertium sme-nob.automorf.hfst |cg-proc sme-nob.rlx.bin 
^guovvamánu/guovvamánnu<N><Sg><Acc><@←OBJ>$
^17/17<Num><Sg><Nom><@HNOUN><@APP-N←><@SUBJ><@SPRED>$
^./.<CLB>$

since we run cg-proc after apertium-tagger. The second run of cg-proc bails out on seeing several mapping tags here:

$ echo 'guovvamánu
17
.' |hfst-lookup -f apertium sme-nob.automorf.hfst |cg-proc sme-nob.rlx.bin |apertium-tagger -p -g sme-nob.prob |cg-proc -w -n sme-nob.lex.bin
Error: addTagToReading() cannot add a mapping tag to a reading which already is mapped!

Solution: keep readings with different mapping tags separate in the first cg-proc run:

$ echo 'guovvamánu
17
.' |hfst-lookup -f apertium sme-nob.automorf.hfst |cg-proc sme-nob.rlx.bin 
^guovvamánu/guovvamánnu<N><Sg><Acc><@←OBJ>$
^17/17<Num><Sg><Nom><@HNOUN>/17<Num><Sg><Nom><@APP-N←>/17<Num><Sg><Nom><@SUBJ>/17<Num><Sg><Nom><@SPRED>$
^./.<CLB>$

(this is actually how things work internally in vislcg3, but before output, regular vislcg3 merges mapping tags -- we override that in cg-proc)
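For readers who want to see the difference between the two outputs spelled out, here is a Python sketch that turns a reading with several mapping tags into one reading per mapping tag. It is not the cg-proc implementation, just a toy over the Apertium stream format; it assumes mapping tags are exactly the tags starting with `@`, and it ignores escaped slashes and anything outside a single `^...$` unit.

```python
import re

TAG = re.compile(r"<([^<>]+)>")


def split_mapping_tags(unit: str) -> str:
    """Rewrite one '^surface/reading/...$' unit so that every reading
    carries at most one mapping tag (a tag whose name starts with '@')."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError(f"not a lexical unit: {unit!r}")
    surface, *readings = unit[1:-1].split("/")  # simplification: no escaped '/'
    out = []
    for reading in readings:
        tags = TAG.findall(reading)
        mapping = [t for t in tags if t.startswith("@")]
        if len(mapping) <= 1:
            out.append(reading)  # nothing to split
            continue
        lemma = reading[: reading.index("<")]
        plain = "".join(f"<{t}>" for t in tags if not t.startswith("@"))
        # One copy of the reading per mapping tag.
        out.extend(f"{lemma}{plain}<{m}>" for m in mapping)
    return "^" + "/".join([surface] + out) + "$"


merged = "^17/17<Num><Sg><Nom><@HNOUN><@APP-N←><@SUBJ><@SPRED>$"
print(split_mapping_tags(merged))
```

Run on the merged reading for "17" from the problem case, this yields the four separate readings shown in the solution output, while units with zero or one mapping tag pass through unchanged.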

Headlines vs apertium-destxt

apertium-destxt adds an extra period if we have an empty line below:

$ echo 'foo
> 
> bar.
> fie.
> 
> foe'|apertium-destxt
foo.[][

]bar.[
]fie..[][

]foe.[][
]

Could we make the formatter add something else instead? Then we could tag it as a headline. As it is, we get double periods at the end of lines that end with a period and are followed by empty lines, which messes up CG since the rules think this means an ellipsis. If we instead got e.g.

foo¶[][

]bar.[
]fie.¶[][

]foe¶[][
]

we could tag ¶ as something like <sent><headline>, and CG would have a chance to expect headline language.

Whether the dot is added in any particular place depends on the format handler; apertium-deshtml adds a dot if we see a <br> followed by empty lines (but curiously not if we see the correct xhtml <br/>). However, the fact that it's a dot that's added rather than something else is hardcoded in deformat.xsl, so all formats that say "this marks an end of sentence" will use a dot as an eos-marker.

Ideally, we should be able to give an argument to apertium-desfoo (for all values of foo) that specifies the eos-marker, since different language pairs (or modes) might want different eos-markers.
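A toy Python sketch of what a configurable eos-marker could look like follows. It is not how apertium-destxt is implemented (that is generated from deformat.xsl); it only mimics the behaviour visible in the examples above, namely wrapping newline runs in superblanks and inserting a marker plus `[]` before a paragraph break or the final newline. The function name `toy_destxt` and its `eos_marker` parameter are hypothetical.

```python
import re


def toy_destxt(text: str, eos_marker: str = ".") -> str:
    """Wrap runs of newlines in [..] superblanks; before a run of two or more
    newlines, or a run at the very end of the input, insert eos_marker + '[]'."""
    out = []
    pos = 0
    for match in re.finditer(r"\n+", text):
        out.append(text[pos:match.start()])
        run = match.group()
        at_end = match.end() == len(text)
        if len(run) >= 2 or at_end:
            out.append(eos_marker + "[]")  # the inserted end-of-sentence marker
        out.append("[" + run + "]")
        pos = match.end()
    out.append(text[pos:])
    return "".join(out)


sample = "foo\n\nbar.\nfie.\n\nfoe\n"  # what echo sends in the example above
print(toy_destxt(sample))                   # dot as marker: gives the 'fie..' problem
print(toy_destxt(sample, eos_marker="¶"))   # pilcrow as marker: 'fie.¶', taggable as headline
```

The design point is only that the marker is an argument rather than a hardcoded dot, so each language pair or mode could pick a marker its analyser and CG rules know how to tag.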