Difference between revisions of "North Saami and Lule Saami"

From Apertium

Revision as of 13:38, 6 October 2008

Files

  • apertium-sme-smj.sme.dix — Northern Sami transducer
  • apertium-sme-smj.sme-smj.dix — Transfer lexicon
  • apertium-sme-smj.smj.dix — Lule Sami transducer
  • apertium-sme-smj.sme-smj.rlx — Constraint grammar
  • apertium-sme-smj.sme-smj.t1x — Transfer rule file (level 1 -- Local re-ordering, chunking)
  • apertium-sme-smj.sme-smj.t2x — Transfer rule file (level 2 -- Phrase and chunk re-ordering)
  • apertium-sme-smj.sme-smj.t3x — Transfer rule file (level 3 -- Final touches)

TODO

  • Mapped tags in the CG contain characters that are special in Apertium, for example '>' (which delimits tags) and '-'. These need to be replaced or escaped in some way.
Example:
^Wikipedia<N><Prop><Sg><Nom><@SUBJ>>$
The extra '>' comes from the CG tag @SUBJ>.
  • Re-train the HMM-based POS tagger on a Sami corpus.
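
The tag clash described in the first TODO item can be spotted mechanically. The following is a minimal sketch, not part of the language pair: the helper name find_clashing_tags and the sample stream are assumptions made for illustration, and it only looks for angle brackets inside '@' tags, not for '-'.

```shell
# Sample stream: the first unit is the example from the TODO item,
# the other two are well-formed units used for contrast.
stream='^Wikipedia<N><Prop><Sg><Nom><@SUBJ>>$ ^leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^gosa<ADV>$'

# Hypothetical helper: print every '@' tag that contains an extra '<' right
# after the '@' or an extra '>' right before the closing '>'.
find_clashing_tags() {
    grep -oE '<@<[^<>]*>|<@[^<>]*>>'
}

clashes=$(printf '%s\n' "$stream" | find_clashing_tags)
echo "tags that clash with the stream format: $clashes"
```

A check like this could be run over the cg-proc output to see which mapped tags still need a replacement character.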


Testing

Analysing some Northern Sami text
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." |  lt-proc sme-smj.automorf.bin

^Wikipedia/Wikipedia<N><Prop><Sg><Nom>/Wikipedia<N><Prop><Sg><Gen>/Wikipedia<N><Prop><Sg><Acc>$ ^lea/leat<V><IV><Ind><Prs><Sg3>$ 
^máŋggagielat/*máŋggagielat$ ^prošeakta/prošeakta<N><Sg><Nom>$ ^man/man<ADV>$ ^ulbmilin/*ulbmilin$ ^lea/leat<V><IV><Ind><Prs><Sg3>$ 
^ráhkadit/ráhkadit<V><TV><Inf>/ráhkadit<V><TV><Ind><Prs><Pl3>/ráhkadit<V><TV><Ind><Prt><Sg2>$ ^almmolaš/almmolaš<A><Attr>/almmolaš<A><Sg><Nom>$ 
^diehtosátnegirjji/*diehtosátnegirjji$ 
^gosa/gosa<ADV>/gossat<V><IV><VGen>/gossat<V><IV><Imprt><Prs><ConNeg>/gossat<V><IV><Imprt><Prs><Sg2>/gossat<V><IV><Ind><Prs><ConNeg>$ ^gii/*gii$ 
^beare/beare<ADV>$ ^sáhttá/sáhttit<V><IV><Ind><Prs><Sg3>$ ^čállit/čállit<V><TV><Inf>/čállit<V><TV><Ind><Prs><Pl1>$ ^artihkkaliid/*artihkkaliid$.
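
In the output above, an analysis that starts with '*' marks a word the analyser does not know. As a rough sketch, the unknown words can be listed with standard tools; the helper name list_unknown is an assumption, and the sample input is a handful of units copied from the output above rather than a live lt-proc run.

```shell
# Sample input: a few lexical units copied from the analyser output above.
analysed='^man/man<ADV>$ ^ulbmilin/*ulbmilin$ ^lea/leat<V><IV><Ind><Prs><Sg3>$ ^gii/*gii$ ^beare/beare<ADV>$'

# Hypothetical helper: print the surface form of each unit whose
# analysis starts with '*' (unknown to the analyser).
list_unknown() {
    grep -o '\^[^/$]*/\*[^$]*\$' | sed 's|^\^\([^/]*\)/.*|\1|'
}

unknown=$(printf '%s\n' "$analysed" | list_unknown)
count=$(printf '%s\n' "$analysed" | list_unknown | wc -l | tr -d ' ')
total=$(printf '%s\n' "$analysed" | grep -o '\^[^$]*\$' | wc -l | tr -d ' ')

echo "$count of $total units are unknown:"
echo "$unknown"
```

In practice the echo of sample data would be replaced by the lt-proc command shown above.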

Disambiguating the text with Constraint Grammar
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." | lt-proc sme-smj.automorf.bin | cg-proc sme-smj.rlx.bin

^Wikipedia/Wikipedia<N><Prop><Sg><Nom><@SUBJ%>$ ^lea/leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^máŋggagielat/*máŋggagielat$ 
^prošeakta/prošeakta<N><Sg><Nom><@<SPRED>$ ^man/man<ADV>$ ^ulbmilin/*ulbmilin$ ^lea/leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ 
^ráhkadit/ráhkadit<V><TV><Inf><@-FMAINV>$ ^almmolaš/almmolaš<A><Sg><Nom><@%SUBJ>$ ^diehtosátnegirjji/*diehtosátnegirjji$ ^gosa/gosa<ADV>$ ^gii/*gii$ 
^beare/beare<ADV>$ ^sáhttá/sáhttit<V><IV><Ind><Prs><Sg3><@+FAUXV>$ ^čállit/čállit<V><TV><Inf><@-FMAINV>$ ^artihkkaliid/*artihkkaliid$
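
One way to see what the constraint grammar achieved is to count the units that still carry more than one analysis. The sketch below does this on sample units copied from the two outputs above; the helper name count_ambiguous is an assumption, and it relies on '/' appearing only as the separator between analyses.

```shell
# Sample units copied from the analyser output (before) and the cg-proc output (after).
before='^Wikipedia/Wikipedia<N><Prop><Sg><Nom>/Wikipedia<N><Prop><Sg><Gen>/Wikipedia<N><Prop><Sg><Acc>$ ^lea/leat<V><IV><Ind><Prs><Sg3>$ ^gii/*gii$'
after='^Wikipedia/Wikipedia<N><Prop><Sg><Nom><@SUBJ%>$ ^lea/leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^gii/*gii$'

# Hypothetical helper: put one lexical unit per line, then count the units
# with more than two '/'-separated fields (surface form plus 2+ analyses).
count_ambiguous() {
    grep -o '\^[^$]*\$' | awk -F'/' 'NF > 2 { n++ } END { print n + 0 }'
}

ambiguous_before=$(printf '%s\n' "$before" | count_ambiguous)
ambiguous_after=$(printf '%s\n' "$after" | count_ambiguous)
echo "ambiguous units before CG: $ambiguous_before, after CG: $ambiguous_after"
```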

Finishing off the disambiguation with Apertium's HMM tagger
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." | lt-proc sme-smj.automorf.bin | cg-proc sme-smj.rlx.bin | apertium-tagger -g sme-smj.prob

^Wikipedia<N><Prop><Sg><Nom><@SUBJ%>$ ^leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^*máŋggagielat$ ^prošeakta<N><Sg><Nom><@<SPRED>$ ^man<ADV>$ 
^*ulbmilin$ ^leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^ráhkadit<V><TV><Inf><@-FMAINV>$ ^almmolaš<A><Sg><Nom><@%SUBJ>$ ^*diehtosátnegirjji$ 
^gosa<ADV>$ ^*gii$ ^beare<ADV>$ ^sáhttit<V><IV><Ind><Prs><Sg3><@+FAUXV>$ ^čállit<V><TV><Inf><@-FMAINV>$ ^*artihkkaliid$

Applying lexical transfer and chunking
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." | lt-proc sme-smj.automorf.bin | cg-proc sme-smj.rlx.bin | apertium-tagger -g sme-smj.prob | \
apertium-transfer apertium-sme-smj.sme-smj.t1x sme-smj.t1x.bin sme-smj.autobil.bin

^nom<SN><@SUBJ%><Sg><Nom>{^@Wikipedia<N><Prop><Sg><Nom>$}$ ^default{^@leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$}$ ^unknown{^*máŋggagielat$}$ 
^nom<SN><Sg><Nom>{^@prošeakta<N><Sg><Nom>$}$ ^default{^<ADV>$}$ ^unknown{^*ulbmilin$}$ ^default{^@leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$}$ 
^default{^@ráhkadit<V><TV><Inf><@-FMAINV>$}$ ^default{^@almmolaš<A><Sg><Nom><@%SUBJ>$}$ ^unknown{^*diehtosátnegirjji$}$ ^default{^@gosa<ADV>$}$ 
^unknown{^*gii$}$ ^default{^@beare<ADV>$}$ ^default{^@sáhttit<V><IV><Ind><Prs><Sg3><@+FAUXV>$}$ ^default{^@čállit<V><TV><Inf><@-FMAINV>$}$ 
^unknown{^*artihkkaliid$}$
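
The chunker output above wraps each lexical unit in a named chunk (nom, default or unknown). As an illustrative sketch, the chunk names can be tallied to see how much of the sentence falls through to the default and unknown rules; the helper name count_chunks is an assumption, and the sample input is a few chunks copied from the output above.

```shell
# Sample input: four chunks copied from the transfer output above.
chunked='^nom<SN><@SUBJ%><Sg><Nom>{^@Wikipedia<N><Prop><Sg><Nom>$}$ ^default{^@leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$}$ ^unknown{^*ulbmilin$}$ ^default{^<ADV>$}$'

# Hypothetical helper: a chunk starts with '^' at the beginning of a line or
# after a space, followed by a lower-case name. Lexical units inside '{...}'
# are preceded by '{' and are therefore skipped.
count_chunks() {
    grep -oE '(^| )\^[a-z]+' | sed 's/.*\^//' |
        awk '{ c[$0]++ } END { for (k in c) print k, c[k] }' | sort
}

summary=$(printf '%s\n' "$chunked" | count_chunks)
echo "$summary"
```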