Difference between revisions of "North Saami and Lule Saami"
Revision as of 13:38, 6 October 2008
Files
- apertium-sme-smj.sme.dix: Northern Sami transducer
- apertium-sme-smj.sme-smj.dix: Transfer lexicon
- apertium-sme-smj.smj.dix: Lule Sami transducer
- apertium-sme-smj.sme-smj.rlx: Constraint grammar
- apertium-sme-smj.sme-smj.t1x: Transfer rule file (level 1: local re-ordering, chunking)
- apertium-sme-smj.sme-smj.t2x: Transfer rule file (level 2: phrase and chunk re-ordering)
- apertium-sme-smj.sme-smj.t3x: Transfer rule file (level 3: final touches)
TODO
- Mapped tags in the CG contain characters that are special in Apertium, for example '>' (which delimits tags) and '-'. These characters should be replaced somehow.
  - Example: ^Wikipedia<N><Prop><Sg><Nom><@SUBJ>>$
  - The doubled '>' comes from the CG tag @SUBJ>
- Re-train the HMM-based POS tagger on a Sami corpus.
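One possible way to handle the first TODO item is a small stream filter that rewrites the offending characters after the CG step. The sketch below is only an illustration: the function name `escape_cg_tags` is hypothetical, and using '%' as the replacement character is an assumption borrowed from the `<@SUBJ%>` form that appears in the Testing output on this page. It does not address '-', and renaming the tags in the grammar itself may well be the better fix.

```shell
# Hypothetical post-processing filter (not part of Apertium or CG).
# Rewrites a '>' or '<' inside a CG syntactic tag to '%' so that the
# Apertium tag delimiters stay unambiguous.
escape_cg_tags() {
  sed -e 's/<\(@[^<>]*\)>>/<\1%>/g' \
      -e 's/<@<\([^<>]*\)>/<@%\1>/g'
}

printf '%s\n' '^Wikipedia<N><Prop><Sg><Nom><@SUBJ>>$' | escape_cg_tags
# prints: ^Wikipedia<N><Prop><Sg><Nom><@SUBJ%>$
```

The filter reads the stream on standard input, so it could be placed directly after `cg-proc` in a pipeline.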
Testing
- Analysing some Northern Sami text
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." | lt-proc sme-smj.automorf.bin

^Wikipedia/Wikipedia<N><Prop><Sg><Nom>/Wikipedia<N><Prop><Sg><Gen>/Wikipedia<N><Prop><Sg><Acc>$ ^lea/leat<V><IV><Ind><Prs><Sg3>$
^máŋggagielat/*máŋggagielat$ ^prošeakta/prošeakta<N><Sg><Nom>$ ^man/man<ADV>$ ^ulbmilin/*ulbmilin$ ^lea/leat<V><IV><Ind><Prs><Sg3>$
^ráhkadit/ráhkadit<V><TV><Inf>/ráhkadit<V><TV><Ind><Prs><Pl3>/ráhkadit<V><TV><Ind><Prt><Sg2>$ ^almmolaš/almmolaš<A><Attr>/almmolaš<A><Sg><Nom>$
^diehtosátnegirjji/*diehtosátnegirjji$
^gosa/gosa<ADV>/gossat<V><IV><VGen>/gossat<V><IV><Imprt><Prs><ConNeg>/gossat<V><IV><Imprt><Prs><Sg2>/gossat<V><IV><Ind><Prs><ConNeg>$ ^gii/*gii$
^beare/beare<ADV>$ ^sáhttá/sáhttit<V><IV><Ind><Prs><Sg3>$ ^čállit/čállit<V><TV><Inf>/čállit<V><TV><Ind><Prs><Pl1>$ ^artihkkaliid/*artihkkaliid$.
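In the analyser output, a word the transducer does not know is returned with an asterisk in front of it, as in `^gii/*gii$`. A minimal sketch for listing those words, using only standard `grep` and `sed`, might look like this (the function name `unknown_words` is made up for illustration):

```shell
# Print the surface form of every lexical unit whose analysis starts
# with '*', i.e. every word the analyser could not recognise.
unknown_words() {
  grep -o '\^[^/$]*/\*[^$]*\$' | sed 's/^\^\([^/]*\)\/.*$/\1/'
}

printf '%s\n' '^gii/*gii$ ^beare/beare<ADV>$ ^artihkkaliid/*artihkkaliid$.' | unknown_words
# prints:
# gii
# artihkkaliid
```

Appending `| unknown_words` to the `lt-proc` command gives a quick list of forms that are missing from the Northern Sami dictionary.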
- Disambiguating text with Constraint grammar
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." | lt-proc sme-smj.automorf.bin | cg-proc sme-smj.rlx.bin

^Wikipedia/Wikipedia<N><Prop><Sg><Nom><@SUBJ%>$ ^lea/leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^máŋggagielat/*máŋggagielat$
^prošeakta/prošeakta<N><Sg><Nom><@<SPRED>$ ^man/man<ADV>$ ^ulbmilin/*ulbmilin$ ^lea/leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$
^ráhkadit/ráhkadit<V><TV><Inf><@-FMAINV>$ ^almmolaš/almmolaš<A><Sg><Nom><@%SUBJ>$ ^diehtosátnegirjji/*diehtosátnegirjji$ ^gosa/gosa<ADV>$ ^gii/*gii$
^beare/beare<ADV>$ ^sáhttá/sáhttit<V><IV><Ind><Prs><Sg3><@+FAUXV>$ ^čállit/čállit<V><TV><Inf><@-FMAINV>$ ^artihkkaliid/*artihkkaliid$
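After the constraint grammar step each known word in this sentence carries a single analysis, whereas the raw analyser output had several for words such as "gosa". To check how much ambiguity remains at a given point in the pipeline, one could count the '/'-separated readings per lexical unit. The following is a rough sketch with a hypothetical helper name, `ambiguous_forms`:

```shell
# Print the surface form of every lexical unit that still has more than
# one analysis. A unit looks like ^surface/analysis1/analysis2$, so more
# than two '/'-separated fields means the form is ambiguous.
ambiguous_forms() {
  grep -o '\^[^$]*\$' | awk -F/ 'NF > 2 { print substr($1, 2) }'
}

printf '%s\n' '^gosa/gosa<ADV>/gossat<V><IV><VGen>$ ^beare/beare<ADV>$' | ambiguous_forms
# prints: gosa
```

Running the same filter after `lt-proc` and again after `cg-proc` shows which forms the grammar resolved.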
- Finishing off the disambiguation with Apertium's HMM tagger
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." | lt-proc sme-smj.automorf.bin | cg-proc sme-smj.rlx.bin | apertium-tagger -g sme-smj.prob

^Wikipedia<N><Prop><Sg><Nom><@SUBJ%>$ ^leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^*máŋggagielat$ ^prošeakta<N><Sg><Nom><@<SPRED>$ ^man<ADV>$
^*ulbmilin$ ^leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^ráhkadit<V><TV><Inf><@-FMAINV>$ ^almmolaš<A><Sg><Nom><@%SUBJ>$ ^*diehtosátnegirjji$
^gosa<ADV>$ ^*gii$ ^beare<ADV>$ ^sáhttit<V><IV><Ind><Prs><Sg3><@+FAUXV>$ ^čállit<V><TV><Inf><@-FMAINV>$ ^*artihkkaliid$
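The tagger output drops the surface form and keeps one lemma with its tags per word. For eyeballing the result it can help to reduce each unit to its lemma and first tag (the part of speech). A small sketch of that reduction, where `lemma_pos` is an illustrative name and unknown words (those starting with '*') are simply skipped:

```shell
# Print "lemma POS" for every disambiguated lexical unit of the form
# ^lemma<POS><more><tags>$; units without tags (unknown words) are omitted.
lemma_pos() {
  grep -o '\^[^$]*\$' | sed -n 's/^\^\([^<*]*\)<\([^>]*\)>.*$/\1 \2/p'
}

printf '%s\n' '^leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^*gii$ ^beare<ADV>$' | lemma_pos
# prints:
# leat V
# beare ADV
```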
- Applying lexical transfer and chunking
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." | lt-proc sme-smj.automorf.bin | cg-proc sme-smj.rlx.bin | apertium-tagger -g sme-smj.prob | \
apertium-transfer apertium-sme-smj.sme-smj.t1x sme-smj.t1x.bin sme-smj.autobil.bin

^nom<SN><@SUBJ%><Sg><Nom>{^@Wikipedia<N><Prop><Sg><Nom>$}$ ^default{^@leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$}$ ^unknown{^*máŋggagielat$}$
^nom<SN><Sg><Nom>{^@prošeakta<N><Sg><Nom>$}$ ^default{^<ADV>$}$ ^unknown{^*ulbmilin$}$ ^default{^@leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$}$
^default{^@ráhkadit<V><TV><Inf><@-FMAINV>$}$ ^default{^@almmolaš<A><Sg><Nom><@%SUBJ>$}$ ^unknown{^*diehtosátnegirjji$}$ ^default{^@gosa<ADV>$}$
^unknown{^*gii$}$ ^default{^@beare<ADV>$}$ ^default{^@sáhttit<V><IV><Ind><Prs><Sg3><@+FAUXV>$}$ ^default{^@čállit<V><TV><Inf><@-FMAINV>$}$
^unknown{^*artihkkaliid$}$
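The transfer output wraps each word in a chunk of the form `^name<tags>{...}$`; here the chunk names are "nom", "default" and "unknown". To see which chunk rules fired, the chunk names can be pulled out by discarding the braces and their contents first. This is a sketch under the assumption that chunks are not nested, as in the output above; `chunk_names` is a hypothetical helper:

```shell
# Print the name of each chunk, one per line, in order of appearance.
# Step 1 removes the {...} body of every chunk, step 2 grabs the text
# between '^' and the first '<' or '$', step 3 drops the leading '^'.
chunk_names() {
  sed 's/{[^}]*}//g' | grep -o '\^[^<$]*' | tr -d '^'
}

printf '%s\n' '^nom<SN><Sg><Nom>{^@x<N><Sg><Nom>$}$ ^default{^<ADV>$}$ ^unknown{^*ulbmilin$}$' | chunk_names
# prints:
# nom
# default
# unknown
```

Piping the result through `sort | uniq -c` gives a per-chunk count, which makes it easy to see how many words still fall through to the "default" and "unknown" chunks.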