Difference between revisions of "North Saami and Lule Saami"
		
		
		
		
		
		
Revision as of 23:15, 16 June 2010
Files
- apertium-sme-smj.sme.dix: Northern Sami transducer
- apertium-sme-smj.sme-smj.dix: Transfer lexicon
- apertium-sme-smj.smj.dix: Lule Sami transducer
- apertium-sme-smj.sme-smj.rlx: Constraint grammar
- apertium-sme-smj.sme-smj.t1x: Transfer rule file (level 1: local re-ordering, chunking)
- apertium-sme-smj.sme-smj.t2x: Transfer rule file (level 2: phrase and chunk re-ordering)
- apertium-sme-smj.sme-smj.t3x: Transfer rule file (level 3: final touches)
TODO
Tagset mismatches
- eará -- ietjá
$ echo "eará" | osme
191480 0
eará	eará+Pron+Indef+Sg+Nom
eará	eará+Pron+Indef+Sg+Gen
eará	eará+Pron+Indef+Sg+Acc
eará	eará+Pron+Indef+Attr
$ echo "ietjá+Pron+Indef+Attr" | dsmj
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
ietjá+Pron+Indef+Attr	ietjá+Pron+Indef+Attr	+?
- buot -- divnna
$ echo "buot" | osme
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
buot	buot+Adv
buot	buot+Pron+Indef
$ echo "divnna" | osmj
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
divnna	divnna+Pron+Indef+Sg+Nom
divnna	divnna+Pron+Indef+Attr
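The mismatches above can be listed mechanically by comparing the tag sequences on each side. The following is a minimal sketch, assuming the lookup output is tab-separated as in the transcripts above; the function name `tag_sequences` and the sample strings are illustrative and not part of the language pair.

```python
def tag_sequences(lookup_output: str) -> set:
    """Collect the tag sequences (lemma stripped) from lookup-style output.

    Lines without a tab (progress bars, counters, blanks) are skipped, and
    so are lines whose last field is '+?', which marks a failed lookup.
    """
    sequences = set()
    for line in lookup_output.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or fields[-1] == "+?":
            continue
        analysis = fields[1]
        # 'buot+Pron+Indef' -> ('Pron', 'Indef')
        sequences.add(tuple(analysis.split("+")[1:]))
    return sequences


# Sample data copied from the 'buot -- divnna' transcript above.
SME_BUOT = (
    "0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%\n"
    "buot\tbuot+Adv\n"
    "buot\tbuot+Pron+Indef\n"
)
SMJ_DIVNNA = (
    "0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%\n"
    "divnna\tdivnna+Pron+Indef+Sg+Nom\n"
    "divnna\tdivnna+Pron+Indef+Attr\n"
)

sme = tag_sequences(SME_BUOT)
smj = tag_sequences(SMJ_DIVNNA)
print("only in sme:", sorted(sme - smj))
print("only in smj:", sorted(smj - sme))
```

Sequences that appear on only one side are candidates for a tag mapping in the transfer lexicon or for a fix in one of the analysers.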
Reminders
- In the transfer rule files, don't forget to escape the '+' character in tags, for example:
  - no: <attr-item tags="@+FMAINV"/>
  - yes: <attr-item tags="@\+FMAINV"/>
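A small check can catch forgotten escapes before the rule files are compiled. This is a rough sketch only: it scans the file text with a regular expression rather than an XML parser, and the helper names `escape_tag` and `unescaped_plus` are hypothetical.

```python
import re

# Matches the tags="..." value of an <attr-item> element.
ATTR_ITEM = re.compile(r'<attr-item\s+tags="([^"]*)"')
# A '+' that is not preceded by a backslash.
BARE_PLUS = re.compile(r"(?<!\\)\+")


def escape_tag(tag: str) -> str:
    """Return the tag with every bare '+' escaped as '\\+'."""
    return BARE_PLUS.sub(r"\\+", tag)


def unescaped_plus(rule_text: str) -> list:
    """List the attr-item tag values that still contain a bare '+'."""
    return [t for t in ATTR_ITEM.findall(rule_text) if BARE_PLUS.search(t)]


sample = '<attr-item tags="@+FMAINV"/>\n<attr-item tags="@\\+FMAINV"/>\n'
for tag in unescaped_plus(sample):
    print(f"needs escaping: {tag} -> {escape_tag(tag)}")
```

Running it over the text of a .t1x, .t2x or .t3x file would list the values that still need a backslash.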
Testing
- Analysing some Northern Sami text
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." | lt-proc sme-smj.automorf.bin
^Wikipedia/Wikipedia<N><Prop><Sg><Nom>/Wikipedia<N><Prop><Sg><Gen>/Wikipedia<N><Prop><Sg><Acc>$ ^lea/leat<V><IV><Ind><Prs><Sg3>$ 
^máŋggagielat/máŋggagielat<A><Attr>/máŋggagielat<A><Sg><Nom>$ ^prošeakta/prošeakta<N><Sg><Nom>$ 
^man/man<ADV>/mii<Pron><Interr><Sg><Gen>/mii<Pron><Interr><Sg><Acc>/mii<Pron><Rel><Sg><Gen>/mii<Pron><Rel><Sg><Acc>$ 
^ulbmilin/ulbmil<N><Ess>$ ^lea/leat<V><IV><Ind><Prs><Sg3>$ 
^ráhkadit/ráhkadit<V><TV><Inf>/ráhkadit<V><TV><Ind><Prs><Pl3>/ráhkadit<V><TV><Ind><Prt><Sg2>$ 
^almmolaš/almmolaš<A><Attr>/almmolaš<A><Sg><Nom>$ ^diehtosátnegirjji/diehtosátnegirji<N><Sg><Acc>$ 
^gosa/gosa<ADV>/gossat<V><IV><VGen>/gossat<V><IV><Imprt><Prs><ConNeg>/gossat<V><IV><Imprt><Prs><Sg2>/gossat<V><IV><Ind><Prs><ConNeg>$ 
^gii/gii<Pron><Interr><Sg><Nom>/gii<Pron><Rel><Sg><Nom>$ ^beare/beare<ADV>$ ^sáhttá/sáhttit<V><IV><Ind><Prs><Sg3>$ 
^čállit/čállit<V><TV><Inf>/čállit<V><TV><Ind><Prs><Pl1>$ ^artihkkaliid/artihkal<N><Pl><Gen>/artihkal<N><Pl><Acc>$.
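To see how much ambiguity the analyser leaves for the later stages, the output above can be split into lexical units and readings. The sketch below assumes the plain `^surface/reading/reading$` layout shown in this transcript and ignores escaped characters and superblanks; `parse_units` is an illustrative name, not an Apertium tool.

```python
import re

# One lexical unit: ^surface/reading1/reading2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]+)/([^$]+)\$")
TAG = re.compile(r"<([^<>]+)>")


def parse_units(stream: str) -> list:
    """Return [(surface, [(lemma, [tag, ...]), ...]), ...] for analyser output."""
    units = []
    for surface, rest in LEXICAL_UNIT.findall(stream):
        readings = []
        for reading in rest.split("/"):
            lemma = reading.split("<", 1)[0]
            readings.append((lemma, TAG.findall(reading)))
        units.append((surface, readings))
    return units


# Two units copied from the analyser output above.
sample = (
    "^lea/leat<V><IV><Ind><Prs><Sg3>$ "
    "^man/man<ADV>/mii<Pron><Interr><Sg><Gen>/mii<Pron><Interr><Sg><Acc>"
    "/mii<Pron><Rel><Sg><Gen>/mii<Pron><Rel><Sg><Acc>$"
)
for surface, readings in parse_units(sample):
    print(surface, len(readings), "reading(s)")
```

Units with more than one reading are the ones the constraint grammar and the tagger still have to resolve.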
- Disambiguating and annotating text with Constraint grammar
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." | lt-proc sme-smj.automorf.bin | cg-proc sme-smj.rlx.bin
^Wikipedia/Wikipedia<N><Prop><Sg><Nom><@SUBJ→>$ ^lea/leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^máŋggagielat/máŋggagielat<A><Attr><@→N>$ 
^prošeakta/prošeakta<N><Sg><Nom><@←SPRED>$ ^man/mii<Pron><Rel><Sg><Gen><@→N>$ ^ulbmilin/ulbmil<N><Ess><@SPRED→>$ 
^lea/leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^ráhkadit/ráhkadit<V><TV><Inf><@-FMAINV>$ ^almmolaš/almmolaš<A><Attr><@→N>$ 
^diehtosátnegirjji/diehtosátnegirji<N><Sg><Acc><@←OBJ>$ ^gosa/gosa<ADV>$ ^gii/gii<Pron><Rel><Sg><Nom><@SUBJ→>$ 
^beare/beare<ADV>$ ^sáhttá/sáhttit<V><IV><Ind><Prs><Sg3><@+FAUXV>$ ^čállit/čállit<V><TV><Inf><@-FMAINV>$ 
^artihkkaliid/artihkal<N><Pl><Acc><@←OBJ>$.
- Finishing off the disambiguation with Apertium's HMM tagger
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." | lt-proc sme-smj.automorf.bin | cg-proc sme-smj.rlx.bin | apertium-tagger -g sme-smj.prob
^Wikipedia<N><Prop><Sg><Nom><@SUBJ→>$ ^leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^máŋggagielat<A><Attr><@→N>$ 
^prošeakta<N><Sg><Nom><@←SPRED>$ ^mii<Pron><Rel><Sg><Gen><@→N>$ ^ulbmil<N><Ess><@SPRED→>$ 
^leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ ^ráhkadit<V><TV><Inf><@-FMAINV>$ ^almmolaš<A><Attr><@→N>$ 
^diehtosátnegirji<N><Sg><Acc><@←OBJ>$ ^gosa<ADV>$ ^gii<Pron><Rel><Sg><Nom><@SUBJ→>$ 
^beare<ADV>$ ^sáhttit<V><IV><Ind><Prs><Sg3><@+FAUXV>$ ^čállit<V><TV><Inf><@-FMAINV>$ 
^artihkal<N><Pl><Acc><@←OBJ>$.
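When checking the transfer rules against this output, it helps to see which lemmas carry which syntactic function tag. The following sketch assumes the tagger output format shown above, where the function tags start with `@` and use → and ← in place of angle brackets; `function_tags` is an illustrative helper.

```python
import re

UNIT = re.compile(r"\^([^$]+)\$")
FUNCTION_TAG = re.compile(r"<(@[^<>]+)>")


def function_tags(stream: str) -> list:
    """Return [(lemma, function tag or None), ...] for disambiguated output."""
    pairs = []
    for unit in UNIT.findall(stream):
        lemma = unit.split("<", 1)[0]
        match = FUNCTION_TAG.search(unit)
        pairs.append((lemma, match.group(1) if match else None))
    return pairs


# Three units copied from the tagger output above.
sample = (
    "^Wikipedia<N><Prop><Sg><Nom><@SUBJ→>$ "
    "^leat<V><IV><Ind><Prs><Sg3><@+FMAINV>$ "
    "^gosa<ADV>$"
)
for lemma, tag in function_tags(sample):
    print(lemma, tag)
```

Units that come back with `None`, such as `gosa` here, received no function tag from the constraint grammar.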
- Applying lexical transfer and chunking
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." |  lt-proc sme-smj.automorf.bin | cg-proc sme-smj.rlx.bin | apertium-tagger -g sme-smj.prob | \
apertium-transfer apertium-sme-smj.sme-smj.t1x sme-smj.t1x.bin sme-smj.autobil.bin
^nom<SN><@SUBJ→><Sg><Nom>{^Wikipedia<N><Prop><Sg><Nom>$}$ ^verb<SV><@+FMAINV>{^liehket<V><Ind><Prs><Sg3>$}$ ^nom<SN><@→N>{^@máŋggagielat<A><Attr>$}$ 
^nom<SN><Sg><Nom>{^prosjækta<N><Sg><Nom>$}$ ^pronom<SN><@→N><Sg><Gen>{^mij<Pron><Rel><Sg><Gen>$}$ ^nom<SN><@SPRED→><Ess>{^ulmme<N><Ess>$}$ 
^verb<SV><@+FMAINV>{^liehket<V><Ind><Prs><Sg3>$}$ ^verb<SV><@-FMAINV>{^dahkat<V><Inf>$}$ ^nom<SN><@→N>{^almulasj<A><Attr>$}$ 
^nom<SN><@←OBJ><Sg><Acc>{^@diehtosátnegirji<N><Sg><Acc>$}$ ^adv<Adv>{^ADV><ADV>$}$ ^pronom<SN><@SUBJ→><Sg><Nom>{^guhti<Pron><Rel><Sg><Nom>$}$ 
^adv<Adv>{^beru<ADV>$}$ ^verb<SV>{^sáhttet<V><Ind><Prs><Sg3>$}$ ^verb<SV><@-FMAINV>{^tjállet<V><Inf>$}$ 
^nom<SN><@←OBJ><Pl><Acc>{^artihkal<N><Pl><Acc>$}$.
- Running through the whole system
$ echo "Wikipedia lea máŋggagielat prošeakta man ulbmilin lea ráhkadit almmolaš diehtosátnegirjji gosa \
gii beare sáhttá čállit artihkkaliid." | apertium -d . sme-smj
Wikipedia l @máŋggagielat prosjækta man ulmmen l dahkat almulasj @diehtosátnegirji #ADV> guhti beru sáhttá tjállet artihkkalijt
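In Apertium output, a leading `*` conventionally marks a word unknown to the analyser, `@` a lemma with no entry in the bilingual dictionary, and `#` a form the generator could not produce. A short sketch for collecting these from a translated line follows; the function name `flagged_words` is illustrative.

```python
MARKERS = {
    "*": "unknown to the analyser",
    "@": "missing from the bilingual dictionary",
    "#": "could not be generated",
}


def flagged_words(translation: str) -> list:
    """Return [(token, reason), ...] for tokens that start with an error marker."""
    return [
        (token, MARKERS[token[0]])
        for token in translation.split()
        if token[0] in MARKERS
    ]


# Output line copied from the full-pipeline run above.
output = (
    "Wikipedia l @máŋggagielat prosjækta man ulmmen l dahkat almulasj "
    "@diehtosátnegirji #ADV> guhti beru sáhttá tjállet artihkkalijt"
)
for token, reason in flagged_words(output):
    print(f"{token}: {reason}")
```

For the sentence above this flags two lemmas for the transfer lexicon and the malformed `#ADV>` unit that first appears in the chunker output.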

