Difference between revisions of "Northern Sámi and Norwegian/smemorf"

 
(30 intermediate revisions by the same user not shown)
Line 1: Line 1:
apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.
+
Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.
   
 
==Description==
 
==Description==
The sme morphological analyser is a ''trimmed'' version of the one in Giellatekno. It's all contained in the files
 
   
 
The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.
* apertium-sme-nob.sme.lexc (lexicon)
 
   
  +
===Trimming===
* apertium-sme-nob.sme.twol (two-level morphology)
 
   
  +
Trimming happens using the same HFST method as in the Turkic pairs etc. [[Automatically_trimming_a_monodix#Compounds_vs_trimming_in_HFST|Compounds are not handled correctly by this method]].
The twol file is a plain copy of twol-sme.txt. The lexc file is a concatenation of sme-lex.txt and the various POS-sme-lex.txt files in gt/sme/src (e.g. verb-sme-lex.txt, adj-sme-lex.txt). However, for each of those POS-files, the apertium lexc file only contains lines where the lemma exists in bidix.
 
   
  +
===Tagset changes===
We keep the lexc file up to date with the bidix and the Giellatekno entries using the Python script update-morph/update-lexc.py and a configuration file based on update-morph/langs.cfg.in. The configuration file specifies which -lex.txt source files are to be copied verbatim, which are to be trimmed, and any POS tags to restrict the trimming to. For trimming, the script loads the compiled bidix FST (sme-nob.autobil.bin) and, for each line in the files that are to be trimmed, checks whether the lemma (plus possible POS tags) can be analysed with the FST. So if noun-sme-lex.txt has
 
<pre>
 
beron GAHPIR ;
 
beroštupmi:berošt UPMI ;
 
beroštus#riidu:beroštus#rij'du ALBMI ;
 
beroštus#vuostálasvuohta+CmpN/SgG+CmpN/DefPlGen:beroštus#vuostálasvuoh'ta LUONDU ;
 
</pre>
 
and the config says to append <code><N></code> when trimming nouns, it will try sending <code>^beron<N>$ ^beroštupmi<N>$ ^beroštusvuostálasvuohta<N>$</code> through sme-nob.autobil.bin. If beron gives a match, that line is included; if beroštupmi doesn't, that line is excluded; and so on.
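
The trimming step described above can be sketched roughly as follows. This is a minimal illustration and not the actual update-lexc.py: the names <code>lexc_lemma</code> and <code>trim_lexc</code> are hypothetical, and a plain Python set of lemma-plus-tag strings stands in for the lookup against the compiled bidix FST.

```python
import re


def lexc_lemma(line):
    """Return the lemma of a lexc entry line, or None if it is not an entry.

    Assumes entries look like 'upper:lower CONTLEX ;' or 'lemma CONTLEX ;'.
    """
    m = re.match(r"^\s*(\S+)\s+\S+\s*;", line)
    if not m:
        return None
    upper = m.group(1).split(":", 1)[0]  # upper side, before the colon
    upper = upper.split("+", 1)[0]       # drop tags such as +CmpN/SgG
    return upper.replace("#", "")        # drop compound-border marks


def trim_lexc(lines, bidix_entries, pos_tag="<N>"):
    """Keep only entry lines whose lemma plus pos_tag is in bidix_entries.

    bidix_entries is a set standing in for the real FST lookup (an assumption).
    Lines that are not entries (e.g. LEXICON headers) are kept unchanged.
    """
    kept = []
    for line in lines:
        lemma = lexc_lemma(line)
        if lemma is None or lemma + pos_tag in bidix_entries:
            kept.append(line)
    return kept
```

With the four example lines above and a stand-in set containing only <code>beron<N></code>, <code>trim_lexc</code> would keep just the beron line. A set lookup is used here only to keep the sketch self-contained; the description above says the real check goes through sme-nob.autobil.bin.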
 
   
  +
The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.
   
  +
* We remove Err/Orth/Usage tags (+Use/Sub, etc.)
{{TOCD}}
 
  +
* We remove any derivational analyses that aren't yet handled by transfer/bidix
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/filters/remove-derivation-strings-modifications.nob.regex
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/remove-illegal-derivation-strings.regex
  +
** [[Northern Sámi and Norwegian/Derivations|More on derivations in sme-nob]]
  +
* We reorder some tags
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/reorder-subpos-tags.sme.regex
  +
* We change the format of tags, so +N becomes &lt;n&gt;, +Der/PassL becomes &lt;der_passl&gt;, etc.
  +
* We change certain tags
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.postproc.relabel
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.nob.regex
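
The change of tag format mentioned in the list above (+N becoming an n tag in angle brackets, +Der/PassL becoming der_passl) can be illustrated with a small sketch. Only those two mappings come from the text; applying the same lowercase-and-underscore rule to every tag is an assumption, the function name <code>giella_to_apertium</code> is hypothetical, and the real conversion also involves the relabel and regex files linked above.

```python
import re


def giella_to_apertium(analysis):
    """Rewrite '+Tag' style tags as '<tag>': lowercased, with '/' turned into '_'."""
    def convert(match):
        tag = match.group(1).lower().replace("/", "_")
        return "<" + tag + ">"
    # Each tag starts at a '+' and runs up to the next '+' or the end of the string.
    return re.sub(r"\+([^+]+)", convert, analysis)
```

For instance, an analysis string such as <code>beron+N+Sg+Nom</code> (an illustrative input, not taken from the analyser output) would come out with the tags n, sg and nom in angle brackets after the lemma.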
   
==Misc==
+
==TODO==
  +
===Misc===
 
* add entries from bidix that are missing from the analyser
 
* add entries from bidix that are missing from the analyser
** Missing [http://codepad.org/Dusebd68 nouns], [http://apertium.codepad.org/6Kr6H7RO adverbs], [http://codepad.org/7Hadok6S adjectives]
 
 
 
 
* regex for URLs (don't want telefonkatalogen.no => *telefonkatalogen.nå)
 
* regex for URLs (don't want telefonkatalogen.no => *telefonkatalogen.nå)
 
 
 
* regex for acronyms like "GsoC:as" (tokenisation dependent...)
 
* regex for acronyms like "GsoC:as" (tokenisation dependent...)
 
 
 
* <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated
 
* <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated
   
==Typos==
+
===Typos===
 
I've seen "odda" in many places (for "ođđa"); can we just add such typos to the analyser? (Would be cool to have charlifter/[[Diacritic Restoration]], but until then…)
 
I've seen "odda" in many places (for "ođđa"); can we just add such typos to the analyser? (Would be cool to have charlifter/[[Diacritic Restoration]], but until then…)
   
 
* a list of [http://paste.pocoo.org/raw/498998/ high-frequency typos] where the correction has an analysis
 
* a list of [http://paste.pocoo.org/raw/498998/ high-frequency typos] where the correction has an analysis
   
==Dashes==
+
===Dashes===
lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?). Currently we remove the dashes in dev/xfst2apertium.useless.twol (and in certain cases re-add them in transfer as the tag &lt;dash&gt;), but perhaps we could add a tag there …
+
lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).
 
==Compounding==
 
===ensure compounding is only tried if there is no other solution===
 
Most general solution: use a weighted transducer, and give the compound border (i.e. the dynamic compounding border, the R lexicon) a non-zero weight.
 
 
However, we have a CG rule that removes compounds if there are other readings, so we're OK for now.
 
   
==Multiwords==
+
===Multiwords===
 
Add simple multiwords and fixed expressions to the analyser.
 
Add simple multiwords and fixed expressions to the analyser.
   

Latest revision as of 09:46, 15 April 2015

Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.

Description

The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.

Trimming

Trimming happens using the same HFST method as in the Turkic pairs etc. Compounds are not handled correctly by this method.

Tagset changes

The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.

TODO

Misc

  • add entries from bidix that are missing from the analyser
  • regex for URLs (don't want telefonkatalogen.no => *telefonkatalogen.nå)
  • regex for acronyms like "GsoC:as" (tokenisation dependent...)
  • 8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated
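
As a rough illustration of the URL item in the list above, a pattern like the following could recognise URL-like tokens before analysis, so that a final ".no" is never looked up as a word. The pattern and the name looks_like_url are hypothetical, and the short list of top-level domains is only an example, not a proposal for the actual analyser.

```python
import re

# Hypothetical pattern: optional scheme, dot-separated labels, then a known TLD.
URL_RE = re.compile(
    r"^(?:https?://)?"     # optional scheme
    r"(?:[\w-]+\.)+"       # one or more labels, each followed by a dot
    r"(?:no|com|org|net)"  # example top-level domains only
    r"(?:/\S*)?$",         # optional path
    re.IGNORECASE,
)


def looks_like_url(token):
    """Return True if the whole token looks like a URL or a bare domain name."""
    return URL_RE.match(token) is not None
```

A token such as telefonkatalogen.no matches as a whole, while a bare "no" does not, since at least one label followed by a dot is required before the top-level domain.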

Typos

I've seen "odda" in many places (for "ođđa"); can we just add such typos to the analyser? (Would be cool to have charlifter/Diacritic Restoration, but until then…)

Dashes

lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).

Multiwords

Add simple multiwords and fixed expressions to the analyser.

  • dasa lassin => i tillegg (til det)
  • dán áigge => for tiden
  • mun ieš => meg selv
  • bures boahtin => velkommen
  • Buorre beaivi => God dag
  • leat guollebivddus => å fiske
  • maid ban dainna => hva i all verden
  • jagis jahkái => fra år til år
  • oaidnaleapmái => 'see you'
  • ovdamearkka => for eksempel
  • Mo manná? => Hvordan går det?
  • ja nu ain => og så videre

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

  • oktavuođas => i forbindelse med

It would be a lot simpler for transfer if such a fixed expression were analysed as oktavuođas.Po in the first place.