Difference between revisions of "Northern Sámi and Norwegian/smemorf"
(31 intermediate revisions by the same user not shown)
Line 1: | Line 1:
− | apertium-sme-nob
+ | Documentation and TODO's for apertium-sme-nob's sme morphological analyser from Giellatekno.
==Description==
− | The sme morphological analyser is a trimmed version of the one in Giellatekno. It's all contained in the files apertium-sme-nob.sme.lexc (lexicon) apertium-sme-nob.sme.twol (two-level morphology). The twol file is a plain copy of twol-sme.txt. The lexc file is a concatenation of sme-lex.txt and the various POS-sme-lex.txt files in gt/sme/src (e.g. verb-sme-lex.txt, adj-sme-lex.txt). However, for each of those POS-files, the apertium lexc file only contains lines where the lemma exists in bidix.
+ | The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.
− | We keep the lexc file up to date with the bidix and the giellatekno entries with the python script update-morph/update-lexc.py.
+ | ===Trimming===
+ | Trimming happens using the same HFST method as in the Turkic pairs etc. [[Automatically_trimming_a_monodix#Compounds_vs_trimming_in_HFST|Compounds are not handled correctly by this method]].
− | {{TOCD}}
+ | ===Tagset changes===
− | ** Missing [http://codepad.org/Dusebd68 nouns], [http://apertium.codepad.org/6Kr6H7RO adverbs], [http://codepad.org/7Hadok6S adjectives]
+ | The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.
+ | * We remove Err/Orth/Usage tags (+Use/Sub, etc) –
+ | * We remove any derivational analyses that aren't yet handled by transfer/bidix
+ | ** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/filters/remove-derivation-strings-modifications.nob.regex
+ | ** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/remove-illegal-derivation-strings.regex
+ | ** [[Northern Sámi and Norwegian/Derivations|More on derivations in sme-nob]]
+ | * We reorder some tags
+ | ** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/reorder-subpos-tags.sme.regex
+ | * We change the format of tags, so +N becomes <n>, +Der/PassL becomes <der_passl>, etc.
+ | * We change certain tags
+ | ** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
+ | ** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.postproc.relabel
+ | ** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.nob.regex
+ | ==TODO==
* regex for URL's (don't want telefonkatalogen.no => *telefonkatalogen.nå)
* regex for acronyms like "GsoC:as" (tokenisation dependent...)
* <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated
− | ==Typos==
+ | ===Typos===
I've seen "odda" many places ("ođđa"), can we just add these to the analyser? (Would be cool to have charlifter/[[Diacritic Restoration]], but until then…)
* a list of [http://paste.pocoo.org/raw/498998/ high-frequency typos] where the correction has an analysis
− | ==Dashes==
+ | ===Dashes===
− | lexc handles dashes by adding them literally (like a lemma), doing that in with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).
+ | lexc handles dashes by adding them literally (like a lemma), doing that in with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).
− | ==Compounding==
− | ===ensure compounding is only tried if there is no other solution===
− | Most general solution: Use a weighted transducer, and give the compound border (ie the dynamic compounding border, the R lexicon) a non-zero weight.
− | However, we have a CG rule that removes compounds if there are other readings, so we're OK for now.
− | ==Multiwords==
+ | ===Multiwords===
Add simple multiwords and fixed expressions to the analyser.
Latest revision as of 09:46, 15 April 2015
Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.
Description
The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.
Trimming
Trimming uses the same HFST method as the Turkic pairs, among others. Compounds are not handled correctly by this method (see "Automatically trimming a monodix", section "Compounds vs trimming in HFST").
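To picture what trimming is meant to achieve, here is a toy sketch in Python. It is not the HFST method itself (that works on transducers, not on lists of strings); it only assumes that an analysis survives when it starts with a lemma+POS prefix known to bidix. The prefix set, the example analyses and the function name `trim` are all made up for illustration.

```python
# Toy illustration of trimming: keep only analyses whose lemma+POS
# prefix is known to the bilingual dictionary (bidix).
# The prefixes and analyses below are hypothetical examples.

def trim(analyses, bidix_prefixes):
    """Return the analyses that start with some bidix prefix."""
    return [a for a in analyses
            if any(a.startswith(p) for p in bidix_prefixes)]


bidix_prefixes = {"viessu<n>", "boahtit<v>"}
analyses = [
    "viessu<n><sg><nom>",   # lemma is in bidix: kept
    "guolli<n><sg><nom>",   # lemma is not in bidix: dropped
]
print(trim(analyses, bidix_prefixes))
```

Note that a plain prefix check like this one only ever looks at the first lemma of an analysis, which may hint at why compounds need extra care.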
Tagset changes
The tagset in apertium-sme-nob is closer to Apertium's tagset, and thus differs somewhat from the one in Giellatekno.
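- We remove Err/Orth/Usage tags (+Use/Sub, etc.)
- We remove any derivational analyses that aren't yet handled by transfer/bidix
  - https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/filters/remove-derivation-strings-modifications.nob.regex
  - https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/remove-illegal-derivation-strings.regex
  - More on derivations in sme-nob (see Northern Sámi and Norwegian/Derivations)
- We reorder some tags
  - https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/reorder-subpos-tags.sme.regex
- We change the format of tags, so +N becomes <n>, +Der/PassL becomes <der_passl>, etc.
- We change certain tags
  - https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
  - https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.postproc.relabel
  - https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.nob.regex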
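The tag format change can be sketched in a few lines of Python. This is only a guess generalised from the two examples above (+N becomes <n>, +Der/PassL becomes <der_passl>): lowercase the tag, turn "/" into "_", and wrap it in angle brackets. The real conversion is done by the relabel/regex files, not by this code; `convert_tags` and the sample analysis are hypothetical.

```python
import re

# One Giellatekno-style tag: a "+" followed by everything up to the next "+".
TAG = re.compile(r"\+([^+]+)")


def convert_tags(analysis):
    """Rewrite +Tag sequences as <tag>, assuming the lemma contains no '+'."""
    return TAG.sub(
        lambda m: "<" + m.group(1).lower().replace("/", "_") + ">",
        analysis,
    )


print(convert_tags("viessu+N+Sg+Nom"))  # viessu<n><sg><nom>
print(convert_tags("+Der/PassL"))       # <der_passl>
```

Tags that are renamed rather than just reformatted would still need the separate relabel step.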
TODO
Misc
- add entries from bidix that are missing from the analyser
- regex for URLs (we don't want telefonkatalogen.no => *telefonkatalogen.nå)
- regex for acronyms like "GsoC:as" (tokenisation-dependent...)
- <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- "lávet" should be analysed as both TV and IV, and then disambiguated
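For the URL and acronym items above, a rough starting point could look like the Python sketch below. Both patterns are assumptions made for illustration and not what the analyser uses: the URL pattern accepts an optional scheme plus a dotted host name ending in a TLD of two or more letters, and the acronym pattern accepts a mixed-case acronym followed by a colon and a suffix. As the note says, whether tokens like these even arrive in one piece depends on tokenisation.

```python
import re

# Hypothetical patterns; they would need tuning against real text.
URL = re.compile(r"^(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?$",
                 re.IGNORECASE)
ACRONYM = re.compile(r"^([A-Z][A-Za-z]*[A-Z][A-Za-z]*):(\w+)$")


def classify(token):
    """Label a token as 'url', 'acronym' or 'word'."""
    if URL.match(token):
        return "url"
    if ACRONYM.match(token):
        return "acronym"
    return "word"


for tok in ["telefonkatalogen.no", "GsoC:as", "ođđa"]:
    print(tok, classify(tok))
```

A token classified as a URL could then be passed through untranslated, which would avoid output like *telefonkatalogen.nå.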
Typos
I've seen "odda" in many places (for "ođđa"); can we just add these to the analyser? (It would be cool to have charlifter/Diacritic Restoration, but until then…)
- a list of high-frequency typos where the correction has an analysis: http://paste.pocoo.org/raw/498998/
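Until proper diacritic restoration is available, candidate corrections for typos like "odda" could be found with a simple lookup, sketched here in Python. This is not charlifter; it assumes we have a list of correctly spelled word forms (here a two-word toy list) and indexes each one under its diacritic-free spelling. The folding table and the function names are made up for this example.

```python
# Map Northern Sámi letters with diacritics to their plain look-alikes.
FOLD = str.maketrans("áčđŋšŧžÁČĐŊŠŦŽ", "acdnstzACDNSTZ")


def fold(word):
    """Strip diacritics according to the table above."""
    return word.translate(FOLD)


def build_index(wordlist):
    """Index correctly spelled forms under their folded spelling."""
    index = {}
    for w in wordlist:
        index.setdefault(fold(w), []).append(w)
    return index


def suggest(token, index):
    """Suggest forms with diacritics for a token typed without them."""
    if token != fold(token):
        return []  # the token already contains diacritics
    return [w for w in index.get(token, []) if w != token]


index = build_index(["ođđa", "áigi"])  # toy word list
print(suggest("odda", index))
```

Only typos with exactly one suggestion would be safe to add to the analyser automatically; ambiguous ones would need a human look.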
Dashes
lexc handles dashes by adding them literally (like a lemma). Doing that with a bidix pardef would be very messy (also, wouldn't it cause issues with lemma-matching in CG?).
Multiwords
Add simple multiwords and fixed expressions to the analyser.
- dasa lassin => i tillegg (til det)
- dán áigge => for tiden
- mun ieš => meg selv
- bures boahtin => velkommen
- Buorre beaivi => God dag
- leat guollebivddus => å fiske
- maid ban dainna => hva i all verden
- jagis jahkái => fra år til år
- oaidnaleapmái => 'see you'
- ovdamearkka => for eksempel
- Mo manná? => Hvordan går det?
- ja nu ain => og så videre
(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:
- oktavuođas => i forbindelse med
It would be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.
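What treating fixed expressions as single units buys us can be sketched outside the analyser with a greedy longest-match pass over tokens, shown below in Python. This only illustrates the idea; in the analyser itself the expressions would be lexicon entries. The set holds three expressions from the list above, and `group_mwes` is a hypothetical helper.

```python
# A few of the fixed expressions listed above, as token tuples.
MWES = {
    ("dasa", "lassin"),
    ("dán", "áigge"),
    ("bures", "boahtin"),
}
MAX_LEN = max(len(m) for m in MWES)


def group_mwes(tokens):
    """Join known multiword expressions into single tokens (longest match first)."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            cand = tuple(t.lower() for t in tokens[i:i + n])
            if cand in MWES:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No expression starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return out


print(group_mwes("Bures boahtin dán áigge".split()))
```

With the expression as one unit, transfer only has to map a single lexical item to the Norwegian side instead of reassembling it from separately analysed words.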