Difference between revisions of "Northern Sámi and Norwegian/smemorf"

Revision as of 14:12, 12 April 2012

Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.

Description

The twol file is a plain copy of twol-sme.txt. The lexc file is a concatenation of sme-lex.txt and the various POS-sme-lex.txt files in gt/sme/src (e.g. verb-sme-lex.txt, adj-sme-lex.txt). However, for each of those POS-files, the apertium lexc file only contains lines where the lemma exists in bidix.

In addition, a few two-level rule files and similar are composed onto the analyser:

xfst2apertium.useless.twol
- This file is composed first, and removes Usage tags (+Use/Sub, etc), and removes any derivational analyses that aren't yet handled by transfer/bidix
- It also removes the - from split compound lemmas.
- More on derivations in sme-nob
xfst2apertium.hashtags.twol
- This file is composed second and removes the #-mark between those compounds that are lexicalised (non-dynamic)
- It also ensures the +G3 tag occurs after the +N tag
xfst2apertium.relabel
- This file is used with hfst-substitute to change the format of tags, so +N becomes <N>, +Der/1 becomes <Der_1>, etc.
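The tag reformatting described for xfst2apertium.relabel can be illustrated with a small Python sketch. This is not how the build does it (that is hfst-substitute with the relabel file); the function name relabel and the lemma "foo" are hypothetical, and the sketch assumes the lemma itself contains no + character.

```python
def relabel(analysis: str) -> str:
    """Rewrite Giellatekno-style +Tag sequences as Apertium-style <Tag>.

    Assumption: everything before the first '+' is the lemma, and a '/'
    inside a tag becomes '_' (so +Der/1 turns into <Der_1>).
    """
    lemma, *tags = analysis.split("+")
    return lemma + "".join("<" + tag.replace("/", "_") + ">" for tag in tags)


if __name__ == "__main__":
    print(relabel("beron+N"))        # lemma taken from the lexc example
    print(relabel("foo+V+Der/1"))    # "foo" is a made-up lemma
```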

Updating the lexc when giellatekno/bidix changes

We keep the lexc file up to date with the bidix and the giellatekno entries using the Python script update-morph/update-lexc.py and a per-user configuration file based on update-morph/langs.cfg.in. The configuration file says which -lex.txt source files are to be copied as-is, which are to be trimmed, and which POS tags (if any) to restrict the trimming to. For trimming, the script loads the compiled bidix FST (sme-nob.autobil.bin) and, for each line in the files that are to be trimmed, checks whether the lemma (plus possible POS tags) can be analysed with the FST. So if noun-sme-lex.txt has

beron GAHPIR ;
beroštupmi:berošt UPMI ;
beroštus#riidu:beroštus#rij'du ALBMI ;
beroštus#vuostálasvuohta+CmpN/SgG+CmpN/DefPlGen:beroštus#vuostálasvuoh'ta LUONDU ;

and the config says to append <N> when trimming nouns, it will try sending ^beron<N>$, ^beroštupmi<N>$, ^beroštusvuostálasvuohta<N>$ and so on through sme-nob.autobil.bin. If beron gives a match, that line is included; if beroštupmi doesn't, it is excluded; etc. (If the bidix actually specified ^beron<N><Actor>$, the line would still be included since it's a partial match; it's not perfect, but it saves a lot of trouble.)
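The trimming step can be sketched roughly as follows. This is not the code of update-lexc.py: the helper names lexc_lemma and trim are hypothetical, and the in_bidix callback is a stand-in for the real lookup in sme-nob.autobil.bin. The sketch assumes simple entries of the form "upper:lower CONTINUATION ;" with no escaped spaces.

```python
def lexc_lemma(line: str) -> str:
    """Extract the lemma from a lexc entry line."""
    entry = line.split()[0]            # 'upper:lower' or just 'upper'
    upper = entry.split(":", 1)[0]     # keep the upper (lemma) side
    upper = upper.split("+", 1)[0]     # drop tags such as +CmpN/SgG
    return upper.replace("#", "")      # drop compound boundary marks


def trim(lines, pos_tag, in_bidix):
    """Keep only the lines whose lemma+tag is accepted by in_bidix."""
    kept = []
    for line in lines:
        if not line.strip():
            continue                   # skip blank lines
        query = "^" + lexc_lemma(line) + "<" + pos_tag + ">$"
        if in_bidix(query):
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Toy bidix: a prefix test crudely imitates the partial match,
    # so ^beron<N>$ is accepted although the entry has an extra tag.
    bidix = ["beron<N><Actor>"]
    lookup = lambda q: any(e.startswith(q[1:-1]) for e in bidix)
    source = ["beron GAHPIR ;", "beroštupmi:berošt UPMI ;"]
    print(trim(source, "N", lookup))
```

With the toy bidix above, only the beron line survives, which mirrors the inclusion and exclusion described in the text.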

So to add new words to the lexc:

the word has to be in giellatekno's lexc
the word has to be in bidix (apertium-sme-nob.sme-nob.dix) with a translation
the bidix has to be compiled (make sme-nob.autobil.bin)
and then you can run /usr/bin/python2.6 update-morph/update-lexc.py --config=update-morph/my-langs.cfg
- you create update-morph/my-langs.cfg by copying update-morph/langs.cfg.in and editing the SRC line to point to where you checked out the sme morphology from Giellatekno svn
and then you can run make to compile the analyser

For simple copy-pasting, the last three steps are:

make sme-nob.autobil.bin &&
/usr/bin/python2.6 update-morph/update-lexc.py --config=update-morph/my-langs.cfg &&
make

(given that /usr/bin/python2.6 is your python2 version, and your personal copy of langs.cfg.in is stored as update-morph/my-langs.cfg)

TODO

regex vs twol

Investigate if we can use at least some of the xfst scripts from giellatekno instead of xfst2apertium.useless.twol.

Misc

add entries from bidix that are missing from the analyser
regex for URLs (don't want telefonkatalogen.no => *telefonkatalogen.nå)
regex for acronyms like "GsoC:as" (tokenisation dependent...)
8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated
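As a rough illustration of the URL item above, here is a Python sketch of the kind of pattern that would keep a token like telefonkatalogen.no from being analysed as ordinary words. The real solution would presumably be a regex in the analyser itself; the pattern and the name is_url are assumptions made for this example and are far from a complete URL grammar.

```python
import re

# Optional scheme, one or more dot-terminated labels, a final label of
# at least two letters, and an optional path.
URL_RE = re.compile(
    r"^(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?$",
    re.IGNORECASE,
)


def is_url(token: str) -> bool:
    """Return True if the whole token looks like a bare domain or URL."""
    return URL_RE.match(token) is not None


if __name__ == "__main__":
    for token in ["telefonkatalogen.no", "beron", "manná?"]:
        print(token, is_url(token))
```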

Typos

I've seen "odda" in many places (for "ođđa"); can we just add these to the analyser? (Would be cool to have charlifter/Diacritic Restoration, but until then…)

a list of high-frequency typos where the correction has an analysis
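One way such a typo list could be collected is sketched below, under the assumption that most of these typos are missing diacritics. The names RESTORE, candidates and typo_list are hypothetical, has_analysis stands in for a lookup in the real analyser, and only lowercase letters are handled. The number of candidates grows exponentially with word length, so this is only meant for short, frequent words.

```python
from itertools import product

# Plain letter -> the plain letter plus its Northern Sámi diacritic variant.
RESTORE = {"a": "aá", "c": "cč", "d": "dđ", "n": "nŋ",
           "s": "sš", "t": "tŧ", "z": "zž"}


def candidates(word):
    """All spellings obtained by optionally restoring diacritics."""
    options = [RESTORE.get(ch, ch) for ch in word]
    return {"".join(combo) for combo in product(*options)} - {word}


def typo_list(words, has_analysis):
    """Map each unanalysed word to its analysable restorations."""
    result = {}
    for word in words:
        if has_analysis(word):
            continue                  # not a typo as far as we can tell
        fixes = sorted(c for c in candidates(word) if has_analysis(c))
        if fixes:
            result[word] = fixes
    return result


if __name__ == "__main__":
    known = {"ođđa"}                  # toy stand-in for the analyser
    print(typo_list(["odda", "ođđa"], known.__contains__))
```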

Dashes

lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?). Currently we remove the dashes in xfst2apertium.useless.twol (and in certain cases re-add them in transfer as the tag <dash>), but perhaps we could add a tag there …

Multiwords

Add simple multiwords and fixed expressions to the analyser.

dasa lassin => i tillegg (til det)
dán áigge => for tiden
mun ieš => meg selv
bures boahtin => velkommen
Buorre beaivi => God dag
leat guollebivddus => å fiske
maid ban dainna => hva i all verden
jagis jahkái => fra år til år
oaidnaleapmái => 'see you'
ovdamearkka => for eksempel
Mo manná? => Hvordan går det?
ja nu ain => og så videre

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

oktavuođas => i forbindelse med

It'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.
