Difference between revisions of "Northern Sámi and Norwegian/smemorf"

From Apertium
Line 2: Line 2:
   
 
==Description==
 
{{TOCD}}
 
   
The sme morphological analyser is a ''trimmed'' version of the one in Giellatekno. Everything we get from Giellatekno is contained in the files
+
The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.
   
  +
===Trimming===
* '''apertium-sme-nob.sme.lexc''' (lexicon)
 
   
  +
Trimming happens using the same HFST method as in the Turkic pairs etc. [[Automatically_trimming_a_monodix#Compounds_vs_trimming_in_HFST|Compounds are not handled correctly by this method]].
* '''apertium-sme-nob.sme.twol''' (two-level morphology)
 
 
The twol file is a plain copy of twol-sme.txt. The lexc file is a concatenation of sme-lex.txt and the various POS-sme-lex.txt files in gt/sme/src (e.g. verb-sme-lex.txt, adj-sme-lex.txt). However, for each of those POS-files, the apertium lexc file only contains lines where the lemma exists in bidix (see the below section).
 
   
 
===Tagset changes===
 
   
The tagset in apertium-sme-nob is, for various reasons, a bit different from the one in Giellatekno.
+
The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.
  +
We don't use the regex rules (gt/common/src/*.xfst) to remove Der2 and Use/Sub tags (though we may start doing this later). Tag fixes happen using a few two-level rules and similar files that are composed onto the analyser:
 
 
#* We remove Usage tags (+Use/Sub, etc), see the set ''Useless''
# '''[http://apertium.svn.sourceforge.net/viewvc/apertium/staging/apertium-sme-nob/xfst2apertium.useless.twol?revision=HEAD xfst2apertium.useless.twol]'''
 
 
#* We remove any derivational analyses that aren't yet handled by transfer/bidix
#* This file is composed first
 
#* It removes Usage tags (+Use/Sub, etc), see the set ''Useless''
 
#* It removes any derivational analyses that aren't yet handled by transfer/bidix, see the set ''UnhandledDerivation''
 
 
#** [[Northern Sámi and Norwegian/Derivations|More on derivations in sme-nob]]
 
#* It also removes the - from split compound lemmas so that they may be looked up in bidix.
+
#* We remove the - from split compound lemmas so that they may be looked up in bidix.
 
#* We remove the #-mark between those compounds that are lexicalised/non-dynamic (this should not be necessary any longer?)
# '''[http://apertium.svn.sourceforge.net/viewvc/apertium/staging/apertium-sme-nob/xfst2apertium.hashtags.twol?revision=HEAD xfst2apertium.hashtags.twol]'''
 
 
#* We also ensure the +G3 tag occurs ''after'' the +N tag, a common upstream bug in the lexc files
#* This file is composed second
 
 
#* We change the format of tags, so +N becomes <N>, +Der/1 becomes <Der_1>, etc.
#* It removes the #-mark between those compounds that are lexicalised/non-dynamic (this should not be necessary any longer)
 
#* It also ensures the +G3 tag occurs ''after'' the +N tag, a common error in the lexc files
 
# '''[http://apertium.svn.sourceforge.net/viewvc/apertium/staging/apertium-sme-nob/xfst2apertium.relabel?revision=HEAD xfst2apertium.relabel]'''
 
#* This file is used with hfst-substitute to change the format of tags, so +N becomes <N>, +Der/1 becomes <Der_1>, etc.
 
 
===Updating the lexc when giellatekno/bidix changes===
 
We keep the lexc file up to date with the bidix and the Giellatekno entries using the Python script '''update-morph/update-lexc.py''' and a per-user configuration file based on '''update-morph/langs.cfg.in'''. The configuration file says which -lex.txt source files are to be copied verbatim, which are to be trimmed, and which POS tags, if any, to restrict the trimming to. For trimming, the script loads the compiled bidix FST ('''sme-nob.autobil.bin''') and, for each line in the files to be trimmed, checks whether the lemma (plus any POS tags) can be analysed by that FST. So if noun-sme-lex.txt has
 
<pre>
 
beron GAHPIR ;
 
beroštupmi:berošt UPMI ;
 
beroštus#riidu:beroštus#rij'du ALBMI ;
 
beroštus#vuostálasvuohta+CmpN/SgG+CmpN/DefPlGen:beroštus#vuostálasvuoh'ta LUONDU ;
 
</pre>
 
and the config says to append <code><N></code> when trimming nouns, the script will try sending <code>^beron<N>$ ^beroštupmi<N>$ ^beroštusvuostálasvuohta<N>$</code> through sme-nob.autobil.bin. If beron gives a match, that line is included; if beroštupmi doesn't, that line is excluded, and so on. (If the bidix actually specified <code>^beron<N><Actor>$</code>, the line would still be included, since a partial match is enough; it's not perfect, but it saves a lot of trouble.)
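The lemma check described above can be sketched roughly as follows. This is a minimal illustration, not the code of update-lexc.py. The function names, the <code>in_bidix</code> callable (a stand-in for a lookup in the compiled bidix FST) and the set used in the usage note are all hypothetical. It assumes tags follow the lemma after the first <code>+</code>, ignores lexc escapes such as <code>%</code>, and does not model the partial matching mentioned above.

```python
def lexc_lemma(line):
    """Return the lemma of a lexc entry line, or None if it is not an entry."""
    line = line.strip()
    if not line or line.startswith("!") or not line.endswith(";"):
        return None  # blank line, comment, or e.g. a LEXICON header
    # An entry looks like "upper:lower CONTLEX ;" or "form CONTLEX ;".
    upper = line.split()[0].split(":")[0]
    # Cut off any +Tags, then drop the # compound boundary mark.
    return upper.split("+")[0].replace("#", "")


def lookup_string(lemma, tags="<N>"):
    """Build the string that is sent through the bidix FST."""
    return "^" + lemma + tags + "$"


def trim(lines, in_bidix, tags="<N>"):
    """Keep non-entry lines and those entries whose lemma the bidix knows.

    in_bidix is a hypothetical callable standing in for the FST lookup.
    """
    kept = []
    for line in lines:
        lemma = lexc_lemma(line)
        if lemma is None or in_bidix(lookup_string(lemma, tags)):
            kept.append(line)
    return kept


# The four noun entries quoted above.
SAMPLE_LINES = [
    "beron GAHPIR ;",
    "beroštupmi:berošt UPMI ;",
    "beroštus#riidu:beroštus#rij'du ALBMI ;",
    "beroštus#vuostálasvuohta+CmpN/SgG+CmpN/DefPlGen:beroštus#vuostálasvuoh'ta LUONDU ;",
]
```

Calling <code>trim(SAMPLE_LINES, {"^beron<N>$"}.__contains__)</code> keeps only the beron line, mirroring the case where bidix knows beron but none of the other lemmas.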
 
 
So to add new words to the lexc:
 
# the word has to be in the relevant gt/sme/src/SOMETHING.lex.txt file in your copy of Giellatekno SVN
 
# the word has to be in bidix ('''apertium-sme-nob.sme-nob.dix''') with a translation
 
# you have to have created '''update-morph/langs.cfg'''
 
#* You do this by copying <code>update-morph/langs.cfg.in</code> and editing the <code>SRC</code> line to point to where you checked out the gt/sme/src directory from Giellatekno SVN. There's a README in the update-morph/ directory with more information.
 
# then you just run <code>make</code> to generate and compile the analyser (this will first ensure bidix is compiled)
 
 
Note: if you get a warning about not finding PYTHON, you have to re-run autogen.sh like this: <code>sh autogen.sh PYTHON=python2.6</code>, exchanging <code>python2.6</code> for whatever python2 version you have installed.
 
   
 
==TODO==
 

Revision as of 11:24, 13 June 2014

Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.

Description

The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.

Trimming

Trimming happens using the same HFST method as in the Turkic pairs etc. Compounds are not handled correctly by this method.

Tagset changes

The tagset in apertium-sme-nob is closer to Apertium's tagset, and thus a bit different from the one in Giellatekno.

    • We remove Usage tags (+Use/Sub, etc), see the set Useless
    • We remove any derivational analyses that aren't yet handled by transfer/bidix
    • We remove the - from split compound lemmas so that they may be looked up in bidix.
    • We remove the #-mark between those compounds that are lexicalised/non-dynamic (this should not be necessary any longer?)
    • We also ensure the +G3 tag occurs after the +N tag, a common upstream bug in the lexc files
    • We change the format of tags, so +N becomes <N>, +Der/1 becomes <Der_1>, etc.
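The last two bullets can be illustrated with a small Python sketch. It is not how the pair does it, and the function names and the sample analysis strings are made up for illustration. Replacing <code>/</code> with <code>_</code> is inferred from the single +Der/1 example above and may not cover every tag.

```python
import re

# A Giellatekno-style tag: "+" followed by anything up to the next "+".
TAG = re.compile(r"\+([^+\s]+)")


def fix_g3_order(analysis):
    """Move a +G3 tag that precedes +N so that it follows +N."""
    return re.sub(r"\+G3\+N(?=\+|$)", "+N+G3", analysis)


def relabel(analysis):
    """Rewrite +Tag as <Tag>, turning "/" inside a tag into "_"."""
    return TAG.sub(lambda m: "<" + m.group(1).replace("/", "_") + ">", analysis)
```

For example, <code>relabel("beron+N+Sg+Nom")</code> gives <code>beron<N><Sg><Nom></code>, and <code>relabel("+Der/1")</code> gives <code><Der_1></code>.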

TODO

regex vs twol

  • Investigate if we can use at least some of the xfst scripts from giellatekno instead of xfst2apertium.useless.twol.

Misc

  • add entries from bidix that are missing from the analyser
  • regex for URLs (we don't want telefonkatalogen.no => *telefonkatalogen.nå)
  • regex for acronyms like "GsoC:as" (tokenisation dependent...)
  • 8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated
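As a starting point for the URL item, here is a hedged Python sketch of the kind of pattern that might be wanted. In the analyser itself this would have to be written as an xfst/lexc regex; the pattern and the name <code>is_url_like</code> are assumptions for illustration only. It will also accept two words joined by a full stop with no space, so it is deliberately rough.

```python
import re

URL_LIKE = re.compile(
    r"(?:https?://)?"   # optional scheme
    r"(?:[\w-]+\.)+"    # one or more labels, each ending in a dot
    r"[a-z]{2,}"        # something that looks like a top-level domain
    r"(?:/\S*)?",       # optional path
    re.IGNORECASE,
)


def is_url_like(token):
    """True if the whole token looks like a URL or a bare domain name."""
    return URL_LIKE.fullmatch(token) is not None
```

With this, <code>is_url_like("telefonkatalogen.no")</code> is true, while ordinary words such as "ođđa" and suffixed acronyms such as "GsoC:as" are not matched.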

Typos

I've seen "odda" in many places (for "ođđa"); can we just add such forms to the analyser? (Would be cool to have charlifter/Diacritic Restoration, but until then…)

Dashes

lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?). Currently we remove the dashes in xfst2apertium.useless.twol (and in certain cases re-add them in transfer as the tag <dash>), but perhaps we could add a tag there …

Multiwords

Add simple multiwords and fixed expressions to the analyser.

  • dasa lassin => i tillegg (til det)
  • dán áigge => for tiden
  • mun ieš => meg selv
  • bures boahtin => velkommen
  • Buorre beaivi => God dag
  • leat guollebivddus => å fiske
  • maid ban dainna => hva i all verden
  • jagis jahkái => fra år til år
  • oaidnaleapmái => 'see you'
  • ovdamearkka => for eksempel
  • Mo manná? => Hvordan går det?
  • ja nu ain => og så videre

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

  • oktavuođas => i forbindelse med

It'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.