Lemmatisation

From Apertium
Revision as of 11:25, 29 January 2016 by Francis Tyers

Here is an example of how to lemmatise using apertium-swe.

Install:

  • Lttoolbox
  • Apertium
  • VISLCG
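
Before going further, it can help to confirm that the tools are on your PATH. The following is a minimal sketch, assuming the packages install the usual binaries lt-proc, apertium, vislcg3, cg-comp and cg-proc; check_tools is a hypothetical helper name, not part of Apertium:

```shell
# check_tools: hypothetical helper that reports whether each binary is on PATH.
check_tools() {
  for tool in lt-proc apertium vislcg3 cg-comp cg-proc; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
    fi
  done
}

check_tools
```

Any line starting with "missing:" points at a package that still needs installing.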
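
Check out the language data:

$ svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-swe

Edit the file apertium-swe.swe.dix and uncomment the "guesser" section, then build from inside the checked-out directory:

$ cd apertium-swe
$ ./configure
$ make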

To morphologically analyse:

$ echo "Den här är en test." | apertium -d . swe-disam

The output will contain a lot of ambiguity. A simple CG (Constraint Grammar) file reduces it by removing the guessed analyses wherever the lexicon also provides an analysis:

    LIST Guess = guess ;

    SECTION

    REMOVE Guess ;

Save this to a file called "guesser.cg3" and run it as follows:

$ echo "Den här är en test." | apertium -d . swe-disam | vislcg3 --grammar guesser.cg3 
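
One way to create guesser.cg3 without opening an editor is a shell here-document. This is only a sketch: it writes the three-line grammar shown earlier into the current directory and assumes nothing else needs to go in the file.

```shell
# Write the grammar to guesser.cg3; the quoted 'EOF' stops the shell
# from expanding anything inside the here-document.
cat > guesser.cg3 <<'EOF'
LIST Guess = guess ;

SECTION

REMOVE Guess ;
EOF

# Show the result so it can be checked by eye.
cat guesser.cg3
```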

Other kinds of output format:

First compile the CG file:

$ cg-comp guesser.cg3 guesser.bin

Then you can apply the compiled grammar with cg-proc in a single pipeline:

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin 

A pipeline such as the following strips the tags with sed and then runs cg-proc a second time with -n:

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin  | sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin 

This gives lemmatised output in which each token is enclosed in ^ and $, and ambiguous stems/lemmas are separated by '/'.
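
To see what the tag-stripping sed step does on its own, it can be tried on a hand-written string. The sample below is made up for illustration in the Apertium stream format and is not real apertium-swe output. The `\+` in the pattern is a GNU sed extension, so GNU sed is assumed.

```shell
# Hand-written sample, NOT actual analyser output.
sample='^en/en<det><ind><ut><sg>/en<num>$ ^test/test<n><ut><sg><ind>$'

# Delete every <...> tag, leaving only surface forms and lemmas.
stripped=$(printf '%s\n' "$sample" | sed 's/<[^>]\+>//g')

printf '%s\n' "$stripped"
# prints: ^en/en/en$ ^test/test$
```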

To keep only the first lemma of each token and get unambiguous output, you can try:

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin  |\
   sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin  | sed 's/\/[^\$]\+\$/$/g'
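
The final sed expression can also be checked in isolation. The input below is a hand-written example of lemmatised output with '/'-separated alternatives, not something produced by apertium-swe; GNU sed is assumed because of `\+`.

```shell
# Hand-written sample, NOT actual pipeline output.
ambiguous='^vara/var$ ^en/ett$ ^test$'

# From the first '/' in a token up to its closing '$', replace with '$',
# which keeps only the first alternative of each token.
first_only=$(printf '%s\n' "$ambiguous" | sed 's/\/[^\$]\+\$/$/g')

printf '%s\n' "$first_only"
# prints: ^vara$ ^en$ ^test$
```

Tokens that contain no '/' are left unchanged, so already unambiguous tokens pass through as they are.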