Lemmatisation

Here is an example of how to lemmatise using apertium-swe.

First install:

git clone https://github.com/apertium/apertium-swe.git

Edit the file apertium-swe.swe.dix in the cloned directory and uncomment the "guesser" section, then build:

./configure 
make

To morphologically analyse:

$ echo "Den här är en test." | apertium -d . swe-disam

The output will contain a lot of ambiguity. A simple CG (Constraint Grammar) file reduces it by removing the guessed analyses wherever the lexicon also provides an analysis:

    LIST Guess = guess ;

    SECTION

    REMOVE Guess ;

You can save this to a file "guesser.cg3" and run it like this:

$ echo "Den här är en test." | apertium -d . swe-disam | vislcg3 --grammar guesser.cg3 

Other output formats:

First compile the CG file:

$ cg-comp guesser.cg3 guesser.bin

Then you can apply the compiled guesser grammar with cg-proc, which keeps the output on one line:

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin 

A pipeline like the following:

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin  | sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin 

gives lemmatised output in which each token is enclosed in ^ and $, and ambiguous stems/lemmas are separated by '/'.
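To make the output format concrete, here is a minimal sketch that splits such a line into one token per line and shows the alternative lemmas side by side. The helper name list_lemmas and the sample input are invented for illustration; they are not real apertium-swe output.

```shell
# list_lemmas: hypothetical helper, not part of Apertium.
# Reads lines of ^lemma$ or ^lemma1/lemma2$ tokens on stdin and
# prints one token per line, alternatives separated by " | ".
list_lemmas() {
  # grep -o prints each ^...$ token on its own line;
  # sed then drops the ^ and $ delimiters and spaces out the slashes.
  grep -o '\^[^$]*\$' | sed -e 's/^\^//' -e 's/\$$//' -e 's/\// | /g'
}

# Invented sample line (an assumption, not actual analyser output):
printf '%s\n' '^den/denna$ ^test$ ^.$' | list_lemmas
```

With the invented sample, the first line printed is "den | denna", followed by "test" and "." on their own lines. In practice you would pipe the real pipeline output into the function instead of the printf.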

If you want to pick just the first one and have unambiguous output, you can try:

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin  |\
   sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin  | sed 's/\/[^\$]\+\$/$/g'