Difference between revisions of "Lemmatisation"

Latest revision as of 02:24, 10 March 2018

This page shows how to lemmatise text using apertium-swe, the Apertium language package for Swedish.

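First, fetch and build the language package:

git clone https://github.com/apertium/apertium-swe.git

Change into the apertium-swe directory, edit the file apertium-swe.swe.dix and uncomment the "guesser" section, then build (in a fresh git checkout the configure script may first have to be generated with ./autogen.sh):

./configure
make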

To morphologically analyse:

$ echo "Den här är en test." | apertium -d . swe-disam

The output will contain a lot of ambiguity. A simple CG (Constraint Grammar) file reduces it by removing the guessed analyses wherever the lexicon also provides an analysis:

    LIST Guess = guess ;

    SECTION

    REMOVE Guess ;

Save this to a file named "guesser.cg3" and run it as follows:

$ echo "Den här är en test." | apertium -d . swe-disam | vislcg3 --grammar guesser.cg3
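
As a convenience, the grammar can also be written to disk straight from the shell. The sketch below uses a here-document and assumes the current directory is the apertium-swe checkout; the file name guesser.cg3 is the one used above.

```shell
# Write the three-line grammar shown above to guesser.cg3.
# The quoted 'EOF' delimiter stops the shell from expanding anything inside.
cat > guesser.cg3 <<'EOF'
LIST Guess = guess ;

SECTION

REMOVE Guess ;
EOF
```

Running this again simply overwrites the file, so it is safe to repeat after editing the rules.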

Other output formats:

First compile the CG file:

$ cg-comp guesser.cg3 guesser.bin

Then you can apply the compiled grammar to the tagger output in one line:

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin

A pipeline such as:

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin | sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin

gives lemmatised output in which each token is enclosed in ^ and $, and ambiguous stems/lemmas are separated by '/'.
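
To see what the sed call in that pipeline does, it can be tried on a single token. The token below is made up for illustration and is not actual apertium-swe output, and the function name strip_tags is hypothetical. Note that \+ is a GNU sed extension.

```shell
# Hypothetical helper: delete every <...> tag, as the sed call in the pipeline does.
strip_tags() {
    sed 's/<[^>]\+>//g'
}

# Made-up token in Apertium stream format: surface form, then two analyses.
printf '%s\n' '^test/test<n><sg>/testa<vblex><imp>$' | strip_tags
# prints: ^test/test/testa$
```

Only the tags are removed; the ^, / and $ delimiters are left in place, which is why the second cg-proc call can still read the stream.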

To keep only the first lemma of each token and get unambiguous output, you can try:

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin  |\
   sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin  | sed 's/\/[^\$]\+\$/$/g'
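
The final sed call can be checked in isolation on a made-up line of tokens. The input is not real apertium-swe output, and first_lemma is a hypothetical name used only for this sketch.

```shell
# Hypothetical helper: for each token, drop everything from the first '/'
# up to the closing '$', which keeps only the first lemma.
first_lemma() {
    sed 's/\/[^\$]\+\$/$/g'
}

# Made-up input: one ambiguous token and one unambiguous token.
printf '%s\n' '^test/testa$ ^en$' | first_lemma
# prints: ^test$ ^en$
```

Because the bracket expression excludes '$', a match never runs past the end of its own token, so tokens with three or more alternatives are also reduced to the first one.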

