Lemmatisation

Latest revision as of 02:24, 10 March 2018

Here is an example of how to lemmatise using apertium-swe.

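First install:

* lttoolbox
* apertium
* vislcg3

Then clone the language data:

$ git clone https://github.com/apertium/apertium-swe.git
$ cd apertium-swe

Edit the file apertium-swe.swe.dix and uncomment the "guesser" section, then build. In a fresh git checkout, ./autogen.sh is needed first to generate the configure script:

$ ./autogen.sh
$ ./configure
$ make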

To morphologically analyse:

$ echo "Den här är en test." | apertium -d . swe-disam

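There will be a lot of ambiguity, because the guesser adds analyses even for words that are already in the lexicon. A simple CG file will reduce this by removing the guessed analyses wherever there is also an analysis from the lexicon. CG never removes the last remaining reading of a word, so words that have only guessed analyses keep them: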

    LIST Guess = guess ;

    SECTION

    REMOVE Guess ;

You can save this to a file called "guesser.cg3" and run it like this:

$ echo "Den här är en test." | apertium -d . swe-disam | vislcg3 --grammar guesser.cg3
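One way to create the grammar file directly from the shell is a here-document. This is only a convenience sketch: the file content is the grammar shown above, and it is written to the current directory, which is assumed to be the apertium-swe checkout.

```shell
# Write the grammar from above to guesser.cg3 in the current directory.
# The quoted 'EOF' keeps the shell from expanding anything inside the block.
cat > guesser.cg3 <<'EOF'
LIST Guess = guess ;

SECTION

REMOVE Guess ;
EOF

# Print the file to confirm what was written.
cat guesser.cg3
```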

Other output formats:

First compile the CG file:

$ cg-comp guesser.cg3 guesser.bin

Then you can apply the compiled grammar with cg-proc, which reads and writes the Apertium stream format, in a single pipeline:

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin

A pipeline such as the following gives lemmatised output, in which each token is enclosed in ^ and $ and ambiguous lemmas are separated by '/':

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin | sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin
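To see what the tag-stripping sed step does in isolation, it can be tried on a single line in Apertium stream format. The sample line below is written by hand for illustration and is not actual apertium-swe output; the function name strip_tags is likewise just a label chosen here. The expression uses `[^>][^>]*`, a portable spelling of the GNU-specific `[^>]\+` used in the pipeline.

```shell
# Hand-written sample in Apertium stream format (illustrative only,
# not real apertium-swe output): ^surface/lemma<tag><tag>...$
sample='^en/en<det><ind><ut><sg>$ ^test/test<n><ut><sg><ind><nom>$'

# Delete every <...> tag, leaving only the text between ^ and $.
strip_tags() {
    sed 's/<[^>][^>]*>//g'
}

stripped=$(printf '%s\n' "$sample" | strip_tags)
printf '%s\n' "$stripped"
# prints: ^en/en$ ^test/test$
```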

If you want to pick just the first one and have unambiguous output, you can try:

$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin |\
   sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin | sed 's/\/[^\$]\+\$/$/g'
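The effect of the final sed step can be checked without running the whole pipeline. The sketch below applies a portable equivalent of that expression to a hand-written line of lemmatised output (the lemmas are made up for illustration, not taken from apertium-swe), then optionally strips the ^ and $ delimiters. The variable names are arbitrary, and the delimiter stripping assumes the lemmas themselves contain no ^ or $.

```shell
# Hand-written lemmatised output (illustrative only): ambiguous tokens
# list their candidate lemmas separated by '/'.
ambiguous='^den/denna$ ^vara$ ^en/ett$ ^test$'

# Keep only the first lemma: drop everything from the first '/' up to
# the closing '$' of each token, then put the '$' back.
first=$(printf '%s\n' "$ambiguous" | sed 's|/[^$][^$]*\$|$|g')
printf '%s\n' "$first"
# prints: ^den$ ^vara$ ^en$ ^test$

# Optionally remove the ^ and $ token delimiters to get plain lemmas.
plain=$(printf '%s\n' "$first" | sed 's/\^//g; s/\$//g')
printf '%s\n' "$plain"
# prints: den vara en test
```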