Lemmatisation
Revision as of 11:25, 29 January 2016 by Francis Tyers
Here is an example of how to lemmatise using apertium-swe.
Install:
- Lttoolbox
- Apertium
- VISLCG
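Then check out the Swedish language data and change into its directory:
$ svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-swe
$ cd apertium-swe
Edit the file apertium-swe.swe.dix and uncomment the "guesser" section, then build:
$ ./configure
$ make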
To morphologically analyse:
$ echo "Den här är en test." | apertium -d . swe-disam
There will be a lot of ambiguity. A simple CG file will reduce this by removing the guessed analyses where there is an analysis in the lexicon:
LIST Guess = guess ;
SECTION
REMOVE Guess ;
Save this to a file called "guesser.cg3" and run it like this:
$ echo "Den här är en test." | apertium -d . swe-disam | vislcg3 --grammar guesser.cg3
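As a minimal sketch, the grammar file can also be written straight from the shell with a here-document. The file name guesser.cg3 and the three grammar lines come from the text above; the here-document is only one convenient way to create the file, and it assumes the current directory is writable.

```shell
# Write the three-line constraint grammar to guesser.cg3.
# The quoted 'EOF' stops the shell from expanding anything inside the block.
cat > guesser.cg3 <<'EOF'
LIST Guess = guess ;
SECTION
REMOVE Guess ;
EOF

# Show what was written, to confirm the file looks as expected.
cat guesser.cg3
```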
Other output formats:
First compile the CG file:
$ cg-comp guesser.cg3 guesser.bin
Then you can apply the compiled grammar in a single pipeline:
$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin
A pipeline like this:
$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin | sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin
gives lemmatised output in which each token is enclosed in ^ and $, and ambiguous stems/lemmas are separated by '/'.
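To see what the tag-stripping sed step does on its own, here is a small sketch that runs it on one hand-written token. The sample string is hypothetical: it follows the Apertium stream format (^surface/lemma<tags>/lemma<tags>$), but the tag names are illustrative and not copied from apertium-swe. The `\+` in the expression assumes GNU sed.

```shell
# Hypothetical analysis in Apertium stream format; tag names are illustrative only.
tagged='^test/test<n><sg>/testa<vblex><imp>$'

# Same expression as in the pipeline: delete every <...> tag, keeping the lemmas.
stripped=$(printf '%s\n' "$tagged" | sed 's/<[^>]\+>//g')

printf '%s\n' "$stripped"
# prints: ^test/test/testa$
```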
If you want to pick just the first one and have unambiguous output, you can try:
$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin |\
  sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin | sed 's/\/[^\$]\+\$/$/g'
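The final sed expression can be tried without Apertium installed. The sketch below applies it to a hand-written line; the sample is hypothetical and only mimics the ^lemma/lemma$ shape described above, it is not real apertium-swe output. As before, `\+` assumes GNU sed.

```shell
# Hypothetical lemmatised line: two tokens are ambiguous, three are not.
sample='^den/denna$ ^här$ ^vara$ ^en$ ^test/testa$'

# Same expression as in the pipeline: drop everything from the first '/'
# up to the closing '$' of each token, leaving only the first lemma.
first_only=$(printf '%s\n' "$sample" | sed 's/\/[^\$]\+\$/$/g')

printf '%s\n' "$first_only"
# prints: ^den$ ^här$ ^vara$ ^en$ ^test$
```

Note that this keeps whichever lemma happens to come first, so it gives unambiguous output but not necessarily the correct lemma.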