Lemmatisation
Latest revision as of 02:24, 10 March 2018
Here is an example of how to lemmatise using apertium-swe.
First install:
* lttoolbox
* apertium
* vislcg3
Then clone apertium-swe:
git clone https://github.com/apertium/apertium-swe.git
Edit the file apertium-swe.swe.dix and uncomment the "guesser" section, then build:
./configure
make
To morphologically analyse:
$ echo "Den här är en test." | apertium -d . swe-disam
There will be a lot of ambiguity. A simple CG file will reduce this by removing the guessed analyses where there is an analysis in the lexicon:
LIST Guess = guess ;
SECTION
REMOVE Guess ;
You can save this to a file "guesser.cg3" and run it like this:
$ echo "Den här är en test." | apertium -d . swe-disam | vislcg3 --grammar guesser.cg3
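One way to create the grammar file directly from the shell is a here-document. This is only a convenience sketch; the file name guesser.cg3 and the rules are the ones given above, and nothing else is assumed:

```shell
# Write the grammar shown above to guesser.cg3.
# The quoted 'EOF' delimiter stops the shell from expanding anything inside.
cat > guesser.cg3 <<'EOF'
LIST Guess = guess ;
SECTION
REMOVE Guess ;
EOF
```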
Other output formats:
First compile the CG file:
$ cg-comp guesser.cg3 guesser.bin
Then you can apply the compiled grammar to the tagger output in a single pipeline:
$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin
To strip the tags as well, use something like:
$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin | sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin
This gives lemmatised output in which each token is enclosed in ^ and $, and ambiguous stems/lemmas are separated by '/'.
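To see what the first sed expression does in isolation, it can be run on a hand-written line. The sample line and its tags below are made up for illustration and are not real apertium-swe output, and strip_tags is a hypothetical helper name; the `\+` in the pattern assumes GNU sed.

```shell
# Hypothetical tagger-style line: analyses are separated by '/', tags are in <...>.
sample='^en<det><ind>/en<num>$ ^test<n><sg>$'

# Same expression as in the pipeline above: delete every <...> tag.
strip_tags() {
  sed 's/<[^>]\+>//g'
}

printf '%s\n' "$sample" | strip_tags
# prints: ^en/en$ ^test$
```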
If you want to pick just the first one and have unambiguous output, you can try:
$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin |\
  sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin | sed 's/\/[^\$]\+\$/$/g'
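The effect of the last sed expression can be checked on its own with a hand-written line. The sample below is hypothetical (the alternative lemma is invented for illustration), first_lemma is a made-up helper name, and the `\+` again assumes GNU sed.

```shell
# Hypothetical tag-free line with one ambiguous token.
sample='^den$ ^en/ett$ ^test$'

# Same expression as the last sed above: drop everything from the first '/'
# up to the token-closing '$', so only the first lemma remains.
first_lemma() {
  sed 's/\/[^\$]\+\$/$/g'
}

printf '%s\n' "$sample" | first_lemma
# prints: ^den$ ^en$ ^test$
```

Tokens without a '/' are left untouched, so unambiguous output passes through unchanged.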