Difference between revisions of "Lemmatisation"
Latest revision as of 02:24, 10 March 2018
Here is an example of how to lemmatise using apertium-swe.
First install:
git clone https://github.com/apertium/apertium-swe.git
Edit the file apertium-swe.swe.dix and uncomment the "guesser" section.
./configure
make
To morphologically analyse:
$ echo "Den här är en test." | apertium -d . swe-disam
There will be a lot of ambiguity. A simple Constraint Grammar (CG) file reduces it by removing the guessed analyses wherever the lexicon also provides an analysis:
LIST Guess = guess ;
SECTION
REMOVE Guess ;
Save this to a file named "guesser.cg3" and run it like this:
$ echo "Den här är en test." | apertium -d . swe-disam | vislcg3 --grammar guesser.cg3
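The effect of the REMOVE rule can be sketched without vislcg3 installed. The following is only an illustration in awk, not how CG is implemented: it assumes a simplified, hypothetical input with one token per line written as surface/reading/reading, and the function name remove_guesses and the sample readings are made up for the example. It relies on the fact that a CG REMOVE rule never deletes the last remaining reading, so a token that only has a guessed analysis keeps it.

```shell
# Sketch only: mimics "REMOVE Guess" on a simplified one-token-per-line format.
# remove_guesses is a hypothetical helper, not part of apertium or vislcg3.
remove_guesses() {
  awk '{
    n = split($0, r, "/")   # r[1] is the surface form, r[2..n] are the readings
    keep = 0                # becomes 1 if any reading is not a guess
    for (i = 2; i <= n; i++)
      if (r[i] !~ /<guess>/) keep = 1
    out = r[1]
    for (i = 2; i <= n; i++)
      # drop guessed readings only when a lexicon reading exists
      if (!keep || r[i] !~ /<guess>/) out = out "/" r[i]
    print out
  }'
}

# Hypothetical sample: one token with a lexicon analysis plus a guess,
# and one unknown token that only has a guess.
printf '%s\n' 'test/test<n><ut><sg><ind>/test<guess>' 'blorp/blorp<guess>' |
  remove_guesses
```

With this sample the first line loses its guessed reading and the second line is left unchanged, which is the behaviour the grammar above is meant to give.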
Other output formats:
First compile the CG file:
$ cg-comp guesser.cg3 guesser.bin
Then you can run the compiled guesser grammar in a single line:
$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin
For example:
$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin | sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin
This gives lemmatised output in which each token is enclosed in ^ and $, and ambiguous stems/lemmas are separated by '/'.
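To see what the first sed step in that pipeline does, it can be tried on its own. This is a small sketch assuming GNU sed (the \+ operator is a GNU extension); the sample token is hypothetical and not real apertium-swe output, and strip_tags is just a name chosen for the example.

```shell
# Hypothetical token in Apertium stream format: surface form followed by
# '/'-separated analyses. Not real apertium-swe output.
sample='^test/test<n><ut><sg><ind>/testa<vblex><imp>$'

# Same expression as in the pipeline: delete every <...> tag.
strip_tags() {
  sed 's/<[^>]\+>//g'
}

stripped=$(printf '%s\n' "$sample" | strip_tags)
printf '%s\n' "$stripped"   # prints ^test/test/testa$
```

Once the tags are gone, only the surface form and the candidate lemmas are left between ^ and $.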
If you want to pick just the first one and have unambiguous output, you can try:
$ echo "Den här är en test." | apertium -d . swe-tagger | cg-proc guesser.bin |\
  sed 's/<[^>]\+>//g' | cg-proc -n guesser.bin | sed 's/\/[^\$]\+\$/$/g'
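The final sed expression can also be checked in isolation. The sketch below assumes GNU sed and uses a made-up line of ambiguous lemmas rather than real output; first_lemma is a hypothetical helper name.

```shell
# Hypothetical lemmatised line with ambiguous tokens (not real output).
ambiguous='^den/en$ ^vara$ ^en/ett$ ^test/testa$'

# Same expression as the last step of the pipeline: inside each token,
# delete everything from the first '/' up to the closing '$'.
first_lemma() {
  sed 's/\/[^\$]\+\$/$/g'
}

unambiguous=$(printf '%s\n' "$ambiguous" | first_lemma)
printf '%s\n' "$unambiguous"   # prints ^den$ ^vara$ ^en$ ^test$
```

Because the bracket expression excludes '$', a match cannot run past the end of a token, so each token is reduced independently and tokens without a '/' are left as they are.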