Apertium has moved from SourceForge to GitHub.
If you have any questions, please come and talk to us on
If you have any questions, please come and talk to us on
#apertium
on irc.freenode.net
or contact the GitHub migration team.Comparison of part-of-speech tagging systems
From Apertium
Revision as of 23:45, 4 January 2016 by Francis Tyers (Talk | contribs)
|
Apertium would like to have really good part-of-speech tagging, but in many cases falls below the state-of-the-art (around 97% tagging accuracy). This page intends to collect a comparison of tagging systems in Apertium and give some ideas of what could be done to improve them.
In the following table, the intervals represent the [low, high] values from 10-fold cross validation.
Language | Corpus | System | |||||||
---|---|---|---|---|---|---|---|---|---|
Sent | Tok | Amb | 1st | CG+1st | Unigram | CG+Unigram | apertium-tagger | CG+apertium-tagger | |
Catalan | 1,413 | 24,144 | ? | 81.85 | 83.96 | [75.65, 78.46] | [87.76, 90.48] | [94.16, 96.28] | [93.92, 96.16] |
Spanish | 1,271 | 21,247 | ? | 86.18 | 86.71 | [78.20, 80.06] | [87.72, 90.27] | [90.15, 94.86] | [91.84, 93.70] |
Serbo-Croatian | 1,190 | 20,128 | ? | 75.22 | 79.67 | [75.36, 78.79] | [75.36, 77.28] | ||
Russian | 451 | 10,171 | ? | 75.63 | 79.52 | [70.49, 72.94] | [74.68, 78.65] | n/a | n/a |
Kazakh | 403 | 4,348 | ? | 80.25 | 86.13 | [83.55, 86.19] | [83.33, 86.61] | n/a | n/a |
Portuguese | 119 | 3,823 | ? | 72.51 | 82.95 | [76.88, 92.42] | [79.90, 91.30] |
Sent = sentences, Tok = tokens, Amb = average ambiguity from the morphological analyser
Systems
-
1st
: Selects the first analysis from the morphological analyser -
CG
: Uses the CG (from the monolingual language package in languages) to preprocess the input. -
Unigram
: Lexicalised unigram tagger -
apertium-tagger
: Uses the bigram HMM tagger included with Apertium.
Corpora
The tagged corpora used in the experiments are found in the monolingual packages in languages, under the texts/
subdirectory.
Todo
- Implement this tagger: https://spacy.io/blog/part-of-speech-POS-tagger-in-python