Comparison of part-of-speech tagging systems
Apertium aims for high-quality part-of-speech tagging, but in many cases its accuracy falls below the state of the art (around 97% tagging accuracy). This page collects a comparison of the tagging systems in Apertium and gives some ideas for improving them.
In the following table, each interval gives the [low, high] accuracy values obtained from 10-fold cross-validation.
|Catalan||1,413||24,144||81.85||83.96||[75.65, 78.46]||[87.76, 90.48]||[94.16, 96.28]||[93.92, 96.16]|
|Spanish||1,271||21,247||86.18||86.71||[78.20, 80.06]||[87.72, 90.27]||[90.15, 94.86]||[91.84, 93.70]|
|Serbo-Croatian||1,190||20,128||75.22||79.67||[75.36, 78.79]||[75.36, 77.28]|
|Kazakh||403||4,348||80.25||86.13||[83.55, 86.19]||[83.33, 86.61]||n/a||n/a|
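The [low, high] intervals in the table can be computed along the following lines. This is only a minimal sketch: the page does not say how the folds were drawn, so the round-robin split is an assumption, and `evaluate` is a hypothetical callback (not part of Apertium) that trains a tagger on the training folds and returns its accuracy on the held-out fold.

```python
from typing import Callable, Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int = 10) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs; fold i holds out every k-th item starting at i.

    The round-robin assignment is an assumption made for this sketch.
    """
    for i in range(k):
        test = [x for j, x in enumerate(items) if j % k == i]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def accuracy_interval(
    items: Sequence[T],
    evaluate: Callable[[List[T], List[T]], float],
    k: int = 10,
) -> Tuple[float, float]:
    """Return the (low, high) accuracy over k folds.

    `evaluate(train, test)` is a hypothetical callback: it should train a
    tagger on `train` and return its accuracy on `test`.
    """
    scores = [evaluate(train, test) for train, test in k_fold_splits(items, k)]
    return min(scores), max(scores)
```

Here `items` would be the tagged sentences of a corpus, so that each sentence is held out exactly once across the ten folds.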
1st: Selects the first analysis returned by the morphological analyser.
CG: Uses the Constraint Grammar (CG) from the monolingual language package in languages to preprocess the input.
Unigram: Lexicalised unigram tagger.
apertium-tagger: Uses the bigram HMM tagger included with Apertium.
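To make the two simplest systems concrete, here is a rough sketch of a "1st" baseline and a lexicalised unigram tagger. It is not the Apertium implementation: the data layout (a surface form paired with its list of candidate analyses), the lowercasing of forms, the back-off to overall tag frequency, and all names (`tag_first`, `UnigramTagger`, `accuracy`) are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

# Assumed layout: a token is (surface form, candidate analyses), with the
# analyses listed in the order the morphological analyser returned them.
Token = Tuple[str, List[str]]


def tag_first(sentence: Sequence[Token]) -> List[str]:
    """'1st' baseline: pick the first analysis for every token."""
    return [analyses[0] for _form, analyses in sentence]


class UnigramTagger:
    """Lexicalised unigram tagger: picks the analysis seen most often with a form."""

    def __init__(self) -> None:
        self.form_tag: Counter = Counter()  # counts of (form, tag) pairs
        self.tag: Counter = Counter()       # overall tag counts, used as back-off

    def train(self, tagged: Iterable[Tuple[str, str]]) -> None:
        """Count (form, gold tag) pairs from a hand-tagged corpus."""
        for form, tag in tagged:
            self.form_tag[(form.lower(), tag)] += 1
            self.tag[tag] += 1

    def tag_sentence(self, sentence: Sequence[Token]) -> List[str]:
        """Choose among the analyser's candidates only, never an unseen tag."""
        output = []
        for form, analyses in sentence:
            key = form.lower()
            # Rank by (form, tag) count, then by overall tag count. max()
            # keeps the first maximal item, so full ties fall back to '1st'.
            best = max(analyses, key=lambda a: (self.form_tag[(key, a)], self.tag[a]))
            output.append(best)
        return output


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of tokens whose predicted tag matches the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)
```

Restricting the choice to the analyser's candidates means the tagger only disambiguates and never invents an analysis, which matches the role the taggers play in the table above.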
The tagged corpora used in the experiments are found in the monolingual packages in languages, under the
- Implement this tagger: https://spacy.io/blog/part-of-speech-POS-tagger-in-python