Comparison of part-of-speech tagging systems
Apertium aims for high-quality part-of-speech tagging, but in many cases its taggers fall below the state of the art (around 97% tagging accuracy). This page collects a comparison of the tagging systems available in Apertium and gives some ideas of what could be done to improve them.
In the following table, the two values a, b represent tagger recall (= [true positives]/[total tokens]) and availability-adjusted tagger recall (= [true positives]/[words with a correct analysis from the morphological analyser]) respectively. Values of the form x±y are the sample mean and standard deviation of the results of 10-fold cross-validation.
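As a quick illustration of the two metrics, here is a small sketch with hypothetical counts (the numbers below are made up for the example, not taken from the table):

```python
# Hypothetical evaluation counts for one tagger on one corpus.
true_positives = 9000   # tokens the tagger labelled correctly
total_tokens = 10000    # all tokens in the test corpus
analysable = 9500       # tokens for which the morphological analyser
                        # produced the correct analysis at all

# Tagger recall: correct tags over all tokens.
tagger_recall = true_positives / total_tokens

# Availability-adjusted recall: correct tags over only those tokens
# where the analyser made the correct tag available to choose.
adjusted_recall = true_positives / analysable

print(f"recall = {tagger_recall:.4f}")             # 0.9000
print(f"adjusted recall = {adjusted_recall:.4f}")  # 0.9474
```

The adjusted figure is always at least as high as plain recall, since the tagger cannot pick a correct analysis the analyser never produced.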
System | Catalan | Spanish | Serbo-Croatian | Russian | Kazakh | Portuguese | Swedish
---|---|---|---|---|---|---|---
Tokens | 23,673 | 20,487 | 20,128 | 10,171 | 4,348 | 6,725 | 239
1st | 86.50, 87.86 | 90.29, 91.77 | 75.22 | 75.63 | 80.79 | 76.70, 83.00 | 72.90 |
CG→1st | 88.05, 89.44 | 91.06, 92.55 | 79.67 | 79.52 | 86.19 | 87.99, 95.22 | 73.86 |
Unigram model 1 | 93.86±1.13, 95.33±1.05 | 93.93±0.97, 95.48±0.83 | | | | 86.00±6.63, 94.74±2.44 |
CG→Unigram model 1 | 94.34±1.11, 95.82±1.06 | 94.71±0.87, 96.27±0.67 | | | | 88.42±6.55, 97.42±1.76 |
Unigram model 2 | 93.90±1.09, 95.37±1.04 | 93.66±0.92, 95.20±0.75 | | | | 87.13±6.23, 96.01±1.70 |
CG→Unigram model 2 | 94.11±1.09, 95.58±1.07 | 94.30±0.81, 95.86±0.59 | | | | 88.64±6.13, 97.70±1.32 |
Unigram model 3 | 93.88±1.08, 95.35±1.03 | 93.64±0.93, 95.19±0.77 | | | | 87.11±6.13, 95.99±1.52 |
CG→Unigram model 3 | 94.09±1.08, 95.56±1.05 | 94.28±0.80, 95.83±0.60 | | | | 88.63±6.07, 97.69±1.28 |
Bigram (unsup, 0 iters) | 88.96±1.13, 90.36±1.18 | 88.42±1.49, 89.88±1.40 | | | | 81.41±5.78, 89.72±2.06 |
Bigram (unsup, 50 iters) | 91.74±1.15, 93.17±1.21 | 91.08±1.51, 92.58±1.39 | | | | 81.09±5.99, 89.35±1.99 |
Bigram (unsup, 250 iters) | 91.52±1.15, 92.96±1.21 | 90.81±1.51, 92.31±1.36 | | | | 80.31±6.60, 88.45±2.51 |
CG→Bigram (unsup, 0 iters) | 91.84±1.04, 93.27±1.11 | 91.33±1.38, 92.84±1.26 | | | | 86.77±6.33, 95.61±1.77 |
CG→Bigram (unsup, 50 iters) | 93.16±1.39, 94.62±1.49 | 92.45±1.28, 93.97±1.13 | | | | 87.48±6.16, 96.41±1.38 |
CG→Bigram (unsup, 250 iters) | 92.99±1.38, 94.45±1.48 | 92.45±1.28, 93.97±1.13 | | | | 87.20±6.72, 96.06±2.05 |
Bigram (sup) | 96.00±0.87, 97.50±0.93 | 95.42±1.06, 96.99±0.86 | | | | 88.07±6.50, 97.03±1.75 |
CG→Bigram (sup) | 96.00±1.13, 97.51±1.21 | 94.83±1.16, 96.40±0.91 | | | | 88.73±6.36, 97.78±1.52 |
Lwsw (0 iters) | 92.73±0.89, 94.18±0.91 | 92.78±0.94, 94.31±0.77 | | | | 83.01±5.47, 91.51±1.22 |
Lwsw (50 iters) | 92.99±0.85, 94.44±0.81 | 92.93±1.02, 94.46±0.84 | | | | 82.70±5.76, 91.14±1.62 |
Lwsw (250 iters) | 92.99±0.84, 94.44±0.79 | 92.98±1.02, 94.51±0.86 | | | | 82.75±5.79, 91.20±1.64 |
CG→Lwsw (0 iters) | 93.17±1.08, 94.63±1.08 | 92.64±1.07, 94.17±0.90 | | | | 86.60±6.20, 95.43±1.52 |
CG→Lwsw (50 iters) | 93.38±1.03, 94.84±1.01 | 92.66±1.16, 94.19±0.97 | | | | 86.54±6.21, 95.36±1.54 |
CG→Lwsw (250 iters) | 93.38±1.05, 94.84±1.03 | 92.69±1.17, 94.22±1.00 | | | | 86.54±6.20, 95.36±1.53 |
kaz-tagger | | | | | | |
CG→kaz-tagger | | | | | | |
In the following table, the intervals represent the [low, high] values from 10-fold cross-validation.
Language | Sent | Tok | Amb | 1st | CG+1st | Unigram | CG+Unigram | apertium-tagger | CG+apertium-tagger
---|---|---|---|---|---|---|---|---|---
Catalan | 1,413 | 24,144 | ? | 81.85 | 83.96 | [75.65, 78.46] | [87.76, 90.48] | [94.16, 96.28] | [93.92, 96.16]
Spanish | 1,271 | 21,247 | ? | 86.18 | 86.71 | [78.20, 80.06] | [87.72, 90.27] | [90.15, 94.86] | [91.84, 93.70]
Serbo-Croatian | 1,190 | 20,128 | ? | 75.22 | 79.67 | [75.36, 78.79] | [75.36, 77.28] | |
Russian | 451 | 10,171 | ? | 75.63 | 79.52 | [70.49, 72.94] | [74.68, 78.65] | n/a | n/a
Kazakh | 403 | 4,348 | ? | 80.79 | 86.19 | [84.36, 87.79] | [85.56, 88.72] | n/a | n/a
Portuguese | 119 | 3,823 | ? | 72.54 | 87.34 | [77.10, 87.72] | [84.05, 91.96] | |
Swedish | 11 | 239 | ? | 72.90 | 73.86 | [56.00, 82.97] | | |
Sent = sentences, Tok = tokens, Amb = average ambiguity from the morphological analyser
Systems
1st
: Selects the first analysis from the morphological analyser.

CG
: Uses the constraint grammar (from the monolingual language package in languages) to preprocess the input.

Unigram
: Lexicalised unigram tagger.

apertium-tagger
: Uses the bigram HMM tagger included with Apertium.
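To make the "Unigram" entry concrete, a lexicalised unigram tagger simply picks, for each surface form, the tag it was most often seen with in training, with some fallback for unseen words. A minimal sketch (the toy corpus and the `noun` fallback are illustrative assumptions, not Apertium's actual model):

```python
from collections import Counter, defaultdict

def train_unigram(tagged_corpus):
    """Build word -> most-frequent-tag from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, words, fallback="noun"):
    """Tag each word with its most frequent training tag,
    falling back to a default tag for unseen words."""
    return [(w, model.get(w, fallback)) for w in words]

# Toy training data: "cat" is seen as noun twice, verb once.
corpus = [("the", "det"), ("cat", "noun"), ("the", "det"),
          ("runs", "verb"), ("cat", "noun"), ("cat", "verb")]
model = train_unigram(corpus)
print(tag(model, ["the", "cat", "dog"]))
# [('the', 'det'), ('cat', 'noun'), ('dog', 'noun')]
```

A real lexicalised tagger would back off to tag frequencies over the analyser's proposed analyses rather than a fixed fallback, but the core table lookup is the same.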
Corpora
The tagged corpora used in the experiments are found in the monolingual packages in languages, under the texts/ subdirectory.
Todo
- Implement the tagger described at https://spacy.io/blog/part-of-speech-POS-tagger-in-python (an averaged perceptron).
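The linked post describes an averaged-perceptron tagger. As a rough sketch of the idea (the feature set, toy data, and tag inventory below are illustrative assumptions, not the post's or Apertium's exact setup):

```python
from collections import defaultdict

class AveragedPerceptron:
    """Multiclass perceptron whose final weights are averaged
    over all update steps, which stabilises the model."""
    def __init__(self):
        self.weights = defaultdict(lambda: defaultdict(float))
        self.totals = defaultdict(lambda: defaultdict(float))
        self.tstamps = defaultdict(lambda: defaultdict(int))
        self.i = 0  # update counter

    def predict(self, feats, classes):
        scores = {c: 0.0 for c in classes}
        for f in feats:
            for c, w in self.weights[f].items():
                scores[c] += w
        # Deterministic tie-break on the class name.
        return max(classes, key=lambda c: (scores[c], c))

    def update(self, truth, guess, feats):
        self.i += 1
        if truth == guess:
            return
        for f in feats:
            for c, d in ((truth, 1.0), (guess, -1.0)):
                # Lazily accumulate weight * time before changing it.
                self.totals[f][c] += (self.i - self.tstamps[f][c]) * self.weights[f][c]
                self.tstamps[f][c] = self.i
                self.weights[f][c] += d

    def average(self):
        for f, ws in self.weights.items():
            for c in ws:
                self.totals[f][c] += (self.i - self.tstamps[f][c]) * ws[c]
                ws[c] = self.totals[f][c] / self.i

def features(word, prev_tag):
    # A deliberately tiny feature set: surface form, suffix, previous tag.
    return [f"w={word}", f"suf={word[-2:]}", f"prev={prev_tag}"]

# Toy training data and tag set.
train = [[("the", "det"), ("cat", "noun"), ("sleeps", "verb")],
         [("the", "det"), ("dog", "noun"), ("runs", "verb")]]
classes = {"det", "noun", "verb"}

model = AveragedPerceptron()
for _ in range(5):                      # a few epochs over the toy data
    for sent in train:
        prev = "<s>"
        for word, gold in sent:
            feats = features(word, prev)
            model.update(gold, model.predict(feats, classes), feats)
            prev = gold                 # use gold history while training
model.average()

# Greedy left-to-right tagging with predicted history.
prev, out = "<s>", []
for word in ["the", "cat", "runs"]:
    t = model.predict(features(word, prev), classes)
    out.append(t)
    prev = t
print(out)
```

The real tagger in the post adds richer features and shuffling between epochs; the averaging trick and the greedy decode shown here are the essential parts.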