Comparison of part-of-speech tagging systems
Apertium aims for high-quality part-of-speech tagging, but in many cases its taggers fall below the state of the art (around 97% tagging accuracy). This page collects a comparison of the tagging systems in Apertium and gives some ideas of what could be done to improve them.
In the following two tables, values of the form x±y are the sample mean and standard deviation of the results of 10-fold cross-validation.
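The x±y summary statistics can be reproduced from per-fold results. A minimal sketch, assuming percentage accuracies (the fold values below are invented, not taken from the experiments):

```python
import statistics

def summarise_folds(fold_accuracies):
    """Return (sample mean, sample standard deviation) over CV folds."""
    mean = statistics.mean(fold_accuracies)
    # statistics.stdev divides by n-1, i.e. the sample standard deviation
    std = statistics.stdev(fold_accuracies)
    return mean, std

# Hypothetical per-fold tagger accuracies from one 10-fold run
folds = [93.1, 94.0, 92.8, 93.5, 94.2, 93.0, 93.7, 92.9, 94.1, 93.4]
mean, std = summarise_folds(folds)
print(f"{mean:.2f}±{std:.2f}")
```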
In the following table the values represent tagger recall (= [true positives]/[total tokens]):
| System | Catalan | Spanish | Serbo-Croatian | Russian | Kazakh | Portuguese | Swedish |
|---|---|---|---|---|---|---|---|
| Tokens | 23,673 | 20,487 | 20,128 | 1,052 | 13,714 | 6,725 | 369 |
| 1st | 86.50 | 90.34 | 38.19 | 72.08 | 76.70 | 34.70 | |
| Bigram (unsup, 0 iters) | 88.96±1.12 | 88.49±1.54 | | | 81.41±5.78 | | |
| Bigram (unsup, 50 iters) | 91.74±1.15 | 91.13±1.52 | | | 81.09±5.99 | | |
| Bigram (unsup, 250 iters) | 91.51±1.16 | 90.85±1.48 | | | 80.31±6.60 | | |
| Lwsw (0 iters) | 92.73±0.89 | 92.86±0.95 | | | 83.01±5.47 | | |
| Lwsw (50 iters) | 92.98±0.85 | 93.01±1.02 | | | 82.70±5.76 | | |
| Lwsw (250 iters) | 92.99±0.84 | 93.06±1.02 | | | 82.75±5.79 | | |
| CG→1st | 88.05 | 91.10 | 39.81 | 81.56 | 87.99 | 42.90 | |
| CG→Bigram (unsup, 0 iters) | 91.83±1.03 | 91.39±1.42 | | | 86.77±6.33 | | |
| CG→Bigram (unsup, 50 iters) | 93.16±1.39 | 92.53±1.29 | | | 87.48±6.16 | | |
| CG→Bigram (unsup, 250 iters) | 92.99±1.38 | 92.50±1.23 | | | 87.20±6.72 | | |
| CG→Lwsw (0 iters) | 93.17±1.08 | 92.72±1.09 | | | 86.60±6.20 | | |
| CG→Lwsw (50 iters) | 93.37±1.02 | 92.74±1.16 | | | 86.54±6.21 | | |
| CG→Lwsw (250 iters) | 93.38±1.05 | 92.77±1.18 | | | 86.54±6.20 | | |
| Unigram model 1 | 93.86±1.13 | 93.96±0.98 | 39.11±8.91 | 80.63±3.87 | 86.00±6.63 | 46.48±5.78 | |
| Unigram model 2 | 93.90±1.09 | 93.69±0.94 | 40.36±8.59 | 82.19±3.70 | 87.13±6.23 | 47.12±8.29 | |
| Unigram model 3 | 93.88±1.08 | 93.67±0.94 | 40.36±8.59 | 82.45±3.80 | 87.11±6.13 | 47.12±8.29 | |
| Bigram (sup) | 96.00±0.87 | 95.47±1.07 | | | 88.07±6.50 | | |
| CG→Unigram model 1 | 94.34±1.11 | 94.73±0.88 | 40.71±9.39 | 84.54±3.29 | 88.42±6.55 | 46.84±5.48 | |
| CG→Unigram model 2 | 94.11±1.09 | 94.33±0.82 | 41.43±9.21 | 84.62±3.47 | 88.64±6.13 | 47.07±7.39 | |
| CG→Unigram model 3 | 94.09±1.08 | 94.31±0.81 | 41.43±9.21 | 84.71±3.54 | 88.63±6.07 | 47.07±7.39 | |
| CG→Bigram (sup) | 96.00±1.13 | 94.88±1.18 | | | 88.73±6.36 | | |
In the following table the values represent availability-adjusted tagger recall (= [true positives]/[words with a correct analysis from the morphological analyser]). This data is also available in box plot form.
| System | Catalan | Spanish | Serbo-Croatian | Russian | Kazakh | Portuguese | Swedish |
|---|---|---|---|---|---|---|---|
| Tokens | 23,673 | 20,487 | 20,128 | 1,052 | 13,714 | 6,725 | 369 |
| 1st | 87.86 | 91.82 | 75.93 | 77.72 | 83.00 | 64.47 | |
| Bigram (unsup, 0 iters) | 90.35±1.17 | 89.95±1.45 | | | 89.72±2.06 | | |
| Bigram (unsup, 50 iters) | 93.17±1.21 | 92.63±1.40 | | | 89.35±1.99 | | |
| Bigram (unsup, 250 iters) | 92.94±1.22 | 92.35±1.33 | | | 88.45±2.51 | | |
| Lwsw (0 iters) | 94.18±0.91 | 94.40±0.77 | | | 91.51±1.22 | | |
| Lwsw (50 iters) | 94.44±0.81 | 94.54±0.83 | | | 91.14±1.62 | | |
| Lwsw (250 iters) | 94.44±0.79 | 94.60±0.84 | | | 91.20±1.64 | | |
| CG→1st | 89.44 | 92.60 | 79.10 | 87.95 | 95.22 | 79.70 | |
| CG→Bigram (unsup, 0 iters) | 93.27±1.10 | 92.90±1.30 | | | 95.61±1.77 | | |
| CG→Bigram (unsup, 50 iters) | 94.62±1.49 | 94.05±1.13 | | | 96.41±1.38 | | |
| CG→Bigram (unsup, 250 iters) | 94.45±1.48 | 94.03±1.09 | | | 96.06±2.05 | | |
| CG→Lwsw (0 iters) | 94.63±1.08 | 94.25±0.91 | | | 95.43±1.52 | | |
| CG→Lwsw (50 iters) | 94.83±1.01 | 94.27±0.97 | | | 95.36±1.54 | | |
| CG→Lwsw (250 iters) | 94.84±1.03 | 94.30±0.99 | | | 95.36±1.53 | | |
| Unigram model 1 | 95.33±1.05 | 95.51±0.84 | 77.54±6.51 | 87.03±3.03 | 94.74±2.44 | 89.26±7.32 | |
| Unigram model 2 | 95.37±1.04 | 95.23±0.77 | 80.06±6.11 | 88.72±2.76 | 96.01±1.70 | 89.82±7.70 | |
| Unigram model 3 | 95.35±1.03 | 95.22±0.79 | 80.06±6.11 | 88.99±2.83 | 95.99±1.52 | 89.82±7.70 | |
| Bigram (sup) | 97.50±0.93 | 97.04±0.86 | | | 97.03±1.75 | | |
| CG→Unigram model 1 | 95.82±1.06 | 96.30±0.68 | 80.56±6.70 | 91.25±2.01 | 97.42±1.76 | 90.00±6.99 | |
| CG→Unigram model 2 | 95.58±1.07 | 95.89±0.59 | 82.06±6.50 | 91.33±2.15 | 97.70±1.32 | 89.97±7.50 | |
| CG→Unigram model 3 | 95.56±1.05 | 95.86±0.60 | 82.06±6.50 | 91.43±2.26 | 97.69±1.28 | 89.97±7.50 | |
| CG→Bigram (sup) | 97.51±1.21 | 96.45±0.93 | | | 97.78±1.52 | |
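The two recall measures can be computed from per-token results. A minimal sketch, assuming each token record carries its gold tag, the tagger's prediction, and a flag saying whether the analyser offered the correct analysis at all (the record format is hypothetical, not the actual evaluation script):

```python
def recall(tokens):
    """Plain recall: true positives over all tokens."""
    correct = sum(1 for t in tokens if t["predicted"] == t["gold"])
    return correct / len(tokens)

def availability_adjusted_recall(tokens):
    """True positives over only those tokens for which the
    morphological analyser produced the correct analysis."""
    available = [t for t in tokens if t["analyser_has_gold"]]
    correct = sum(1 for t in available if t["predicted"] == t["gold"])
    return correct / len(available)

# Tiny made-up example: four tokens, one of which the analyser
# cannot get right no matter what the tagger picks
tokens = [
    {"gold": "n",   "predicted": "n",   "analyser_has_gold": True},
    {"gold": "v",   "predicted": "n",   "analyser_has_gold": True},
    {"gold": "adj", "predicted": "adj", "analyser_has_gold": True},
    {"gold": "adv", "predicted": "n",   "analyser_has_gold": False},
]
print(recall(tokens))                        # 2/4
print(availability_adjusted_recall(tokens))  # 2/3
```

This is why the availability-adjusted numbers are uniformly higher: tokens the analyser cannot get right are excluded from the denominator.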
In the following table, the intervals represent the [low, high] values from 10-fold cross validation.
| Language | Sent | Tok | Amb | 1st | CG+1st | Unigram | CG+Unigram | apertium-tagger | CG+apertium-tagger |
|---|---|---|---|---|---|---|---|---|---|
| Catalan | 1,413 | 24,144 | ? | 81.85 | 83.96 | [75.65, 78.46] | [87.76, 90.48] | [94.16, 96.28] | [93.92, 96.16] |
| Spanish | 1,271 | 21,247 | ? | 86.18 | 86.71 | [78.20, 80.06] | [87.72, 90.27] | [90.15, 94.86] | [91.84, 93.70] |
| Serbo-Croatian | 1,190 | 20,128 | ? | 75.22 | 79.67 | [75.36, 78.79] | [75.36, 77.28] | | |
| Russian | 451 | 10,171 | ? | 75.63 | 79.52 | [70.49, 72.94] | [74.68, 78.65] | n/a | n/a |
| Kazakh | 403 | 4,348 | ? | 80.79 | 86.19 | [84.36, 87.79] | [85.56, 88.72] | n/a | n/a |
| Portuguese | 119 | 3,823 | ? | 72.54 | 87.34 | [77.10, 87.72] | [84.05, 91.96] | | |
| Swedish | 11 | 239 | ? | 72.90 | 73.86 | [56.00, 82.97] | | | |
Sent = sentences, Tok = tokens, Amb = average ambiguity from the morphological analyser
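The "?" entries in the Amb column could be filled in by counting analyses per token in the analyser output. A minimal sketch over a hypothetical list of per-token analysis lists (the analysis strings are illustrative):

```python
def average_ambiguity(analyses_per_token):
    """Mean number of morphological analyses per token."""
    return sum(len(a) for a in analyses_per_token) / len(analyses_per_token)

# Hypothetical analyser output for four tokens
tokens = [
    ["la<det><def><f><sg>", "la<prn><pro><p3><f><sg>"],
    ["casa<n><f><sg>", "casar<vblex><pri><p3><sg>"],
    ["vella<adj><f><sg>"],
    [".<sent>"],
]
print(average_ambiguity(tokens))  # (2+2+1+1)/4 = 1.5
```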
Systems
1st
: Selects the first analysis from the morphological analyser.

CG
: Uses the Constraint Grammar (from the monolingual language package in languages) to preprocess the input.

Unigram
: Lexicalised unigram tagger.

apertium-tagger
: Uses the bigram HMM tagger included with Apertium.
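As a rough illustration of the Unigram system (the three numbered models differ in how they smooth and handle unseen words; this sketch only shows the basic lexicalised idea, with a made-up backoff to the overall most frequent tag):

```python
from collections import Counter, defaultdict

class UnigramTagger:
    """Pick the tag seen most often with each wordform in training,
    backing off to the overall most frequent tag for unseen words."""

    def train(self, tagged_tokens):
        by_word = defaultdict(Counter)
        all_tags = Counter()
        for word, tag in tagged_tokens:
            by_word[word][tag] += 1
            all_tags[tag] += 1
        self.best = {w: c.most_common(1)[0][0] for w, c in by_word.items()}
        self.default = all_tags.most_common(1)[0][0]

    def tag(self, word):
        return self.best.get(word, self.default)

t = UnigramTagger()
t.train([("la", "det"), ("casa", "n"), ("la", "det"),
         ("la", "pron"), ("vella", "adj")])
print(t.tag("la"))    # most frequent tag seen with "la"
print(t.tag("nova"))  # unseen word: overall most frequent tag
```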
Corpora
The tagged corpora used in the experiments are found in the monolingual packages in languages, under the texts/ subdirectory.
Todo
- Implement this tagger: https://spacy.io/blog/part-of-speech-POS-tagger-in-python
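The linked post describes an averaged perceptron tagger. A minimal sketch of the core idea, much simplified relative to the post (tiny feature set, no weight averaging; the feature names and training data are invented):

```python
from collections import defaultdict

class PerceptronTagger:
    """Tiny multiclass perceptron: score tags by summed feature weights,
    bump weights toward the gold tag on each mistake."""

    def __init__(self, tags):
        self.tags = tags
        self.weights = defaultdict(lambda: defaultdict(float))

    def features(self, words, i):
        # Very small feature set: the word, its suffix, the previous word
        prev = words[i - 1] if i > 0 else "<s>"
        return [f"w={words[i]}", f"suf={words[i][-2:]}", f"prev={prev}"]

    def predict(self, feats):
        scores = {t: sum(self.weights[f][t] for f in feats) for t in self.tags}
        return max(self.tags, key=lambda t: scores[t])

    def train(self, sentences, epochs=5):
        for _ in range(epochs):
            for words, tags in sentences:
                for i, gold in enumerate(tags):
                    feats = self.features(words, i)
                    guess = self.predict(feats)
                    if guess != gold:
                        for f in feats:
                            self.weights[f][gold] += 1.0
                            self.weights[f][guess] -= 1.0

tagger = PerceptronTagger(["det", "n", "v"])
train_data = [
    (["the", "cat", "sleeps"], ["det", "n", "v"]),
    (["the", "dog", "runs"], ["det", "n", "v"]),
]
tagger.train(train_data)
print(tagger.predict(tagger.features(["the", "cat", "sleeps"], 1)))
```

The full version in the post adds richer features, weight averaging over updates, and greedy left-to-right decoding using the previous predicted tags as features.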