Comparison of part-of-speech tagging systems


Apertium aims for high-quality part-of-speech tagging, but in many cases its taggers fall below the state of the art (around 97% tagging accuracy). This page collects a comparison of the tagging systems in Apertium and is meant to suggest what could be done to improve them.

The scripts that generate these results are written in Python and are available from SVN under /branches/apertium-tagger/experiments/: https://svn.code.sf.net/p/apertium/svn/branches/apertium-tagger/experiments/

In the following two tables, values of the form x±y are the sample mean and standard deviation of the results of 10-fold cross validation.
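
As a minimal sketch of how an x±y entry can be produced from per-fold scores, the snippet below computes the mean and the sample (n-1) standard deviation. The function name `summarise` and the fold values are hypothetical and are not taken from the experiment scripts.

```python
from statistics import mean, stdev


def summarise(fold_scores):
    """Format per-fold scores as 'mean±std', using the sample standard deviation."""
    return f"{mean(fold_scores):.2f}±{stdev(fold_scores):.2f}"


# Hypothetical per-fold recall values (percent), one per fold of a 10-fold run.
folds = [92.0, 94.0, 93.0, 95.0, 91.0, 93.5, 92.5, 94.5, 93.0, 92.5]
print(summarise(folds))
```

Entries without a ± part (for example the 1st rows for some languages) carry no standard deviation in the tables.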

In the following table the values represent tagger recall (= [true positives]/[total tokens]):

System Language
Catalan Spanish Serbo-Croatian Russian Kazakh Portuguese Swedish Italian
23,673 20,487 20,071 1,052 13,714 6,725 369 5,201
1st 86.50 90.34 44.99±1.20 38.19 72.08 76.70 34.70 82.28±3.05
Bigram (unsup, 0 iters) 88.96±1.12 88.49±1.54 47.31±1.24 81.41±5.78 79.16±3.12
Bigram (unsup, 50 iters) 91.74±1.15 91.13±1.52 48.28±1.33 81.09±5.99 84.93±2.71
Bigram (unsup, 250 iters) 91.51±1.16 90.85±1.48 48.05±1.47 80.31±6.60 84.52±2.78
Lwsw (0 iters) 92.73±0.89 92.86±0.95 43.56±1.20 83.01±5.47 86.12±2.96
Lwsw (50 iters) 92.98±0.85 93.01±1.02 45.09±1.15 82.70±5.76 86.07±2.68
Lwsw (250 iters) 92.99±0.84 93.06±1.02 45.13±1.17 82.75±5.79 86.08±2.67
CG→1st 88.05 91.10 64.01±1.04 39.81 81.56 87.99 42.90 83.29±3.07
CG→Bigram (unsup, 0 iters) 91.83±1.03 91.39±1.42 60.37±1.45 86.77±6.33 81.31±3.10
CG→Bigram (unsup, 50 iters) 93.16±1.39 92.53±1.29 60.91±1.65 87.48±6.16 86.11±2.46
CG→Bigram (unsup, 250 iters) 92.99±1.38 92.50±1.23 60.88±1.66 87.20±6.72 86.01±2.59
CG→Lwsw (0 iters) 93.17±1.08 92.72±1.09 59.93±1.46 86.60±6.20 85.64±2.83
CG→Lwsw (50 iters) 93.37±1.02 92.74±1.16 60.38±1.57 86.54±6.21 85.55±2.72
CG→Lwsw (250 iters) 93.38±1.05 92.77±1.18 60.42±1.53 86.54±6.20 85.54±2.72
Unigram model 1 93.86±1.13 93.96±0.98 63.96±0.92 39.11±8.91 80.63±3.87 86.00±6.63 46.48±5.78 89.37±1.63
Unigram model 2 93.90±1.09 93.69±0.94 67.51±0.67 40.36±8.59 82.19±3.70 87.13±6.23 47.12±8.29 89.23±0.97
Unigram model 3 93.88±1.08 93.67±0.94 67.47±0.64 40.36±8.59 82.45±3.80 87.11±6.13 47.12±8.29 89.00±0.95
Bigram (sup) 96.00±0.87 95.47±1.07 55.26±0.87 88.07±6.50
CG→Unigram model 1 94.34±1.11 94.73±0.88 68.42±0.69 40.71±9.39 84.54±3.29 88.42±6.55 46.84±5.48 89.04±1.45
CG→Unigram model 2 94.11±1.09 94.33±0.82 68.93±0.72 41.43±9.21 84.62±3.47 88.64±6.13 47.07±7.39 88.67±0.93
CG→Unigram model 3 94.09±1.08 94.31±0.81 68.88±0.72 41.43±9.21 84.71±3.54 88.63±6.07 47.07±7.39 88.45±0.94
CG→Bigram (sup) 96.00±1.13 94.88±1.18 65.66±1.16 88.73±6.36
Percep (coarsebigram) 94.02±1.26 94.79±0.86 55.64±1.17 87.04±6.23 90.87±0.87
Percep (kaztags) 93.66±0.76 94.28±0.93 70.44±0.92 91.41±2.09 87.07±6.16 99.70±0.96 90.64±1.13
Percep (spacycoarsetags) 95.06±1.01 95.23±0.66 56.34±1.21 87.32±6.22 90.96±0.76
Percep (spacyflattags) 95.25±0.85 95.46±0.64 73.02±1.12 91.91±2.13 87.45±6.24 99.70±0.96 90.13±1.37
Percep (unigram) 93.59±0.77 94.09±0.96 70.11±0.97 91.08±2.13 87.16±6.22 99.70±0.96 90.23±0.95
CG→Percep (coarsebigram) 94.01±1.28 94.75±0.69 67.32±0.96 88.70±6.29 89.25±1.17
CG→Percep (kaztags) 93.91±0.90 94.72±0.88 72.79±1.11 87.73±3.12 88.72±6.23 94.34±3.16 89.82±1.29
CG→Percep (spacycoarsetags) 94.93±1.12 95.16±0.78 67.81±1.11 88.83±6.13 89.88±1.03
CG→Percep (spacyflattags) 95.19±0.98 95.40±0.66 72.80±0.76 87.62±2.83 88.85±6.21 94.34±3.16 89.34±1.24
CG→Percep (unigram) 93.87±0.92 94.73±0.77 72.42±0.86 87.52±3.09 88.81±6.28 94.34±3.16 89.39±1.24
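
The recall figure used in the table above is simply the share of all tokens that received the correct tag. A minimal sketch, assuming the gold and predicted tags are available as two parallel lists (the function name and the example tags are illustrative only):

```python
def recall(predicted, gold):
    """Percentage of all tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("empty corpus")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Hypothetical five-token sentence with one tagging error.
gold = ["det", "n", "vblex", "pr", "n"]
predicted = ["det", "n", "n", "pr", "n"]
print(f"{recall(predicted, gold):.2f}")
```

Because the denominator is the total token count, tokens for which the morphological analyser offers no correct analysis always count as errors under this measure.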

In the following table the values represent availability-adjusted tagger recall (= [true positives]/[words with a correct analysis from the morphological parser]). This data is also available in box plot form.

System Language
Catalan Spanish Serbo-Croatian Russian Kazakh Portuguese Swedish Italian
23,673 20,487 20,071 1,052 13,714 6,725 369 5,201
1st 87.86 91.82 52.56±1.53 75.93 77.72 83.00 64.47 82.77±3.09
Bigram (unsup, 0 iters) 90.35±1.17 89.95±1.45 55.27±1.63 89.72±2.06 79.64±3.11
Bigram (unsup, 50 iters) 93.17±1.21 92.63±1.40 56.40±1.70 89.35±1.99 85.45±2.78
Bigram (unsup, 250 iters) 92.94±1.22 92.35±1.33 56.13±1.87 88.45±2.51 85.03±2.87
Lwsw (0 iters) 94.18±0.91 94.40±0.77 50.88±1.54 91.51±1.22 86.64±3.15
Lwsw (50 iters) 94.44±0.81 94.54±0.83 52.67±1.46 91.14±1.62 86.59±2.82
Lwsw (250 iters) 94.44±0.79 94.60±0.84 52.72±1.50 91.20±1.64 86.60±2.81
CG→1st 89.44 92.60 74.77±1.32 79.10 87.95 95.22 79.70 83.79±3.08
CG→Bigram (unsup, 0 iters) 93.27±1.10 92.90±1.30 70.52±1.71 95.61±1.77 81.80±3.08
CG→Bigram (unsup, 50 iters) 94.62±1.49 94.05±1.13 71.15±1.94 96.41±1.38 86.63±2.51
CG→Bigram (unsup, 250 iters) 94.45±1.48 94.03±1.09 71.11±1.95 96.06±2.05 86.53±2.62
CG→Lwsw (0 iters) 94.63±1.08 94.25±0.91 70.00±1.74 95.43±1.52 86.16±2.97
CG→Lwsw (50 iters) 94.83±1.01 94.27±0.97 70.53±1.86 95.36±1.54 86.07±2.79
CG→Lwsw (250 iters) 94.84±1.03 94.30±0.99 70.58±1.81 95.36±1.53 86.06±2.79
Unigram model 1 95.33±1.05 95.51±0.84 74.72±1.43 77.54±6.51 87.03±3.03 94.74±2.44 89.26±7.32 89.91±1.93
Unigram model 2 95.37±1.04 95.23±0.77 78.87±1.05 80.06±6.11 88.72±2.76 96.01±1.70 89.82±7.70 89.77±1.23
Unigram model 3 95.35±1.03 95.22±0.79 78.82±1.06 80.06±6.11 88.99±2.83 95.99±1.52 89.82±7.70 89.54±1.25
Bigram (sup) 97.50±0.93 97.04±0.86 64.55±1.33 97.03±1.75
CG→Unigram model 1 95.82±1.06 96.30±0.68 79.92±0.95 80.56±6.70 91.25±2.01 97.42±1.76 90.00±6.99 89.58±1.75
CG→Unigram model 2 95.58±1.07 95.89±0.59 80.51±0.95 82.06±6.50 91.33±2.15 97.70±1.32 89.97±7.50 89.21±1.13
CG→Unigram model 3 95.56±1.05 95.86±0.60 80.46±0.99 82.06±6.50 91.43±2.26 97.69±1.28 89.97±7.50 88.98±1.18
CG→Bigram (sup) 97.51±1.21 96.45±0.93 76.70±1.46 97.78±1.52
Percep (coarsebigram) 95.71±1.36 96.60±0.75 61.99±1.24 95.92±1.60 92.89±1.10
Percep (kaztags) 95.34±0.77 96.08±0.69 78.47±0.99 91.41±2.08 95.95±1.69 99.70±0.96 92.67±1.31
Percep (spacycoarsetags) 96.76±1.06 97.05±0.56 62.77±1.29 96.22±1.52 92.99±0.93
Percep (spacyflattags) 96.96±0.87 97.28±0.58 81.35±1.19 91.92±2.12 96.37±1.53 99.70±0.96 92.14±1.44
Percep (unigram) 95.27±0.76 95.89±0.74 78.11±1.03 91.08±2.12 96.05±1.64 99.70±0.96 92.24±1.11
CG→Percep (coarsebigram) 95.70±1.37 96.55±0.55 75.00±1.04 97.75±1.47 91.25±1.50
CG→Percep (kaztags) 95.59±0.92 96.53±0.66 81.10±1.20 87.74±3.11 97.78±1.41 94.34±3.16 91.83±1.50
CG→Percep (spacycoarsetags) 96.64±1.17 96.98±0.64 75.54±1.31 97.90±1.30 91.89±1.20
CG→Percep (spacyflattags) 96.90±1.02 97.22±0.51 81.10±0.86 87.62±2.82 97.92±1.38 94.34±3.16 91.34±1.42
CG→Percep (unigram) 95.55±0.92 96.54±0.52 80.68±0.93 87.52±3.08 97.87±1.47 94.34±3.16 91.38±1.40
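
The availability-adjusted figure differs from plain recall only in its denominator: tokens for which the analyser never offered the correct analysis are left out. A minimal sketch, assuming each token is represented as a hypothetical (gold, candidates, predicted) triple:

```python
def adjusted_recall(tokens):
    """Recall over only those tokens whose gold analysis is among the candidates.

    tokens: iterable of (gold, candidates, predicted) triples.
    """
    available = [(gold, pred) for gold, cands, pred in tokens if gold in cands]
    if not available:
        raise ValueError("no token has a correct analysis available")
    correct = sum(gold == pred for gold, pred in available)
    return 100.0 * correct / len(available)


# Hypothetical data: the third token's gold tag is missing from the analyser
# output, so it is excluded from the denominator.
tokens = [
    ("n", ["n", "vblex"], "n"),
    ("det", ["det", "prn"], "prn"),
    ("adj", ["n"], "n"),
    ("pr", ["pr"], "pr"),
]
print(f"{adjusted_recall(tokens):.2f}")
```

Dropping unanalysable tokens can only shrink the denominator, which is why every figure in this table is at least as high as its counterpart in the previous one (compare Russian 1st: 38.19 against 75.93).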

In the following table, the intervals represent the [low, high] values from 10-fold cross validation.

Language Corpus System
Sent Tok Amb 1st CG+1st Unigram CG+Unigram apertium-tagger CG+apertium-tagger
Catalan 1,413 24,144 ? 81.85 83.96 [75.65, 78.46] [87.76, 90.48] [94.16, 96.28] [93.92, 96.16]
Spanish 1,271 21,247 ? 86.18 86.71 [78.20, 80.06] [87.72, 90.27] [90.15, 94.86] [91.84, 93.70]
Serbo-Croatian 1,190 20,128 ? 75.22 79.67 [75.36, 78.79] [75.36, 77.28]
Russian 451 10,171 ? 75.63 79.52 [70.49, 72.94] [74.68, 78.65] n/a n/a
Kazakh 403 4,348 ? 80.79 86.19 [84.36, 87.79] [85.56, 88.72] n/a n/a
Portuguese 119 3,823 ? 72.54 87.34 [77.10, 87.72] [84.05, 91.96]
Swedish 11 239 ? 72.90 73.86 [56.00, 82.97]

Sent = sentences, Tok = tokens, Amb = average ambiguity from the morphological analyser
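
A rough sketch of how a [low, high] interval can be obtained from 10-fold cross validation follows. The fold assignment shown here (every k-th item goes to the same fold) is an assumption made for illustration; the experiment scripts may split the corpus differently, and the scores are made up.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; fold i holds out items i, i+k, i+2k, ..."""
    for i in range(k):
        test = list(range(i, n_items, k))
        held_out = set(test)
        train = [j for j in range(n_items) if j not in held_out]
        yield train, test


def interval(fold_scores):
    """Format the lowest and highest per-fold score as '[low, high]'."""
    return f"[{min(fold_scores):.2f}, {max(fold_scores):.2f}]"


# With 11 sentences, every fold holds out only one or two sentences.
fold_sizes = [len(test) for _, test in k_fold_indices(11, k=10)]
print(fold_sizes)

# Hypothetical per-fold scores.
print(interval([90.5, 88.25, 91.0]))
```

If the folds are split by sentence, a corpus as small as the Swedish one leaves one or two sentences per test fold, which is consistent with the very wide interval reported for it.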

Systems

  • 1st: Selects the first analysis from the morphological analyser
  • CG: Preprocesses the input with the Constraint Grammar (CG) rules from the language's monolingual package in languages.
  • Unigram: Lexicalised unigram tagger
  • apertium-tagger: Uses the bigram HMM tagger included with Apertium.
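
To make the two simplest systems concrete, here is a simplified sketch of a first-analysis baseline and a lexicalised unigram tagger. It is not the implementation used in the experiments and does not reproduce any of the three unigram models in the tables; all function names and the example analyses are hypothetical.

```python
from collections import Counter, defaultdict


def first_analysis(candidates):
    """'1st' baseline: take the first analysis the morphological analyser lists."""
    return candidates[0]


def train_unigram(tagged_tokens):
    """Count how often each surface form was given each analysis in a tagged corpus."""
    counts = defaultdict(Counter)
    for surface, analysis in tagged_tokens:
        counts[surface.lower()][analysis] += 1
    return counts


def unigram_tag(surface, candidates, counts):
    """Pick the candidate seen most often with this surface form.

    max() returns the first candidate on ties, so a surface form that never
    occurred in training falls back to the first-analysis baseline.
    """
    seen = counts.get(surface.lower(), Counter())
    return max(candidates, key=lambda analysis: seen[analysis])


# Hypothetical training data and analyser output.
training = [
    ("casa", "casa<n><f><sg>"),
    ("casa", "casa<n><f><sg>"),
    ("casa", "casar<vblex><pri><p3><sg>"),
]
candidates = ["casar<vblex><pri><p3><sg>", "casa<n><f><sg>"]
counts = train_unigram(training)
print(first_analysis(candidates))
print(unigram_tag("casa", candidates, counts))
```

In this sketch the CG step would sit in front of either tagger and remove some entries from `candidates` before a choice is made.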

Corpora

The tagged corpora used in the experiments are found in the monolingual packages in languages, under the texts/ subdirectory.