Difference between revisions of "Comparison of part-of-speech tagging systems"
(Add kaz)
Revision as of 08:06, 23 August 2016
Apertium aims for high-quality part-of-speech tagging, but in many cases falls below the state of the art (around 97% tagging accuracy). This page collects a comparison of the tagging systems in Apertium and gives some ideas for improving them.
The scripts used to generate these results are written in Python and are available from SVN under /branches/apertium-tagger/experiments/: https://svn.code.sf.net/p/apertium/svn/branches/apertium-tagger/experiments/
In the following two tables, values of the form x±y are the sample mean and standard deviation of the results of 10-fold cross validation.
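As a minimal sketch of how an x±y entry can be derived from per-fold scores: the function names and the fold scores below are hypothetical, and the sketch assumes the n-1 (sample) form of the standard deviation, which may differ from what the experiment scripts actually do.

```python
from statistics import mean, stdev


def summarise_folds(scores):
    """Return (sample mean, sample standard deviation) of per-fold scores."""
    # stdev() divides by n - 1; this is an assumption about the reported values.
    return mean(scores), stdev(scores)


def format_result(scores):
    """Format per-fold scores in the x±y style used in the tables."""
    m, s = summarise_folds(scores)
    return f"{m:.2f}±{s:.2f}"


# Hypothetical recall percentages for 10 folds (not taken from the experiments).
fold_scores = [90.0, 92.0, 91.0, 93.0, 89.0, 91.0, 92.0, 90.0, 91.0, 91.0]
print(format_result(fold_scores))
```

With only ten folds the spread can be large for small corpora, which is consistent with the wide intervals seen for the smaller languages.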
In the following table the values represent tagger recall (= [true positives]/[total tokens]):
System | Language | |||||||
---|---|---|---|---|---|---|---|---|
| | Catalan | Spanish | Serbo-Croatian | Russian | Kazakh | Portuguese | Swedish | Italian |
| | 23,673 | 20,487 | 20,071 | 1,052 | 13,714 | 6,725 | 369 | 5,201 |
1st | 86.50 | 90.34 | 44.99±1.20 | 38.19 | 72.08 | 76.70 | 34.70 | 82.28±3.05 |
Bigram (unsup, 0 iters) | 88.96±1.12 | 88.49±1.54 | 47.31±1.24 | 81.41±5.78 | 79.16±3.12 | |||
Bigram (unsup, 50 iters) | 91.74±1.15 | 91.13±1.52 | 48.28±1.33 | 81.09±5.99 | 84.93±2.71 | |||
Bigram (unsup, 250 iters) | 91.51±1.16 | 90.85±1.48 | 48.05±1.47 | 80.31±6.60 | 84.52±2.78 | |||
Lwsw (0 iters) | 92.73±0.89 | 92.86±0.95 | 43.56±1.20 | 83.01±5.47 | 86.12±2.96 | |||
Lwsw (50 iters) | 92.98±0.85 | 93.01±1.02 | 45.09±1.15 | 82.70±5.76 | 86.07±2.68 | |||
Lwsw (250 iters) | 92.99±0.84 | 93.06±1.02 | 45.13±1.17 | 82.75±5.79 | 86.08±2.67 | |||
CG→1st | 88.05 | 91.10 | 64.01±1.04 | 39.81 | 81.56 | 87.99 | 42.90 | 83.29±3.07 |
CG→Bigram (unsup, 0 iters) | 91.83±1.03 | 91.39±1.42 | 60.37±1.45 | 86.77±6.33 | 81.31±3.10 | |||
CG→Bigram (unsup, 50 iters) | 93.16±1.39 | 92.53±1.29 | 60.91±1.65 | 87.48±6.16 | 86.11±2.46 | |||
CG→Bigram (unsup, 250 iters) | 92.99±1.38 | 92.50±1.23 | 60.88±1.66 | 87.20±6.72 | 86.01±2.59 | |||
CG→Lwsw (0 iters) | 93.17±1.08 | 92.72±1.09 | 59.93±1.46 | 86.60±6.20 | 85.64±2.83 | |||
CG→Lwsw (50 iters) | 93.37±1.02 | 92.74±1.16 | 60.38±1.57 | 86.54±6.21 | 85.55±2.72 | |||
CG→Lwsw (250 iters) | 93.38±1.05 | 92.77±1.18 | 60.42±1.53 | 86.54±6.20 | 85.54±2.72 | |||
Unigram model 1 | 93.86±1.13 | 93.96±0.98 | 63.96±0.92 | 39.11±8.91 | 80.63±3.87 | 86.00±6.63 | 46.48±5.78 | 89.37±1.63 |
Unigram model 2 | 93.90±1.09 | 93.69±0.94 | 67.51±0.67 | 40.36±8.59 | 82.19±3.70 | 87.13±6.23 | 47.12±8.29 | 89.23±0.97 |
Unigram model 3 | 93.88±1.08 | 93.67±0.94 | 67.47±0.64 | 40.36±8.59 | 82.45±3.80 | 87.11±6.13 | 47.12±8.29 | 89.00±0.95 |
Bigram (sup) | 96.00±0.87 | 95.47±1.07 | 55.26±0.87 | 88.07±6.50 | ||||
CG→Unigram model 1 | 94.34±1.11 | 94.73±0.88 | 68.42±0.69 | 40.71±9.39 | 84.54±3.29 | 88.42±6.55 | 46.84±5.48 | 89.04±1.45 |
CG→Unigram model 2 | 94.11±1.09 | 94.33±0.82 | 68.93±0.72 | 41.43±9.21 | 84.62±3.47 | 88.64±6.13 | 47.07±7.39 | 88.67±0.93 |
CG→Unigram model 3 | 94.09±1.08 | 94.31±0.81 | 68.88±0.72 | 41.43±9.21 | 84.71±3.54 | 88.63±6.07 | 47.07±7.39 | 88.45±0.94 |
CG→Bigram (sup) | 96.00±1.13 | 94.88±1.18 | 65.66±1.16 | 88.73±6.36 | ||||
Percep (coarsebigram) | 94.02±1.26 | 94.79±0.86 | 55.64±1.17 | | | 87.04±6.23 | | 90.87±0.87 |
Percep (kaztags) | 93.66±0.76 | 94.28±0.93 | 70.44±0.92 | | 91.41±2.09 | 87.07±6.16 | 99.70±0.96 | 90.64±1.13 |
Percep (spacycoarsetags) | 95.06±1.01 | 95.23±0.66 | 56.34±1.21 | | | 87.32±6.22 | | 90.96±0.76 |
Percep (spacyflattags) | 95.25±0.85 | 95.46±0.64 | 73.02±1.12 | | 91.91±2.13 | 87.45±6.24 | 99.70±0.96 | 90.13±1.37 |
Percep (unigram) | 93.59±0.77 | 94.09±0.96 | 70.11±0.97 | | 91.08±2.13 | 87.16±6.22 | 99.70±0.96 | 90.23±0.95 |
CG→Percep (coarsebigram) | 94.01±1.28 | 94.75±0.69 | 67.32±0.96 | | | 88.70±6.29 | | 89.25±1.17 |
CG→Percep (kaztags) | 93.91±0.90 | 94.72±0.88 | 72.79±1.11 | | 87.73±3.12 | 88.72±6.23 | 94.34±3.16 | 89.82±1.29 |
CG→Percep (spacycoarsetags) | 94.93±1.12 | 95.16±0.78 | 67.81±1.11 | | | 88.83±6.13 | | 89.88±1.03 |
CG→Percep (spacyflattags) | 95.19±0.98 | 95.40±0.66 | 72.80±0.76 | | 87.62±2.83 | 88.85±6.21 | 94.34±3.16 | 89.34±1.24 |
CG→Percep (unigram) | 93.87±0.92 | 94.73±0.77 | 72.42±0.86 | | 87.52±3.09 | 88.81±6.28 | 94.34±3.16 | 89.39±1.24 |
In the following table the values represent availability-adjusted tagger recall (= [true positives]/[words with a correct analysis from the morphological parser]). This data is also available in box plot form here:
System | Language | |||||||
---|---|---|---|---|---|---|---|---|
| | Catalan | Spanish | Serbo-Croatian | Russian | Kazakh | Portuguese | Swedish | Italian |
| | 23,673 | 20,487 | 20,071 | 1,052 | 13,714 | 6,725 | 369 | 5,201 |
1st | 87.86 | 91.82 | 52.56±1.53 | 75.93 | 77.72 | 83.00 | 64.47 | 82.77±3.09 |
Bigram (unsup, 0 iters) | 90.35±1.17 | 89.95±1.45 | 55.27±1.63 | 89.72±2.06 | 79.64±3.11 | |||
Bigram (unsup, 50 iters) | 93.17±1.21 | 92.63±1.40 | 56.40±1.70 | 89.35±1.99 | 85.45±2.78 | |||
Bigram (unsup, 250 iters) | 92.94±1.22 | 92.35±1.33 | 56.13±1.87 | 88.45±2.51 | 85.03±2.87 | |||
Lwsw (0 iters) | 94.18±0.91 | 94.40±0.77 | 50.88±1.54 | 91.51±1.22 | 86.64±3.15 | |||
Lwsw (50 iters) | 94.44±0.81 | 94.54±0.83 | 52.67±1.46 | 91.14±1.62 | 86.59±2.82 | |||
Lwsw (250 iters) | 94.44±0.79 | 94.60±0.84 | 52.72±1.50 | 91.20±1.64 | 86.60±2.81 | |||
CG→1st | 89.44 | 92.60 | 74.77±1.32 | 79.10 | 87.95 | 95.22 | 79.70 | 83.79±3.08 |
CG→Bigram (unsup, 0 iters) | 93.27±1.10 | 92.90±1.30 | 70.52±1.71 | 95.61±1.77 | 81.80±3.08 | |||
CG→Bigram (unsup, 50 iters) | 94.62±1.49 | 94.05±1.13 | 71.15±1.94 | 96.41±1.38 | 86.63±2.51 | |||
CG→Bigram (unsup, 250 iters) | 94.45±1.48 | 94.03±1.09 | 71.11±1.95 | 96.06±2.05 | 86.53±2.62 | |||
CG→Lwsw (0 iters) | 94.63±1.08 | 94.25±0.91 | 70.00±1.74 | 95.43±1.52 | 86.16±2.97 | |||
CG→Lwsw (50 iters) | 94.83±1.01 | 94.27±0.97 | 70.53±1.86 | 95.36±1.54 | 86.07±2.79 | |||
CG→Lwsw (250 iters) | 94.84±1.03 | 94.30±0.99 | 70.58±1.81 | 95.36±1.53 | 86.06±2.79 | |||
Unigram model 1 | 95.33±1.05 | 95.51±0.84 | 74.72±1.43 | 77.54±6.51 | 87.03±3.03 | 94.74±2.44 | 89.26±7.32 | 89.91±1.93 |
Unigram model 2 | 95.37±1.04 | 95.23±0.77 | 78.87±1.05 | 80.06±6.11 | 88.72±2.76 | 96.01±1.70 | 89.82±7.70 | 89.77±1.23 |
Unigram model 3 | 95.35±1.03 | 95.22±0.79 | 78.82±1.06 | 80.06±6.11 | 88.99±2.83 | 95.99±1.52 | 89.82±7.70 | 89.54±1.25 |
Bigram (sup) | 97.50±0.93 | 97.04±0.86 | 64.55±1.33 | 97.03±1.75 | ||||
CG→Unigram model 1 | 95.82±1.06 | 96.30±0.68 | 79.92±0.95 | 80.56±6.70 | 91.25±2.01 | 97.42±1.76 | 90.00±6.99 | 89.58±1.75 |
CG→Unigram model 2 | 95.58±1.07 | 95.89±0.59 | 80.51±0.95 | 82.06±6.50 | 91.33±2.15 | 97.70±1.32 | 89.97±7.50 | 89.21±1.13 |
CG→Unigram model 3 | 95.56±1.05 | 95.86±0.60 | 80.46±0.99 | 82.06±6.50 | 91.43±2.26 | 97.69±1.28 | 89.97±7.50 | 88.98±1.18 |
CG→Bigram (sup) | 97.51±1.21 | 96.45±0.93 | 76.70±1.46 | 97.78±1.52 | ||||
Percep (coarsebigram) | 95.71±1.36 | 96.60±0.75 | 61.99±1.24 | | | 95.92±1.60 | | 92.89±1.10 |
Percep (kaztags) | 95.34±0.77 | 96.08±0.69 | 78.47±0.99 | | 91.41±2.08 | 95.95±1.69 | 99.70±0.96 | 92.67±1.31 |
Percep (spacycoarsetags) | 96.76±1.06 | 97.05±0.56 | 62.77±1.29 | | | 96.22±1.52 | | 92.99±0.93 |
Percep (spacyflattags) | 96.96±0.87 | 97.28±0.58 | 81.35±1.19 | | 91.92±2.12 | 96.37±1.53 | 99.70±0.96 | 92.14±1.44 |
Percep (unigram) | 95.27±0.76 | 95.89±0.74 | 78.11±1.03 | | 91.08±2.12 | 96.05±1.64 | 99.70±0.96 | 92.24±1.11 |
CG→Percep (coarsebigram) | 95.70±1.37 | 96.55±0.55 | 75.00±1.04 | | | 97.75±1.47 | | 91.25±1.50 |
CG→Percep (kaztags) | 95.59±0.92 | 96.53±0.66 | 81.10±1.20 | | 87.74±3.11 | 97.78±1.41 | 94.34±3.16 | 91.83±1.50 |
CG→Percep (spacycoarsetags) | 96.64±1.17 | 96.98±0.64 | 75.54±1.31 | | | 97.90±1.30 | | 91.89±1.20 |
CG→Percep (spacyflattags) | 96.90±1.02 | 97.22±0.51 | 81.10±0.86 | | 87.62±2.82 | 97.92±1.38 | 94.34±3.16 | 91.34±1.42 |
CG→Percep (unigram) | 95.55±0.92 | 96.54±0.52 | 80.68±0.93 | | 87.52±3.08 | 97.87±1.47 | 94.34±3.16 | 91.38±1.40 |
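The two measures used in the tables above differ only in their denominator. The following sketch shows one way to compute both; the triple layout, the function name and the sample data are assumptions made for illustration and are not taken from the experiment scripts.

```python
def recall_scores(tokens):
    """Compute (recall, availability-adjusted recall) as percentages.

    tokens: list of (gold, analyses, predicted) triples, where analyses is
    the list of analyses offered by the morphological analyser for the token.
    """
    total = len(tokens)
    # True positives: the tagger's choice matches the gold analysis.
    correct = sum(1 for gold, _, pred in tokens if pred == gold)
    # Tokens for which the analyser offered the gold analysis at all.
    available = sum(1 for gold, analyses, _ in tokens if gold in analyses)
    recall = 100.0 * correct / total if total else 0.0
    adjusted = 100.0 * correct / available if available else 0.0
    return recall, adjusted


# Hypothetical data: the analyser lacks the gold analysis for the last token.
sample = [
    ("n", ["n", "vblex"], "n"),
    ("det", ["det", "prn"], "det"),
    ("vblex", ["n", "vblex"], "n"),
    ("adj", ["n"], "n"),
]
recall, adjusted = recall_scores(sample)
print(f"recall={recall:.2f} adjusted={adjusted:.2f}")
```

Because the adjusted denominator excludes tokens the analyser cannot get right, the adjusted figure is never lower than plain recall, and the gap between the two indicates how much of the error is due to analyser coverage rather than to the tagger.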
In the following table, the intervals represent the [low, high] values from 10-fold cross validation.
Language | Corpus | System | |||||||
---|---|---|---|---|---|---|---|---|---|
| | Sent | Tok | Amb | 1st | CG+1st | Unigram | CG+Unigram | apertium-tagger | CG+apertium-tagger |
Catalan | 1,413 | 24,144 | ? | 81.85 | 83.96 | [75.65, 78.46] | [87.76, 90.48] | [94.16, 96.28] | [93.92, 96.16] |
Spanish | 1,271 | 21,247 | ? | 86.18 | 86.71 | [78.20, 80.06] | [87.72, 90.27] | [90.15, 94.86] | [91.84, 93.70] |
Serbo-Croatian | 1,190 | 20,128 | ? | 75.22 | 79.67 | [75.36, 78.79] | [75.36, 77.28] | ||
Russian | 451 | 10,171 | ? | 75.63 | 79.52 | [70.49, 72.94] | [74.68, 78.65] | n/a | n/a |
Kazakh | 403 | 4,348 | ? | 80.79 | 86.19 | [84.36, 87.79] | [85.56, 88.72] | n/a | n/a |
Portuguese | 119 | 3,823 | ? | 72.54 | 87.34 | [77.10, 87.72] | [84.05, 91.96] | ||
Swedish | 11 | 239 | ? | 72.90 | 73.86 | [56.00, 82.97] |
Sent = sentences, Tok = tokens, Amb = average ambiguity from the morphological analyser
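A [low, high] interval of this kind can be obtained by scoring each of the 10 folds and taking the minimum and maximum. The sketch below assumes contiguous, unshuffled folds over sentences; the function names are hypothetical and the actual scripts may split the corpus differently.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def low_high(scores):
    """Return the [low, high] interval of per-fold scores."""
    return [min(scores), max(scores)]


# 25 hypothetical sentences split into 10 folds.
for train, test in k_fold_indices(25, k=10):
    print(len(train), len(test))
```

With very small corpora, such as the 11 Swedish sentences listed above, most folds contain a single sentence, so a wide interval is expected.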
Systems
1st
: Selects the first analysis from the morphological analyser.
CG
: Uses the CG (from the monolingual language package in languages) to preprocess the input.
Unigram
: Lexicalised unigram tagger.
apertium-tagger
: Uses the bigram HMM tagger included with Apertium.
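To make the two simplest systems concrete, here is a rough sketch of a first-analysis baseline and a lexicalised unigram tagger. It is an illustration only: the function names and data are hypothetical, and the counting scheme (most frequent analysis per lowercased surface form, falling back to the first analysis) is an assumption and not the definition of any of the unigram models in apertium-tagger.

```python
from collections import Counter, defaultdict


def tag_first(analyses):
    """'1st' baseline: take the first analysis the analyser returns."""
    return analyses[0]


def train_unigram(tagged_corpus):
    """Count how often each surface form was tagged with each analysis."""
    counts = defaultdict(Counter)
    for surface, gold in tagged_corpus:
        counts[surface.lower()][gold] += 1
    return counts


def tag_unigram(counts, surface, analyses):
    """Pick the analysis seen most often with this surface form.

    Unseen forms and ties fall back to the earliest analysis in the list,
    because max() returns the first maximal element.
    """
    seen = counts.get(surface.lower(), Counter())
    return max(analyses, key=lambda a: seen[a])


# Hypothetical training data: (surface form, gold analysis) pairs.
corpus = [("wind", "n"), ("wind", "n"), ("wind", "vblex")]
model = train_unigram(corpus)
print(tag_first(["vblex", "n"]))
print(tag_unigram(model, "wind", ["vblex", "n"]))
```

The CG variants in the tables would apply the same taggers after the constraint grammar has already discarded some of the analyses.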
Corpora
The tagged corpora used in the experiments are found in the monolingual packages in languages, under the texts/ subdirectory.
Todo
- Implement this tagger: https://spacy.io/blog/part-of-speech-POS-tagger-in-python