Difference between revisions of "Comparison of part-of-speech tagging systems"
Revision as of 19:38, 17 July 2016
Apertium aims for high-quality part-of-speech tagging, but in many cases its taggers fall below the state of the art (around 97% tagging accuracy). This page collects a comparison of the tagging systems in Apertium and gives some ideas of what could be done to improve them.
In the following two tables, values of the form x±y are the sample mean and standard deviation of the results of 10-fold cross validation.
In the following table the values represent tagger recall (= [true positives]/[total tokens]):
System | Catalan | Spanish | Serbo-Croatian | Russian | Kazakh | Portuguese | Swedish | Italian |
---|---|---|---|---|---|---|---|---|
 | 23,673 | 20,487 | 20,071 | 1,052 | 13,714 | 6,725 | 369 | 5,201 |
1st | 86.50 | 90.34 | 44.99±1.20 | 38.19 | 72.08 | 76.70 | 34.70 | 82.28±3.05 |
Bigram (unsup, 0 iters) | 88.96±1.12 | 88.49±1.54 | 47.31±1.24 | | | 81.41±5.78 | | 79.16±3.12 |
Bigram (unsup, 50 iters) | 91.74±1.15 | 91.13±1.52 | 48.28±1.33 | | | 81.09±5.99 | | 84.93±2.71 |
Bigram (unsup, 250 iters) | 91.51±1.16 | 90.85±1.48 | 48.05±1.47 | | | 80.31±6.60 | | 84.52±2.78 |
Lwsw (0 iters) | 92.73±0.89 | 92.86±0.95 | 43.56±1.20 | | | 83.01±5.47 | | 86.12±2.96 |
Lwsw (50 iters) | 92.98±0.85 | 93.01±1.02 | 45.09±1.15 | | | 82.70±5.76 | | 86.07±2.68 |
Lwsw (250 iters) | 92.99±0.84 | 93.06±1.02 | 45.13±1.17 | | | 82.75±5.79 | | 86.08±2.67 |
CG→1st | 88.05 | 91.10 | 64.01±1.04 | 39.81 | 81.56 | 87.99 | 42.90 | 83.29±3.07 |
CG→Bigram (unsup, 0 iters) | 91.83±1.03 | 91.39±1.42 | 60.37±1.45 | | | 86.77±6.33 | | 81.31±3.10 |
CG→Bigram (unsup, 50 iters) | 93.16±1.39 | 92.53±1.29 | 60.91±1.65 | | | 87.48±6.16 | | 86.11±2.46 |
CG→Bigram (unsup, 250 iters) | 92.99±1.38 | 92.50±1.23 | 60.88±1.66 | | | 87.20±6.72 | | 86.01±2.59 |
CG→Lwsw (0 iters) | 93.17±1.08 | 92.72±1.09 | 59.93±1.46 | | | 86.60±6.20 | | 85.64±2.83 |
CG→Lwsw (50 iters) | 93.37±1.02 | 92.74±1.16 | 60.38±1.57 | | | 86.54±6.21 | | 85.55±2.72 |
CG→Lwsw (250 iters) | 93.38±1.05 | 92.77±1.18 | 60.42±1.53 | | | 86.54±6.20 | | 85.54±2.72 |
Unigram model 1 | 93.86±1.13 | 93.96±0.98 | 63.96±0.92 | 39.11±8.91 | 80.63±3.87 | 86.00±6.63 | 46.48±5.78 | 89.37±1.63 |
Unigram model 2 | 93.90±1.09 | 93.69±0.94 | 67.51±0.67 | 40.36±8.59 | 82.19±3.70 | 87.13±6.23 | 47.12±8.29 | 89.23±0.97 |
Unigram model 3 | 93.88±1.08 | 93.67±0.94 | 67.47±0.64 | 40.36±8.59 | 82.45±3.80 | 87.11±6.13 | 47.12±8.29 | 89.00±0.95 |
Bigram (sup) | 96.00±0.87 | 95.47±1.07 | 55.26±0.87 | | | 88.07±6.50 | | |
CG→Unigram model 1 | 94.34±1.11 | 94.73±0.88 | 68.42±0.69 | 40.71±9.39 | 84.54±3.29 | 88.42±6.55 | 46.84±5.48 | 89.04±1.45 |
CG→Unigram model 2 | 94.11±1.09 | 94.33±0.82 | 68.93±0.72 | 41.43±9.21 | 84.62±3.47 | 88.64±6.13 | 47.07±7.39 | 88.67±0.93 |
CG→Unigram model 3 | 94.09±1.08 | 94.31±0.81 | 68.88±0.72 | 41.43±9.21 | 84.71±3.54 | 88.63±6.07 | 47.07±7.39 | 88.45±0.94 |
CG→Bigram (sup) | 96.00±1.13 | 94.88±1.18 | 65.66±1.16 | | | 88.73±6.36 | | |
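The recall figures and the x±y summaries in the table above could be computed along the following lines. This is only a minimal sketch: the function names `recall` and `summarise` are hypothetical, and it assumes that "standard deviation" means the sample standard deviation (n−1 denominator), which this page does not state explicitly.

```python
from statistics import mean, stdev


def recall(gold, predicted):
    """Tagger recall: tokens tagged correctly divided by all tokens."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute recall on an empty corpus")
    true_positives = sum(g == p for g, p in zip(gold, predicted))
    return true_positives / len(gold)


def summarise(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores.

    Assumption: the n-1 (sample) estimator is the one intended.
    """
    return mean(fold_scores), stdev(fold_scores)


# Toy data, not taken from any Apertium corpus.
gold = ["n", "v", "adj", "n"]
predicted = ["n", "v", "n", "n"]
print(recall(gold, predicted))

m, s = summarise([0.90, 0.92, 0.94])
print(f"{100 * m:.2f}±{100 * s:.2f}")
```

With 10-fold cross validation, `summarise` would be given the ten per-fold recall values for one system and one language.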
In the following table the values represent availability-adjusted tagger recall (= [true positives]/[words with a correct analysis from the morphological parser]). This data is also available in box plot form.
System | Catalan | Spanish | Serbo-Croatian | Russian | Kazakh | Portuguese | Swedish | Italian |
---|---|---|---|---|---|---|---|---|
 | 23,673 | 20,487 | 20,071 | 1,052 | 13,714 | 6,725 | 369 | 5,201 |
1st | 87.86 | 91.82 | 52.56±1.53 | 75.93 | 77.72 | 83.00 | 64.47 | 82.77±3.09 |
Bigram (unsup, 0 iters) | 90.35±1.17 | 89.95±1.45 | 55.27±1.63 | | | 89.72±2.06 | | 79.64±3.11 |
Bigram (unsup, 50 iters) | 93.17±1.21 | 92.63±1.40 | 56.40±1.70 | | | 89.35±1.99 | | 85.45±2.78 |
Bigram (unsup, 250 iters) | 92.94±1.22 | 92.35±1.33 | 56.13±1.87 | | | 88.45±2.51 | | 85.03±2.87 |
Lwsw (0 iters) | 94.18±0.91 | 94.40±0.77 | 50.88±1.54 | | | 91.51±1.22 | | 86.64±3.15 |
Lwsw (50 iters) | 94.44±0.81 | 94.54±0.83 | 52.67±1.46 | | | 91.14±1.62 | | 86.59±2.82 |
Lwsw (250 iters) | 94.44±0.79 | 94.60±0.84 | 52.72±1.50 | | | 91.20±1.64 | | 86.60±2.81 |
CG→1st | 89.44 | 92.60 | 74.77±1.32 | 79.10 | 87.95 | 95.22 | 79.70 | 83.79±3.08 |
CG→Bigram (unsup, 0 iters) | 93.27±1.10 | 92.90±1.30 | 70.52±1.71 | | | 95.61±1.77 | | 81.80±3.08 |
CG→Bigram (unsup, 50 iters) | 94.62±1.49 | 94.05±1.13 | 71.15±1.94 | | | 96.41±1.38 | | 86.63±2.51 |
CG→Bigram (unsup, 250 iters) | 94.45±1.48 | 94.03±1.09 | 71.11±1.95 | | | 96.06±2.05 | | 86.53±2.62 |
CG→Lwsw (0 iters) | 94.63±1.08 | 94.25±0.91 | 70.00±1.74 | | | 95.43±1.52 | | 86.16±2.97 |
CG→Lwsw (50 iters) | 94.83±1.01 | 94.27±0.97 | 70.53±1.86 | | | 95.36±1.54 | | 86.07±2.79 |
CG→Lwsw (250 iters) | 94.84±1.03 | 94.30±0.99 | 70.58±1.81 | | | 95.36±1.53 | | 86.06±2.79 |
Unigram model 1 | 95.33±1.05 | 95.51±0.84 | 74.72±1.43 | 77.54±6.51 | 87.03±3.03 | 94.74±2.44 | 89.26±7.32 | 89.91±1.93 |
Unigram model 2 | 95.37±1.04 | 95.23±0.77 | 78.87±1.05 | 80.06±6.11 | 88.72±2.76 | 96.01±1.70 | 89.82±7.70 | 89.77±1.23 |
Unigram model 3 | 95.35±1.03 | 95.22±0.79 | 78.82±1.06 | 80.06±6.11 | 88.99±2.83 | 95.99±1.52 | 89.82±7.70 | 89.54±1.25 |
Bigram (sup) | 97.50±0.93 | 97.04±0.86 | 64.55±1.33 | | | 97.03±1.75 | | |
CG→Unigram model 1 | 95.82±1.06 | 96.30±0.68 | 79.92±0.95 | 80.56±6.70 | 91.25±2.01 | 97.42±1.76 | 90.00±6.99 | 89.58±1.75 |
CG→Unigram model 2 | 95.58±1.07 | 95.89±0.59 | 80.51±0.95 | 82.06±6.50 | 91.33±2.15 | 97.70±1.32 | 89.97±7.50 | 89.21±1.13 |
CG→Unigram model 3 | 95.56±1.05 | 95.86±0.60 | 80.46±0.99 | 82.06±6.50 | 91.43±2.26 | 97.69±1.28 | 89.97±7.50 | 88.98±1.18 |
CG→Bigram (sup) | 97.51±1.21 | 96.45±0.93 | 76.70±1.46 | | | 97.78±1.52 | | |
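The availability-adjusted figure differs from plain recall only in its denominator: tokens for which the morphological parser offers no correct analysis are left out. A minimal sketch follows, assuming each token is represented as a hypothetical triple of gold tag, candidate analyses and predicted tag; this representation is an illustration, not Apertium's data format.

```python
def availability_adjusted_recall(tokens):
    """Recall over tokens whose gold tag is among the parser's candidates.

    tokens: iterable of (gold_tag, candidate_tags, predicted_tag) triples.
    Tokens whose gold tag is missing from candidate_tags are excluded,
    since no tagger choosing among the candidates could get them right.
    """
    available = [(gold, pred) for gold, cands, pred in tokens if gold in cands]
    if not available:
        raise ValueError("no token has a correct analysis available")
    true_positives = sum(gold == pred for gold, pred in available)
    return true_positives / len(available)


# Toy example: the third token has no correct analysis, so it is skipped.
tokens = [
    ("n", ["n", "v"], "n"),
    ("v", ["n", "v"], "n"),
    ("adj", ["n"], "n"),
    ("det", ["det"], "det"),
]
print(availability_adjusted_recall(tokens))
```

On the toy data two of the three scorable tokens are correct, whereas plain recall over all four tokens would be lower. This is consistent with the figures in this table being at least as high as those in the previous one.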
In the following table, the intervals represent the [low, high] values from 10-fold cross validation.
Language | Sent | Tok | Amb | 1st | CG+1st | Unigram | CG+Unigram | apertium-tagger | CG+apertium-tagger |
---|---|---|---|---|---|---|---|---|---|
Catalan | 1,413 | 24,144 | ? | 81.85 | 83.96 | [75.65, 78.46] | [87.76, 90.48] | [94.16, 96.28] | [93.92, 96.16] |
Spanish | 1,271 | 21,247 | ? | 86.18 | 86.71 | [78.20, 80.06] | [87.72, 90.27] | [90.15, 94.86] | [91.84, 93.70] |
Serbo-Croatian | 1,190 | 20,128 | ? | 75.22 | 79.67 | [75.36, 78.79] | [75.36, 77.28] | ||
Russian | 451 | 10,171 | ? | 75.63 | 79.52 | [70.49, 72.94] | [74.68, 78.65] | n/a | n/a |
Kazakh | 403 | 4,348 | ? | 80.79 | 86.19 | [84.36, 87.79] | [85.56, 88.72] | n/a | n/a |
Portuguese | 119 | 3,823 | ? | 72.54 | 87.34 | [77.10, 87.72] | [84.05, 91.96] | ||
Swedish | 11 | 239 | ? | 72.90 | 73.86 | [56.00, 82.97] |
Sent = sentences, Tok = tokens, Amb = average ambiguity from the morphological analyser
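The [low, high] intervals can be read as the minimum and maximum score over the ten folds. The sketch below shows one way to build the folds and summarise the scores. The fold assignment (every k-th sentence) and the names `k_folds` and `low_high` are assumptions for illustration; this page does not say how the corpora were actually split.

```python
def k_folds(items, k=10):
    """Yield (train, test) pairs for k-fold cross validation.

    Assumption: fold i tests on every k-th item starting at index i
    and trains on the rest.
    """
    if k < 2 or k > len(items):
        raise ValueError("k must be between 2 and len(items)")
    for i in range(k):
        test = items[i::k]
        train = [item for j, item in enumerate(items) if j % k != i]
        yield train, test


def low_high(scores):
    """Return the (lowest, highest) score across folds."""
    return min(scores), max(scores)


# Toy corpus of 20 numbered "sentences".
sentences = list(range(20))
for train, test in k_folds(sentences, k=10):
    assert len(train) + len(test) == len(sentences)

print(low_high([75.65, 77.02, 78.46]))
```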
Systems
1st
: Selects the first analysis from the morphological analyser.
CG
: Uses the CG (from the monolingual language package in languages) to preprocess the input.
Unigram
: Lexicalised unigram tagger.
apertium-tagger
: Uses the bigram HMM tagger included with Apertium.
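To make the two simplest systems concrete, here is a rough sketch of a first-analysis baseline and a lexicalised unigram tagger. It is not Apertium's implementation and does not correspond to any particular one of unigram models 1 to 3, whose definitions are not given on this page. The function names and the back-off to the first analysis are assumptions.

```python
from collections import Counter, defaultdict


def first_baseline(analyses):
    """'1st': pick the first analysis offered by the morphological analyser."""
    return analyses[0]


def train_unigram(tagged_tokens):
    """Count how often each surface form was seen with each tag."""
    counts = defaultdict(Counter)
    for form, tag in tagged_tokens:
        counts[form.lower()][tag] += 1
    return counts


def unigram_tag(counts, form, analyses):
    """Pick the candidate analysis seen most often with this surface form.

    Ties, and forms never seen in training, fall back to the earliest
    candidate, which matches first_baseline (an assumed back-off).
    """
    seen = counts.get(form.lower(), Counter())
    # max() returns the first maximal element, so ties keep analyser order.
    return max(analyses, key=lambda tag: seen[tag])


# Toy training data, not taken from any Apertium corpus.
model = train_unigram([("saw", "v"), ("saw", "v"), ("saw", "n"), ("the", "det")])
print(unigram_tag(model, "saw", ["n", "v"]))
print(unigram_tag(model, "unknown", ["adj", "n"]))
```

A CG→ variant would first pass the candidate analyses through the constraint grammar and then apply the same selection to whatever candidates remain.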
Corpora
The tagged corpora used in the experiments are found in the monolingual packages in languages, under the texts/ subdirectory.
Todo
- Implement this tagger: https://spacy.io/blog/part-of-speech-POS-tagger-in-python