Comparison of part-of-speech tagging systems
Apertium aims for high-quality part-of-speech tagging, but in many cases its taggers fall below the state of the art (around 97% tagging accuracy). This page collects a comparison of the tagging systems in Apertium and gives some ideas of what could be done to improve them.
In the following two tables, values of the form x±y are the sample mean and standard deviation of the results of 10-fold cross-validation.
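The x±y summary statistics can be reproduced from per-fold results. A minimal sketch, assuming percentage accuracies (the fold values below are invented, not taken from the experiments):

```python
import statistics

def summarise_folds(fold_accuracies):
    """Return (sample mean, sample standard deviation) over CV folds."""
    mean = statistics.mean(fold_accuracies)
    # statistics.stdev divides by n-1, i.e. the sample standard deviation
    std = statistics.stdev(fold_accuracies)
    return mean, std

# Hypothetical per-fold tagger accuracies from one 10-fold run
folds = [93.1, 94.0, 92.8, 93.5, 94.2, 93.0, 93.7, 92.9, 94.1, 93.4]
mean, std = summarise_folds(folds)
print(f"{mean:.2f}±{std:.2f}")
```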
In the following table the values represent tagger recall (= [true positives]/[total tokens]):
| System | Catalan | Spanish | Serbo-Croatian | Russian | Kazakh | Portuguese | Swedish |
|---|---|---|---|---|---|---|---|
| Tokens | 23,673 | 20,487 | 20,128 | 1,052 | 13,714 | 6,725 | 369 |
| 1st | 86.50 | 90.34 | 38.19 | 72.08 | 76.70 | 34.70 | |
| Bigram (unsup, 0 iters) | 88.96±1.12 | 88.49±1.54 | | | 81.41±5.78 | | |
| Bigram (unsup, 50 iters) | 91.74±1.15 | 91.13±1.52 | | | 81.09±5.99 | | |
| Bigram (unsup, 250 iters) | 91.51±1.16 | 90.85±1.48 | | | 80.31±6.60 | | |
| Lwsw (0 iters) | 92.73±0.89 | 92.86±0.95 | | | 83.01±5.47 | | |
| Lwsw (50 iters) | 92.98±0.85 | 93.01±1.02 | | | 82.70±5.76 | | |
| Lwsw (250 iters) | 92.99±0.84 | 93.06±1.02 | | | 82.75±5.79 | | |
| CG→1st | 88.05 | 91.10 | 39.81 | 81.56 | 87.99 | 42.90 | |
| CG→Bigram (unsup, 0 iters) | 91.83±1.03 | 91.39±1.42 | | | 86.77±6.33 | | |
| CG→Bigram (unsup, 50 iters) | 93.16±1.39 | 92.53±1.29 | | | 87.48±6.16 | | |
| CG→Bigram (unsup, 250 iters) | 92.99±1.38 | 92.50±1.23 | | | 87.20±6.72 | | |
| CG→Lwsw (0 iters) | 93.17±1.08 | 92.72±1.09 | | | 86.60±6.20 | | |
| CG→Lwsw (50 iters) | 93.37±1.02 | 92.74±1.16 | | | 86.54±6.21 | | |
| CG→Lwsw (250 iters) | 93.38±1.05 | 92.77±1.18 | | | 86.54±6.20 | | |
| Unigram model 1 | 93.86±1.13 | 93.96±0.98 | 39.11±8.91 | 80.63±3.87 | 86.00±6.63 | 46.48±5.78 | |
| Unigram model 2 | 93.90±1.09 | 93.69±0.94 | 40.36±8.59 | 82.19±3.70 | 87.13±6.23 | 47.12±8.29 | |
| Unigram model 3 | 93.88±1.08 | 93.67±0.94 | 40.36±8.59 | 82.45±3.80 | 87.11±6.13 | 47.12±8.29 | |
| Bigram (sup) | 96.00±0.87 | 95.47±1.07 | | | 88.07±6.50 | | |
| CG→Unigram model 1 | 94.34±1.11 | 94.73±0.88 | 40.71±9.39 | 84.54±3.29 | 88.42±6.55 | 46.84±5.48 | |
| CG→Unigram model 2 | 94.11±1.09 | 94.33±0.82 | 41.43±9.21 | 84.62±3.47 | 88.64±6.13 | 47.07±7.39 | |
| CG→Unigram model 3 | 94.09±1.08 | 94.31±0.81 | 41.43±9.21 | 84.71±3.54 | 88.63±6.07 | 47.07±7.39 | |
| CG→Bigram (sup) | 96.00±1.13 | 94.88±1.18 | | | 88.73±6.36 | | |
In the following table the values represent availability-adjusted tagger recall (= [true positives]/[words with a correct analysis from the morphological analyser]). This data is also available in box plot form.
| System | Catalan | Spanish | Serbo-Croatian | Russian | Kazakh | Portuguese | Swedish |
|---|---|---|---|---|---|---|---|
| Tokens | 23,673 | 20,487 | 20,128 | 1,052 | 13,714 | 6,725 | 369 |
| 1st | 87.86 | 91.82 | 75.93 | 77.72 | 83.00 | 64.47 | |
| Bigram (unsup, 0 iters) | 90.35±1.17 | 89.95±1.45 | | | 89.72±2.06 | | |
| Bigram (unsup, 50 iters) | 93.17±1.21 | 92.63±1.40 | | | 89.35±1.99 | | |
| Bigram (unsup, 250 iters) | 92.94±1.22 | 92.35±1.33 | | | 88.45±2.51 | | |
| Lwsw (0 iters) | 94.18±0.91 | 94.40±0.77 | | | 91.51±1.22 | | |
| Lwsw (50 iters) | 94.44±0.81 | 94.54±0.83 | | | 91.14±1.62 | | |
| Lwsw (250 iters) | 94.44±0.79 | 94.60±0.84 | | | 91.20±1.64 | | |
| CG→1st | 89.44 | 92.60 | 79.10 | 87.95 | 95.22 | 79.70 | |
| CG→Bigram (unsup, 0 iters) | 93.27±1.10 | 92.90±1.30 | | | 95.61±1.77 | | |
| CG→Bigram (unsup, 50 iters) | 94.62±1.49 | 94.05±1.13 | | | 96.41±1.38 | | |
| CG→Bigram (unsup, 250 iters) | 94.45±1.48 | 94.03±1.09 | | | 96.06±2.05 | | |
| CG→Lwsw (0 iters) | 94.63±1.08 | 94.25±0.91 | | | 95.43±1.52 | | |
| CG→Lwsw (50 iters) | 94.83±1.01 | 94.27±0.97 | | | 95.36±1.54 | | |
| CG→Lwsw (250 iters) | 94.84±1.03 | 94.30±0.99 | | | 95.36±1.53 | | |
| Unigram model 1 | 95.33±1.05 | 95.51±0.84 | 77.54±6.51 | 87.03±3.03 | 94.74±2.44 | 89.26±7.32 | |
| Unigram model 2 | 95.37±1.04 | 95.23±0.77 | 80.06±6.11 | 88.72±2.76 | 96.01±1.70 | 89.82±7.70 | |
| Unigram model 3 | 95.35±1.03 | 95.22±0.79 | 80.06±6.11 | 88.99±2.83 | 95.99±1.52 | 89.82±7.70 | |
| Bigram (sup) | 97.50±0.93 | 97.04±0.86 | | | 97.03±1.75 | | |
| CG→Unigram model 1 | 95.82±1.06 | 96.30±0.68 | 80.56±6.70 | 91.25±2.01 | 97.42±1.76 | 90.00±6.99 | |
| CG→Unigram model 2 | 95.58±1.07 | 95.89±0.59 | 82.06±6.50 | 91.33±2.15 | 97.70±1.32 | 89.97±7.50 | |
| CG→Unigram model 3 | 95.56±1.05 | 95.86±0.60 | 82.06±6.50 | 91.43±2.26 | 97.69±1.28 | 89.97±7.50 | |
| CG→Bigram (sup) | 97.51±1.21 | 96.45±0.93 | | | 97.78±1.52 | |
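The two recall measures can be computed from per-token results. A minimal sketch, assuming each token record carries its gold tag, the tagger's prediction, and a flag saying whether the analyser offered the correct analysis at all (the record format is hypothetical, not the actual evaluation script):

```python
def recall(tokens):
    """Plain recall: true positives over all tokens."""
    correct = sum(1 for t in tokens if t["predicted"] == t["gold"])
    return correct / len(tokens)

def availability_adjusted_recall(tokens):
    """True positives over only those tokens for which the
    morphological analyser produced the correct analysis."""
    available = [t for t in tokens if t["analyser_has_gold"]]
    correct = sum(1 for t in available if t["predicted"] == t["gold"])
    return correct / len(available)

# Tiny made-up example: four tokens, one of which the analyser
# cannot get right no matter what the tagger picks
tokens = [
    {"gold": "n",   "predicted": "n",   "analyser_has_gold": True},
    {"gold": "v",   "predicted": "n",   "analyser_has_gold": True},
    {"gold": "adj", "predicted": "adj", "analyser_has_gold": True},
    {"gold": "adv", "predicted": "n",   "analyser_has_gold": False},
]
print(recall(tokens))                        # 2/4
print(availability_adjusted_recall(tokens))  # 2/3
```

This is why the availability-adjusted numbers are uniformly higher: tokens the analyser cannot get right are excluded from the denominator.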
In the following table, the intervals represent the [low, high] values from 10-fold cross validation.
| Language | Sent | Tok | Amb | 1st | CG+1st | Unigram | CG+Unigram | apertium-tagger | CG+apertium-tagger |
|---|---|---|---|---|---|---|---|---|---|
| Catalan | 1,413 | 24,144 | ? | 81.85 | 83.96 | [75.65, 78.46] | [87.76, 90.48] | [94.16, 96.28] | [93.92, 96.16] |
| Spanish | 1,271 | 21,247 | ? | 86.18 | 86.71 | [78.20, 80.06] | [87.72, 90.27] | [90.15, 94.86] | [91.84, 93.70] |
| Serbo-Croatian | 1,190 | 20,128 | ? | 75.22 | 79.67 | [75.36, 78.79] | [75.36, 77.28] | | |
| Russian | 451 | 10,171 | ? | 75.63 | 79.52 | [70.49, 72.94] | [74.68, 78.65] | n/a | n/a |
| Kazakh | 403 | 4,348 | ? | 80.79 | 86.19 | [84.36, 87.79] | [85.56, 88.72] | n/a | n/a |
| Portuguese | 119 | 3,823 | ? | 72.54 | 87.34 | [77.10, 87.72] | [84.05, 91.96] | | |
| Swedish | 11 | 239 | ? | 72.90 | 73.86 | [56.00, 82.97] | | | |
Sent = sentences, Tok = tokens, Amb = average ambiguity from the morphological analyser
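The "?" entries in the Amb column could be filled in by counting analyses per token in the analyser output. A minimal sketch over a hypothetical list of per-token analysis lists (the analysis strings are illustrative):

```python
def average_ambiguity(analyses_per_token):
    """Mean number of morphological analyses per token."""
    return sum(len(a) for a in analyses_per_token) / len(analyses_per_token)

# Hypothetical analyser output for four tokens
tokens = [
    ["la<det><def><f><sg>", "la<prn><pro><p3><f><sg>"],
    ["casa<n><f><sg>", "casar<vblex><pri><p3><sg>"],
    ["vella<adj><f><sg>"],
    [".<sent>"],
]
print(average_ambiguity(tokens))  # (2+2+1+1)/4 = 1.5
```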
Systems
1st
: Selects the first analysis from the morphological analyser.

CG
: Uses the Constraint Grammar (from the monolingual language package in languages) to preprocess the input.

Unigram
: Lexicalised unigram tagger.

apertium-tagger
: Uses the bigram HMM tagger included with Apertium.
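As a rough illustration of the Unigram system (the three numbered models differ in how they smooth and handle unseen words; this sketch only shows the basic lexicalised idea, with a made-up backoff to the overall most frequent tag):

```python
from collections import Counter, defaultdict

class UnigramTagger:
    """Pick the tag seen most often with each wordform in training,
    backing off to the overall most frequent tag for unseen words."""

    def train(self, tagged_tokens):
        by_word = defaultdict(Counter)
        all_tags = Counter()
        for word, tag in tagged_tokens:
            by_word[word][tag] += 1
            all_tags[tag] += 1
        self.best = {w: c.most_common(1)[0][0] for w, c in by_word.items()}
        self.default = all_tags.most_common(1)[0][0]

    def tag(self, word):
        return self.best.get(word, self.default)

t = UnigramTagger()
t.train([("la", "det"), ("casa", "n"), ("la", "det"),
         ("la", "pron"), ("vella", "adj")])
print(t.tag("la"))    # most frequent tag seen with "la"
print(t.tag("nova"))  # unseen word: overall most frequent tag
```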
Corpora
The tagged corpora used in the experiments are found in the monolingual packages in languages, under the texts/ subdirectory.
Todo
- Implement this tagger: https://spacy.io/blog/part-of-speech-POS-tagger-in-python
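The linked post describes an averaged perceptron tagger. A minimal sketch of the core idea, much simplified relative to the post (tiny feature set, no weight averaging; the feature names and training data are invented):

```python
from collections import defaultdict

class PerceptronTagger:
    """Tiny multiclass perceptron: score tags by summed feature weights,
    bump weights toward the gold tag on each mistake."""

    def __init__(self, tags):
        self.tags = tags
        self.weights = defaultdict(lambda: defaultdict(float))

    def features(self, words, i):
        # Very small feature set: the word, its suffix, the previous word
        prev = words[i - 1] if i > 0 else "<s>"
        return [f"w={words[i]}", f"suf={words[i][-2:]}", f"prev={prev}"]

    def predict(self, feats):
        scores = {t: sum(self.weights[f][t] for f in feats) for t in self.tags}
        return max(self.tags, key=lambda t: scores[t])

    def train(self, sentences, epochs=5):
        for _ in range(epochs):
            for words, tags in sentences:
                for i, gold in enumerate(tags):
                    feats = self.features(words, i)
                    guess = self.predict(feats)
                    if guess != gold:
                        for f in feats:
                            self.weights[f][gold] += 1.0
                            self.weights[f][guess] -= 1.0

tagger = PerceptronTagger(["det", "n", "v"])
train_data = [
    (["the", "cat", "sleeps"], ["det", "n", "v"]),
    (["the", "dog", "runs"], ["det", "n", "v"]),
]
tagger.train(train_data)
print(tagger.predict(tagger.features(["the", "cat", "sleeps"], 1)))
```

The full version in the post adds richer features, weight averaging over updates, and greedy left-to-right decoding using the previous predicted tags as features.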