Difference between revisions of "Comparison of part-of-speech tagging systems"

Revision as of 10:47, 25 December 2015

Language	Corpus			System
Language	Sent	Tok	Amb	1st	CG+1st	Unigram	CG+Unigram	apertium-tagger	CG+apertium-tagger
Catalan	1,413	24,144	?	81.85	83.96	[75.65, 78.46]	[87.76, 90.48]	[94.16, 96.28]	[93.92, 96.16]
Spanish	1,271	21,247	?	86.18	86.71	[78.20, 80.06]	[87.72, 90.27]	[90.15, 94.86]	[91.84, 93.70]
Serbo-Croatian	1,190	20,128	?	75.22	79.67	[75.36, 78.79]	[75.36, 77.28]
Russian	451	10,171	?	75.63	79.52	[70.49, 72.94]	[74.68, 78.65]	n/a	n/a
Kazakh	403	4,348	?	80.25	86.13	[83.55, 86.19]	[83.33, 86.61]	n/a	n/a

Sent = sentences, Tok = tokens, Amb = average ambiguity from the morphological analyser (? marks a value that has not been filled in yet)

Systems

1st: Selects the first analysis returned by the morphological analyser.
CG: Preprocesses the input with the Constraint Grammar (CG) from the monolingual language package in languages. In the CG+ columns, the CG output is then passed to the named system.
Unigram: Lexicalised unigram tagger.
apertium-tagger: Uses the bigram HMM tagger included with Apertium.
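
To make the two simplest systems concrete, here is a minimal Python sketch of the 1st baseline and of one possible lexicalised unigram tagger. It is an illustration under assumptions, not the code used for the table: the function names, the lowercasing of surface forms, the tie-breaking rule and the toy analyses are all hypothetical, and the page does not specify the actual unigram model.

```python
from collections import Counter, defaultdict


def first_analysis(candidates):
    """1st baseline: return whichever analysis the analyser lists first."""
    return candidates[0]


def train_unigram(tagged_tokens):
    """Count how often each surface form carries each analysis.

    tagged_tokens is an iterable of (surface_form, analysis) pairs taken
    from a hand-tagged corpus (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for form, analysis in tagged_tokens:
        counts[form.lower()][analysis] += 1
    return counts


def unigram_tag(form, candidates, counts):
    """Pick the candidate seen most often with this surface form.

    max() keeps the earliest candidate on ties, so a form that never
    occurred in training falls back to the 1st baseline.
    """
    seen = counts.get(form.lower(), {})
    return max(candidates, key=lambda analysis: seen.get(analysis, 0))


# Toy training data (invented for illustration only).
train = [
    ("vino", "vino<n><m><sg>"),
    ("vino", "venir<vblex><ifi><p3><sg>"),
    ("vino", "vino<n><m><sg>"),
]
counts = train_unigram(train)

candidates = ["venir<vblex><ifi><p3><sg>", "vino<n><m><sg>"]
unseen = ["casa<n><f><sg>", "casar<vblex><pri><p3><sg>"]

print(first_analysis(candidates))
print(unigram_tag("vino", candidates, counts))
print(unigram_tag("casa", unseen, counts))
```

In this sketch the two systems disagree on "vino": the 1st baseline takes the verb reading because it is listed first, while the unigram tagger prefers the noun reading because it is more frequent in the toy training data.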

Corpora

The tagged corpora used in the experiments are found in the monolingual packages in languages, under the texts/ subdirectory.

Todo

Implement this tagger: https://spacy.io/blog/part-of-speech-POS-tagger-in-python

@@ Line 6: / Line 6: @@
 {|class=wikitable
-!rowspan=2|Language !!colspan=2|Corpus !!colspan=6|System
+!rowspan=2|Language !!colspan=3|Corpus !!colspan=6|System
 |-
-                 ! Sent !! Tok  !! 1st !! CG+1st !! Unigram        || CG+Unigram      || apertium-tagger || CG+apertium-tagger
+                 ! Sent !! Tok !! Amb !! 1st !! CG+1st !! Unigram        || CG+Unigram      || apertium-tagger || CG+apertium-tagger
 |-
-| Catalan        || 1,413 || 24,144 || 81.85 || 83.96 || [75.65, 78.46]|| [87.76, 90.48] || [94.16, 96.28] || [93.92, 96.16]
+| Catalan        || 1,413 || 24,144 || ? || 81.85 || 83.96 || [75.65, 78.46]|| [87.76, 90.48] || [94.16, 96.28] || [93.92, 96.16]
 |-
-| Spanish        || 1,271 || 21,247 || 86.18 || 86.71 || [78.20, 80.06] || [87.72, 90.27] || [90.15, 94.86] || [91.84, 93.70]
+| Spanish        || 1,271 || 21,247 || ?|| 86.18 || 86.71 || [78.20, 80.06] || [87.72, 90.27] || [90.15, 94.86] || [91.84, 93.70]
 |-
-| Serbo-Croatian || 1,190 || 20,128 || 75.22 || 79.67 || [75.36, 78.79] || [75.36, 77.28] || ||
+| Serbo-Croatian || 1,190 || 20,128 || ?|| 75.22 || 79.67 || [75.36, 78.79] || [75.36, 77.28] || ||
 |-
-| Russian        || 451 || 10,171 ||  75.63   ||   79.52    || [70.49, 72.94] || [74.68, 78.65] || n/a          || n/a
+| Russian        || 451 || 10,171 || ?||  75.63   ||   79.52    || [70.49, 72.94] || [74.68, 78.65] || n/a          || n/a
 |-
-| Kazakh         || 403 || 4,348 || 80.25 || 86.13 || [83.55, 86.19] || [83.33, 86.61] || n/a            || n/a
+| Kazakh         || 403 || 4,348 || ? || 80.25 || 86.13 || [83.55, 86.19] || [83.33, 86.61] || n/a            || n/a
 |-
 |}
+Sent = sentences, Tok = tokens, Amb = average ambiguity from the morphological analyser
 ==Systems==
