Comparison of part-of-speech tagging systems
Apertium aims to have high-quality part-of-speech tagging, but in many cases it falls below the state of the art (around 97% tagging accuracy). This page collects a comparison of tagging systems in Apertium and gives some ideas of what could be done to improve them.
In the following table, Sentences and Tokens give the size of the evaluation corpus, accuracy figures are percentages, and the intervals represent the [low, high] values from 10-fold cross-validation (a sketch of how such intervals are computed follows the table).
| Language | Sentences | Tokens | 1st | CG+1st | Unigram | CG+Unigram | apertium-tagger | CG+apertium-tagger |
|---|---|---|---|---|---|---|---|---|
| Catalan | 1,413 | 24,144 | 81.85 | 83.96 | [75.65, 78.46] | [87.76, 90.48] | [94.16, 96.28] | [93.92, 96.16] |
| Spanish | 1,271 | 21,247 | 86.18 | 86.71 | [78.20, 80.06] | [87.72, 90.27] | [90.15, 94.86] | [91.84, 93.70] |
| Kazakh | 403 | 4,348 | 80.25 | 86.13 | [83.55, 86.19] | [83.33, 86.61] | n/a | n/a |
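As a hedged illustration of where the intervals come from, here is a minimal sketch: the corpus is split into ten folds, the tagger is evaluated once per held-out fold, and the lowest and highest per-fold accuracies are reported. The fold accuracies below are hypothetical placeholders, not figures from the table.

```python
# Minimal sketch: 10-fold cross-validation yields one accuracy per fold;
# the table reports [min, max] over the folds. Values here are made up.
fold_accuracies = [94.16, 95.02, 94.80, 95.51, 96.28,
                   95.10, 94.73, 95.90, 95.33, 94.95]  # one per fold, in %

low, high = min(fold_accuracies), max(fold_accuracies)
print(f"[{low:.2f}, {high:.2f}]")  # -> [94.16, 96.28]
```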
Todo
- Implement the averaged-perceptron tagger described at https://spacy.io/blog/part-of-speech-POS-tagger-in-python (see the sketch below).
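The linked post describes a greedy, left-to-right averaged-perceptron tagger. Below is a minimal sketch of that approach; the feature set is deliberately tiny, and the training loop uses gold tags as history (the post uses predicted tags), so treat both as simplifying assumptions rather than the post's exact code.

```python
import random
from collections import defaultdict

class AveragedPerceptron:
    """Greedy averaged perceptron in the style of the linked post.
    Final weights are averaged over all updates, which reduces overfitting."""

    def __init__(self):
        self.weights = defaultdict(lambda: defaultdict(float))  # feat -> tag -> w
        self._totals = defaultdict(float)  # (feat, tag) -> accumulated weight
        self._stamps = defaultdict(int)    # (feat, tag) -> time of last update
        self.i = 0                         # number of training tokens seen

    def predict(self, feats, classes):
        scores = defaultdict(float)
        for f in feats:
            for tag, w in self.weights[f].items():
                scores[tag] += w
        return max(classes, key=lambda t: (scores[t], t))  # tie-break on tag

    def update(self, truth, guess, feats):
        self.i += 1
        if truth == guess:
            return
        for f in feats:
            for tag, delta in ((truth, 1.0), (guess, -1.0)):
                key = (f, tag)
                # lazy averaging: account for the steps this weight sat unchanged
                self._totals[key] += (self.i - self._stamps[key]) * self.weights[f][tag]
                self._stamps[key] = self.i
                self.weights[f][tag] += delta

    def average(self):
        for f, tag_ws in self.weights.items():
            for tag, w in tag_ws.items():
                total = self._totals[(f, tag)] + (self.i - self._stamps[(f, tag)]) * w
                tag_ws[tag] = total / self.i

def features(words, i, prev_tag):
    # A deliberately tiny feature set; the post's is considerably richer.
    w = words[i]
    return ("bias", f"word={w.lower()}", f"suffix3={w[-3:].lower()}",
            f"prev_tag={prev_tag}",
            f"prev_word={words[i - 1].lower() if i > 0 else '<s>'}")

def train(tagged_sents, n_iter=5):
    # tagged_sents: list of sentences, each a list of (word, tag) pairs
    classes = sorted({t for sent in tagged_sents for _, t in sent})
    model = AveragedPerceptron()
    for _ in range(n_iter):
        random.shuffle(tagged_sents)
        for sent in tagged_sents:
            words = [w for w, _ in sent]
            prev = "<s>"
            for i, (_, gold) in enumerate(sent):
                feats = features(words, i, prev)
                guess = model.predict(feats, classes)
                model.update(gold, guess, feats)
                prev = gold  # gold history during training (a simplification)
    model.average()
    return model, classes
```

Training this on Apertium's tagged corpora and adding the tag-dictionary lookups the post describes would be the natural next steps before comparing it against the systems in the table above.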