Part-of-speech tagging is the process of assigning unambiguous grammatical categories to words in context. The crux of the problem is that morphological analysis can often assign more than one part of speech to the surface form of a word. For example, in English the word "trap" can be either a singular noun ("a trap") or a verb ("I'll trap it").
This page intends to give an overview of how part-of-speech tagging works in Apertium, primarily within the apertium-tagger, but giving a short overview of constraints (as in constraint grammar) and restrictions (as in apertium-tagger) as well.
- See also: Morphological dictionaries
Consider the following sentence in Spanish ("She came to the beach"):
- Vino (noun or verb) a (pr) la (det or prn) playa (noun)
We can see that two of the four words are ambiguous: "vino", which can be a noun ("wine") or a verb ("came"), and "la", which can be a determiner ("the") or a pronoun ("her" or "it"). This gives the following possibilities for the disambiguated analysis of the sentence:
- noun, pr, det, noun → Wine to the beach
- verb, pr, det, noun → She came to the beach
- noun, pr, prn, noun → Wine to it beach
- verb, pr, prn, noun → She came to it beach
As can be seen, only one of these interpretations (verb, pr, det, noun) yields the correct translation. The task of part-of-speech tagging is therefore to select the correct interpretation. There are a number of ways of doing this, involving both linguistically motivated rules (such as constraint grammar and the Brill tagger) and statistical methods (such as the TnT tagger or the ACOPOST tagger).
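The enumeration of interpretations for the example sentence can be reproduced with a few lines of Python. This is only an illustrative sketch, not part of Apertium: the `sentence` structure and the `tag_sequences` helper are hypothetical names chosen here, and the candidate tags are copied from the example above.

```python
from itertools import product

# Candidate tags per token, copied from the Spanish example sentence.
# The data layout is an assumption made for this sketch.
sentence = [
    ("Vino", ["noun", "verb"]),
    ("a", ["pr"]),
    ("la", ["det", "prn"]),
    ("playa", ["noun"]),
]


def tag_sequences(tokens):
    """Return every possible tag sequence as a list of tuples."""
    return list(product(*(tags for _, tags in tokens)))


sequences = tag_sequences(sentence)
for seq in sequences:
    print(", ".join(seq))
```

The number of candidate sequences is the product of the number of analyses of each word (here 2 × 1 × 2 × 1 = 4), so it grows quickly with sentence length. A tagger has to pick one sequence without relying on a human to inspect them all.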
The tagger in Apertium (apertium-tagger) uses a combination of rules and a statistical (hidden Markov) model.
Before we explain what a hidden Markov model is, we need some preliminaries, namely definitions of what we mean by tagset and ambiguity class. The tagset (often shown as Γ) is the set of valid tags (parts of speech, etc.) to be used in the model, for example:
- '<noun>', '<verb>', '<adj>',
The ambiguity classes (denoted Σ) of a model are the set of possible ambiguities, for example between noun and verb, or between verb and adjective:
- 'noun|verb', 'det|prn', 'det|prn|verb',
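As a rough illustration of these two definitions, the sketch below derives a tagset and a set of ambiguity classes from a toy lexicon. The `lexicon` dictionary and the helper names are hypothetical and do not reflect how apertium-tagger stores its data. Only tag sets with more than one member are kept as ambiguity classes, following the examples above; that choice is an assumption of this sketch.

```python
# Hypothetical toy lexicon: surface form -> tags given by morphological analysis.
lexicon = {
    "vino": {"noun", "verb"},
    "a": {"pr"},
    "la": {"det", "prn"},
    "playa": {"noun"},
}


def tagset(lex):
    """Collect every tag that appears anywhere in the lexicon."""
    return set().union(*lex.values())


def ambiguity_classes(lex):
    """Collect the distinct sets of tags shared by ambiguous words.

    Each class is written as its tags sorted and joined with '|'.
    """
    return {"|".join(sorted(tags)) for tags in lex.values() if len(tags) > 1}


print(sorted(tagset(lexicon)))
print(sorted(ambiguity_classes(lexicon)))
```

Sorting the tags before joining them means that the same ambiguity is always written the same way, regardless of the order in which the analyses were listed.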
- Grammatical categories are also referred to as "parts of speech", e.g. noun, verb, adjective, adverb, conjunction, etc.