Talk:Part-of-speech tagging
Hidden Markov models
A hidden Markov model (HMM) is a statistical model that consists of a number of hidden states and a number of observable states. The hidden states correspond to the "correct" set of tags for a given ambiguous sentence; in the above example, these would be verb, pr, det, noun.
Ambiguity classes
In the apertium-tagger, and indeed in many HMM-based part-of-speech taggers, the set of observable states corresponds to a set of ambiguity classes. The ambiguity classes of a model are the set of possible ambiguities (often denoted with ). In the above example, these would be (noun | verb) and (det | prn). The preposition "a" and the noun "playa" are unambiguous and therefore don't belong to an ambiguity class. The ambiguity classes are calculated automatically from the corpus used for training.
Lexical model
Syntactic model
Training
Preparation
- Corpora types
Untagged | Analysed | Tagged |
---|---|---|
Vino a la playa | Vino<verb>/<noun> a<pr> la<det>/<prn> playa<noun> | Vino<verb> a<pr> la<det> playa<noun> |
Voy a la casa | Voy<verb> a<pr> la<det>/<prn> casa<noun>/<verb> | Voy<verb> a<pr> la<det> casa<noun> |
Bebe vino en casa | Bebe<verb> vino<noun>/<verb> en<pr> casa<noun>/<verb> | Bebe<verb> vino<noun> en<pr> casa<noun> |
La casa es grande | La<det>/<prn> casa<noun>/<verb> es<verb> grande<adj> | La<det> casa<noun> es<verb> grande<adj> |
Es una ciudad grande | Es<verb> una<det>/<prn>/<verb> ciudad<noun> grande<adj> | Es<verb> una<det> ciudad<noun> grande<adj> |
- Ambiguity classes
  - verb / noun
  - det / prn
  - det / prn / verb
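The ambiguity classes listed above can be derived mechanically from the analysed corpus. The following is a minimal sketch, assuming the simplified `word<tag>/<tag>` notation used in the table (not the real Apertium stream format); the names `ANALYSED` and `ambiguity_classes` are hypothetical and are not part of apertium-tagger.

```python
import re

# Analysed sentences copied from the "Corpora types" table.
# Assumption: the simplified word<tag>/<tag> notation of this page.
ANALYSED = [
    "Vino<verb>/<noun> a<pr> la<det>/<prn> playa<noun>",
    "Voy<verb> a<pr> la<det>/<prn> casa<noun>/<verb>",
    "Bebe<verb> vino<noun>/<verb> en<pr> casa<noun>/<verb>",
    "La<det>/<prn> casa<noun>/<verb> es<verb> grande<adj>",
    "Es<verb> una<det>/<prn>/<verb> ciudad<noun> grande<adj>",
]

TAG = re.compile(r"<([^>]+)>")


def ambiguity_classes(sentences):
    """Collect every set of tags that contains more than one tag."""
    classes = set()
    for sentence in sentences:
        for token in sentence.split():
            tags = frozenset(TAG.findall(token))
            if len(tags) > 1:  # unambiguous words are skipped
                classes.add(tags)
    return classes


if __name__ == "__main__":
    for cls in sorted(ambiguity_classes(ANALYSED), key=lambda c: (len(c), sorted(c))):
        print(" / ".join(sorted(cls)))
```

Tag sets are stored as `frozenset` values so that `verb / noun` and `noun / verb` count as the same class.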
- Transition counts
From the tagged examples we can extract the following transition counts:
First tag \ Second tag | verb | noun | det | prn | pr | adj |
---|---|---|---|---|---|---|
verb | 0 | 1 | 1 | 0 | 2 | 1 |
noun | 1 | 0 | 0 | 0 | 1 | 1 |
det | 0 | 4 | 0 | 0 | 0 | 0 |
prn | 0 | 0 | 0 | 0 | 0 | 0 |
pr | 0 | 1 | 2 | 0 | 0 | 0 |
adj | 0 | 0 | 0 | 0 | 0 | 0 |
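As a rough illustration of how the transition counts above are obtained, the sketch below counts adjacent tag pairs in the five tagged sentences. It assumes, as the table does, that transitions across sentence boundaries are not counted; `TAGGED` and `transition_counts` are hypothetical names chosen for this example.

```python
from collections import Counter

# Tag sequences of the five tagged sentences in the corpus table.
TAGGED = [
    ["verb", "pr", "det", "noun"],   # Vino a la playa
    ["verb", "pr", "det", "noun"],   # Voy a la casa
    ["verb", "noun", "pr", "noun"],  # Bebe vino en casa
    ["det", "noun", "verb", "adj"],  # La casa es grande
    ["verb", "det", "noun", "adj"],  # Es una ciudad grande
]


def transition_counts(tag_sequences):
    """Count (first tag, second tag) pairs within each sentence."""
    counts = Counter()
    for tags in tag_sequences:
        counts.update(zip(tags, tags[1:]))
    return counts


if __name__ == "__main__":
    for (first, second), n in sorted(transition_counts(TAGGED).items()):
        print(f"{first} -> {second}: {n}")
```

Pairs that never occur, such as any transition out of `prn`, are simply absent from the `Counter` and read back as 0.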
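The same tagged examples give the following counts of how often each word appears with each part of speech:
Word \ Part-of-speech | verb | noun | det | prn | pr | adj |
---|---|---|---|---|---|---|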
vino | 1 | 1 | 0 | 0 | 0 | 0 |
a | 0 | 0 | 0 | 0 | 2 | 0 |
la | 0 | 0 | 3 | 0 | 0 | 0 |
playa | 0 | 1 | 0 | 0 | 0 | 0 |
voy | 1 | 0 | 0 | 0 | 0 | 0 |
casa | 0 | 3 | 0 | 0 | 0 | 0 |
es | 2 | 0 | 0 | 0 | 0 | 0 |
grande | 0 | 0 | 0 | 0 | 0 | 2 |
una | 0 | 0 | 1 | 0 | 0 | 0 |
ciudad | 0 | 1 | 0 | 0 | 0 | 0 |
bebe | 1 | 0 | 0 | 0 | 0 | 0 |
en | 0 | 0 | 0 | 0 | 1 | 0 |
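The word/tag counts above can be reproduced with a similar sketch. It assumes words are lower-cased before counting, as in the table, and it counts surface words rather than ambiguity classes purely to mirror the table; `TAGGED_WORDS` and `emission_counts` are hypothetical names.

```python
from collections import Counter

# (word, tag) pairs of the five tagged sentences in the corpus table.
TAGGED_WORDS = [
    [("Vino", "verb"), ("a", "pr"), ("la", "det"), ("playa", "noun")],
    [("Voy", "verb"), ("a", "pr"), ("la", "det"), ("casa", "noun")],
    [("Bebe", "verb"), ("vino", "noun"), ("en", "pr"), ("casa", "noun")],
    [("La", "det"), ("casa", "noun"), ("es", "verb"), ("grande", "adj")],
    [("Es", "verb"), ("una", "det"), ("ciudad", "noun"), ("grande", "adj")],
]


def emission_counts(sentences):
    """Count how often each lower-cased word carries each tag."""
    counts = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            counts[(word.lower(), tag)] += 1
    return counts


if __name__ == "__main__":
    for (word, tag), n in sorted(emission_counts(TAGGED_WORDS).items()):
        print(f"{word} <{tag}>: {n}")
```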
Parameter estimation
The apertium-tagger has two options for training (that is, estimating the parameters of) an HMM. Which one to use depends on whether a pre-disambiguated corpus is available. The maximum-likelihood estimation (ML) algorithm requires a pre-tagged corpus.
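To make the idea concrete, here is a minimal sketch of maximum-likelihood estimation as plain relative frequencies, applied to the transition counts from the example corpus: each count is divided by the total for its first tag. This is only an illustration of the principle; it applies no smoothing and is not claimed to match what apertium-tagger does internally. `TRANSITIONS` and `mle` are hypothetical names.

```python
# Non-zero transition counts from the example corpus (first tag -> second tag).
TRANSITIONS = {
    "verb": {"noun": 1, "det": 1, "pr": 2, "adj": 1},
    "noun": {"verb": 1, "pr": 1, "adj": 1},
    "det": {"noun": 4},
    "pr": {"noun": 1, "det": 2},
}


def mle(counts):
    """Estimate P(second | first) as count(first, second) / count(first, *)."""
    probs = {}
    for first, row in counts.items():
        total = sum(row.values())
        probs[first] = {second: n / total for second, n in row.items()}
    return probs


if __name__ == "__main__":
    for first, row in mle(TRANSITIONS).items():
        for second, p in row.items():
            print(f"P({second} | {first}) = {p:.2f}")
```

Tags that are never followed by anything in the corpus (`prn` and `adj` here) have no row, since the relative frequency is undefined for them without some form of smoothing.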