Part-of-speech tagging

Revision as of 13:26, 2 October 2008

Part-of-speech tagging is the process of assigning unambiguous grammatical categories[1] to words in context. The crux of the problem is that a surface form can often be assigned more than one part of speech by morphological analysis. For example, in English the word "trap" can be either a singular noun ("a trap") or a verb ("I'll trap it").

This page intends to give an overview of how part-of-speech tagging works in Apertium, primarily within the apertium-tagger, but giving a short overview of constraints (as in constraint grammar) and restrictions (as in apertium-tagger) as well.

Introduction

See also: Morphological dictionaries

Consider the following sentence in Spanish ("She came to the beach"):

Vino (noun or verb) a (pr) la (det or prn) playa (noun)

We can see that two out of the four words are ambiguous, "vino", which can be a noun ("wine") or verb ("came") and "la", which can be a determiner ("the") or a pronoun ("her" or "it"). This gives the following possibilities for the disambiguated analysis of the sentence:

noun, pr, det, noun → Wine to the beach
verb, pr, det, noun → She came to the beach
noun, pr, prn, noun → Wine to it beach
verb, pr, prn, noun → She came to it beach

Tag   Gloss
det   Determiner
noun  Noun
prn   Pronoun
pr    Preposition
verb  Verb
adj   Adjective
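The four candidate readings are simply the Cartesian product of each word's possible tags. A minimal sketch in Python (the per-word tag options are transcribed from the example above, not taken from any Apertium API):

```python
from itertools import product

# Possible tags for each word of "Vino a la playa", from the example above.
options = {
    "Vino": ["noun", "verb"],
    "a": ["pr"],
    "la": ["det", "prn"],
    "playa": ["noun"],
}

words = ["Vino", "a", "la", "playa"]

# Every combination of one tag per word is a candidate reading.
readings = list(product(*(options[w] for w in words)))
for r in readings:
    print(r)
```

Only one of the four sequences printed, ('verb', 'pr', 'det', 'noun'), corresponds to the correct translation; the tagger's job is to pick it.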

As can be seen, only one of these interpretations (verb, pr, det, noun) yields the correct translation, so the task of part-of-speech tagging is to select that interpretation. There are a number of ways of doing this, involving both linguistically motivated rules (as in constraint grammar and the Brill tagger) and statistical methods (such as the TnT tagger or the ACOPOST tagger).

The tagger in Apertium (apertium-tagger) uses a combination of rules and a statistical (hidden Markov) model.

Preliminaries

Before we explain what a hidden Markov model is, we need some preliminaries, that is, to define what we mean by tagset and ambiguity class. The tagset is the set of valid tags (parts of speech, etc.) to be used in the model, for example:

:'<noun>', '<verb>', '<adj>', …

The ambiguity classes (noted as <math>\Sigma</math>) of a model are the set of possible ambiguities, for example between noun and verb, or between determiner and pronoun, e.g.

:<math>\Sigma = \{ \sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|} \} = \{</math> 'noun|verb', 'det|prn', 'det|prn|verb', <math>\ldots \}~</math>
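One way to picture an ambiguity class in code is as the set of tags the analyser leaves open for a word; sorting the tags and joining them gives the 'noun|verb'-style names used above. This is an illustrative sketch, not how apertium-tagger stores classes internally:

```python
# Each word's analysis is the set of tags the morphological analyser returns;
# the ambiguity class is just that set (order does not matter).
analyses = {
    "vino": {"noun", "verb"},
    "la": {"det", "prn"},
    "una": {"det", "prn", "verb"},
    "a": {"pr"},
}

def ambiguity_class(tags):
    # A canonical name for the class: tags sorted and joined with '|'.
    return "|".join(sorted(tags))

print(ambiguity_class(analyses["vino"]))  # noun|verb
print(ambiguity_class(analyses["una"]))   # det|prn|verb
```

An unambiguous word such as "a" falls into a singleton class ('pr'), which is why unambiguous tags also appear as classes in the table further down.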

Hidden Markov models

A hidden Markov model is made up of two matrices, representing transition and emission probabilities, and a vector representing the initial probabilities of the model. This is often expressed as:

:<math>M = (A, B, \pi)~</math>

where <math>M</math> is the model, <math>A</math> is the matrix of transition probabilities, <math>B</math> is the matrix of emission probabilities and <math>\pi</math> is the vector of initial probabilities. These probabilities are calculated between the tag set and the ambiguity classes from a training set. This is referred to as parameter estimation.
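To make the triple M = (A, B, π) concrete, the model can be held as two row-stochastic matrices and a vector. The numbers below are invented purely for illustration, not estimated from any corpus:

```python
tags = ["det", "noun", "verb"]            # hidden states: the tag set
classes = ["det|prn", "noun|verb", "pr"]  # observations: ambiguity classes

# A[i][j]: probability that tags[j] follows tags[i] (each row sums to 1).
A = [[0.1, 0.8, 0.1],
     [0.3, 0.2, 0.5],
     [0.5, 0.4, 0.1]]

# B[i][k]: probability that tags[i] is observed as classes[k].
B = [[0.9, 0.0, 0.1],
     [0.1, 0.8, 0.1],
     [0.0, 0.9, 0.1]]

# pi[i]: probability that a sentence starts with tags[i].
pi = [0.4, 0.3, 0.3]

M = (A, B, pi)

# Sanity check: every distribution in the model sums to one.
for row in A + B + [pi]:
    assert abs(sum(row) - 1.0) < 1e-9
```

The "hidden" part is that a tagger only ever observes the ambiguity classes; the tag sequence that generated them must be inferred from A, B and π.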

Parameter estimation

Maximum likelihood

The easiest way to estimate the parameters of a hidden Markov model is to use maximum likelihood (ML). This method requires a pre-tagged corpus. We're going to make a very small training corpus so that we can train a model which can be used to disambiguate the example sentence above. The corpus is much smaller than would normally be used, but lets us demonstrate step by step how the model is constructed and used.

Untagged              | Analysed                                                  | Tagged
Vino a la playa       | Vino<verb>/<noun> a<pr> la<det>/<prn> playa<noun>         | Vino<verb> a<pr> la<det> playa<noun>
Voy a la casa         | Voy<verb> a<pr> la<det>/<prn> casa<noun>/<verb>           | Voy<verb> a<pr> la<det> casa<noun>
Bebe vino en casa     | Bebe<verb> vino<noun>/<verb> en<pr> casa<noun>/<verb>     | Bebe<verb> vino<noun> en<pr> casa<noun>
La casa es grande     | La<det>/<prn> casa<noun>/<verb> es<verb> grande<adj>      | La<det> casa<noun> es<verb> grande<adj>
Es una ciudad grande  | Es<verb> una<det>/<prn>/<verb> ciudad<noun> grande<adj>   | Es<verb> una<det> ciudad<noun> grande<adj>

In this corpus, the "untagged" text could come from anywhere; the "analysed" text is the result of passing it through a morphological analyser; and the "tagged" text is the analysed text manually disambiguated by one or more humans.
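Maximum-likelihood estimation over this corpus amounts to counting: sentence-initial tags give π, successive tag pairs give A, and (tag, ambiguity class) pairs give B, each count table then normalised into a probability distribution. A sketch, with the (class, tag) pairs transcribed from the table above:

```python
from collections import Counter, defaultdict

# Each sentence: (ambiguity_class, chosen_tag) pairs from the corpus table.
corpus = [
    [("noun|verb", "verb"), ("pr", "pr"), ("det|prn", "det"), ("noun", "noun")],     # Vino a la playa
    [("verb", "verb"), ("pr", "pr"), ("det|prn", "det"), ("noun|verb", "noun")],     # Voy a la casa
    [("verb", "verb"), ("noun|verb", "noun"), ("pr", "pr"), ("noun|verb", "noun")],  # Bebe vino en casa
    [("det|prn", "det"), ("noun|verb", "noun"), ("verb", "verb"), ("adj", "adj")],   # La casa es grande
    [("verb", "verb"), ("det|prn|verb", "det"), ("noun", "noun"), ("adj", "adj")],   # Es una ciudad grande
]

init = Counter()              # counts of sentence-initial tags
trans = defaultdict(Counter)  # trans[t1][t2]: how often t2 follows t1
emit = defaultdict(Counter)   # emit[tag][cls]: how often tag surfaces as cls

for sent in corpus:
    init[sent[0][1]] += 1
    for cls, tag in sent:
        emit[tag][cls] += 1
    for (_, t1), (_, t2) in zip(sent, sent[1:]):
        trans[t1][t2] += 1

def normalise(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

pi = normalise(init)
A = {t: normalise(c) for t, c in trans.items()}
B = {t: normalise(c) for t, c in emit.items()}

print(pi)       # 4 of 5 sentences start with a verb, 1 with a determiner
print(A["pr"])  # what tends to follow a preposition in this corpus
```

Even this tiny model already encodes the fact needed to disambiguate "Vino a la playa": sentences overwhelmingly start with a verb, which favours tagging "Vino" as <verb> rather than <noun>. (Real training sets are far larger, and apertium-tagger additionally smooths and can train on untagged text; this sketch shows plain ML counting only.)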

See also

Notes

  1. Also referred to as "parts-of-speech", e.g. Noun, Verb, Adjective, Adverb, Conjunction, etc.