Difference between revisions of "Part-of-speech tagging"

From Apertium

Revision as of 09:57, 16 September 2008

Part-of-speech tagging is the process of assigning unambiguous grammatical categories[1] to words in context. The crux of the problem is that morphological analysis can often assign more than one part of speech to the surface form of a word. For example, in English the word "trap" can be either a singular noun ("a trap") or a verb ("I'll trap it").

This page gives an overview of how part-of-speech tagging works in Apertium, primarily within the apertium-tagger, along with a short overview of constraints (as in constraint grammar) and restrictions (as in apertium-tagger).

Introduction

Consider the following sentence in Spanish ("She came to the beach"):

Vino (noun or verb) a (pr) la (det or prn) playa (noun)

Two of the four words are ambiguous: "vino", which can be a noun ("wine") or a verb ("came"), and "la", which can be a determiner ("the") or a pronoun ("her" or "it"). This gives the following possibilities for the disambiguated analysis of the sentence:

noun, pr, det, noun → Wine to the beach
verb, pr, det, noun → She came to the beach
noun, pr, prn, noun → Wine to it beach
verb, pr, prn, noun → She came to it beach

Tag     Gloss
det     Determiner
noun    Noun
prn     Pronoun
pr      Preposition
verb    Verb
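
The candidate analyses listed above are simply every combination of the per-word analyses. A minimal Python sketch of that enumeration follows; the function name candidate_taggings and the list layout are assumptions made for illustration, not part of Apertium.

```python
from itertools import product

# Analyses per word, copied from the example sentence above.
analyses = [
    ("Vino", ["noun", "verb"]),
    ("a", ["pr"]),
    ("la", ["det", "prn"]),
    ("playa", ["noun"]),
]


def candidate_taggings(analyses):
    """Return every tag sequence allowed by the per-word analyses.

    Hypothetical helper: it takes the Cartesian product of the tag
    lists, so the count is the product of the list lengths.
    """
    tag_options = [tags for _word, tags in analyses]
    return [list(seq) for seq in product(*tag_options)]


candidates = candidate_taggings(analyses)
for seq in candidates:
    print(", ".join(seq))
```

With two ambiguous words of two analyses each, this gives 2 × 2 = 4 sequences. The count grows multiplicatively with every ambiguous word, which is why a tagger needs something better than listing all candidates by hand.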

As can be seen, only one of these interpretations (verb, pr, det, noun) yields the correct translation. The task of part-of-speech tagging is to select the correct interpretation. There are a number of ways of doing this, some based on linguistically motivated rules (such as constraint grammar and the Brill tagger) and others on statistics (such as the TnT tagger or the ACOPOST tagger).

The tagger in Apertium (apertium-tagger) uses a combination of rules and a statistical (hidden Markov) model.

Hidden Markov models

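A hidden Markov model (HMM) is a statistical model which consists of a number of hidden states and a number of observable states. In part-of-speech tagging, the hidden states correspond to the tags, which cannot be observed directly and must be inferred from the observable input.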
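
To make the idea concrete, the sketch below scores the four candidate tag sequences for the Spanish example with a toy HMM in which tags are the hidden states and words are the observations. All probabilities are invented for illustration, the names (transition, emission, sequence_probability) are hypothetical, and this is not how apertium-tagger stores or estimates its model.

```python
# Toy first-order HMM. Every number below is an invented,
# illustrative value, not a probability from a trained model.
START = "<s>"

# P(tag | previous tag)
transition = {
    (START, "noun"): 0.4, (START, "verb"): 0.3,
    ("noun", "pr"): 0.3, ("verb", "pr"): 0.4,
    ("pr", "det"): 0.6, ("pr", "prn"): 0.1,
    ("det", "noun"): 0.7, ("prn", "noun"): 0.05,
}

# P(word | tag)
emission = {
    ("noun", "vino"): 0.01, ("verb", "vino"): 0.02,
    ("pr", "a"): 0.3,
    ("det", "la"): 0.2, ("prn", "la"): 0.1,
    ("noun", "playa"): 0.01,
}


def sequence_probability(words, tags):
    """Joint probability of a word sequence and one tag sequence."""
    prob = 1.0
    prev = START
    for word, tag in zip(words, tags):
        # Unseen transitions or emissions count as probability 0.
        prob *= transition.get((prev, tag), 0.0)
        prob *= emission.get((tag, word), 0.0)
        prev = tag
    return prob


words = ["vino", "a", "la", "playa"]
candidates = [
    ["noun", "pr", "det", "noun"],
    ["verb", "pr", "det", "noun"],
    ["noun", "pr", "prn", "noun"],
    ["verb", "pr", "prn", "noun"],
]
best = max(candidates, key=lambda tags: sequence_probability(words, tags))
print(best)
```

With these made-up numbers the highest-scoring sequence is verb, pr, det, noun. Scoring every candidate like this is only workable for very short sentences.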

Ambiguity classes

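In the apertium-tagger, and indeed in many HMM-based part-of-speech taggers, the set of observable states corresponds to a set of ambiguity classes. These are simply the sets of tags that a word can receive from morphological analysis; "vino" in the sentence above, for example, belongs to the ambiguity class {noun, verb}.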
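
A minimal Python sketch of grouping words by ambiguity class, taking a class to be the set of tags a word can receive. The lexicon is the toy one from the example sentence, and the function name ambiguity_classes is hypothetical rather than anything provided by apertium-tagger.

```python
from collections import defaultdict

# Toy lexicon: each word mapped to the tags it can receive.
lexicon = {
    "vino": {"noun", "verb"},
    "a": {"pr"},
    "la": {"det", "prn"},
    "playa": {"noun"},
}


def ambiguity_classes(lexicon):
    """Group words by the set of tags they can take.

    Hypothetical helper: each key is a frozenset of tags (one
    ambiguity class), each value the words belonging to that class.
    """
    classes = defaultdict(list)
    for word, tags in lexicon.items():
        classes[frozenset(tags)].append(word)
    return dict(classes)


classes = ambiguity_classes(lexicon)
for tags, members in classes.items():
    print(sorted(tags), members)
```

Unambiguous words fall into classes with a single tag, so here "playa" lands in {noun} while "vino" lands in {noun, verb}.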

Training

Expectation-Maximisation (EM)

Baum-Welch

Tagging

Viterbi

See also

Notes

  1. Also referred to as "parts-of-speech", e.g. Noun, Verb, Adjective, Adverb, Conjunction, etc.