==Hidden Markov models==
 
 
A hidden Markov model (HMM) is a statistical model which consists of a number of hidden states and a number of observable states. The hidden states correspond to the "correct" sequence of tags for a given ambiguous sentence; for the sentence "Vino a la playa" in the examples below, this would be {{sc|verb, pr, det, noun}}.
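
The following is a minimal illustration in Python (not <code>apertium-tagger</code> code) of how the two kinds of state line up for that sentence; the variable names are only for exposition:

<pre>
# Observable states: what the morphological analyser gives us, i.e. the
# ambiguity class (set of possible tags) of each word of "Vino a la playa".
observations = [
    {"verb", "noun"},  # vino
    {"pr"},            # a
    {"det", "prn"},    # la
    {"noun"},          # playa
]

# Hidden states: the "correct" tag of each word, which the tagger
# has to recover from the observations.
hidden_states = ["verb", "pr", "det", "noun"]
</pre>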
 
 
===Ambiguity classes===
 
 
In the <code>apertium-tagger</code>, and indeed in many HMM-based part-of-speech taggers, the set of observable states corresponds to a set of '''ambiguity classes'''. The ambiguity classes of a model are the set of possible ambiguities (often denoted with <math>\Sigma</math>). For example, for the sentence "Vino a la playa" these would be ({{sc|noun &#124; verb}}) and ({{sc|det &#124; prn}}). The preposition "a" and the noun "playa" are unambiguous and therefore do not belong to an ambiguity class. The ambiguity classes are calculated automatically from the corpus to be used for training.
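
A minimal sketch of how the ambiguity classes could be collected from an analysed corpus (the data layout and function name are invented for illustration; this is not the <code>apertium-tagger</code> implementation):

<pre>
# Analysed corpus: each word is the set of tags returned by the analyser.
analysed_corpus = [
    [{"verb", "noun"}, {"pr"}, {"det", "prn"}, {"noun"}],    # Vino a la playa
    [{"verb"}, {"pr"}, {"det", "prn"}, {"noun", "verb"}],    # Voy a la casa
    [{"verb"}, {"noun", "verb"}, {"pr"}, {"noun", "verb"}],  # Bebe vino en casa
    [{"det", "prn"}, {"noun", "verb"}, {"verb"}, {"adj"}],   # La casa es grande
    [{"verb"}, {"det", "prn", "verb"}, {"noun"}, {"adj"}],   # Es una ciudad grande
]

def ambiguity_classes(corpus):
    """Return every distinct ambiguity class (a set of more than one tag);
    unambiguous words are left out, as described above."""
    return {frozenset(word) for sentence in corpus
            for word in sentence if len(word) > 1}

print(ambiguity_classes(analysed_corpus))
# -> verb/noun, det/prn and det/prn/verb, matching the list further down
</pre>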
 
 
===Lexical model===
 
 
===Syntactic model===
 
 
==Training==
 
 
===Preparation===
 
 
;Corpora types
 
 
{|class=wikitable
 
! Untagged !! Analysed !! Tagged
 
|-
 
| Vino a la playa || Vino{{fadetag|<verb>/<noun>}} a{{fadetag|<pr>}} la{{fadetag|<det>/<prn>}} playa{{fadetag|<noun>}} || Vino{{fadetag|<verb>}} a{{fadetag|<pr>}} la{{fadetag|<det>}} playa{{fadetag|<noun>}}
 
|-
 
| Voy a la casa || Voy{{fadetag|<verb>}} a{{fadetag|<pr>}} la{{fadetag|<det>/<prn>}} casa{{fadetag|<noun>/<verb>}} || Voy{{fadetag|<verb>}} a{{fadetag|<pr>}} la{{fadetag|<det>}} casa{{fadetag|<noun>}}
 
|-
 
| Bebe vino en casa || Bebe{{fadetag|<verb>}} vino{{fadetag|<noun>/<verb>}} en{{fadetag|<pr>}} casa{{fadetag|<noun>/<verb>}} || Bebe{{fadetag|<verb>}} vino{{fadetag|<noun>}} en{{fadetag|<pr>}} casa{{fadetag|<noun>}}
 
|-
 
| La casa es grande || La{{fadetag|<det>/<prn>}} casa{{fadetag|<noun>/<verb>}} es{{fadetag|<verb>}} grande{{fadetag|<adj>}} || La{{fadetag|<det>}} casa{{fadetag|<noun>}} es{{fadetag|<verb>}} grande{{fadetag|<adj>}}
 
|-
 
| Es una ciudad grande || Es{{fadetag|<verb>}} una{{fadetag|<det>/<prn>/<verb>}} ciudad{{fadetag|<noun>}} grande{{fadetag|<adj>}} || Es{{fadetag|<verb>}} una{{fadetag|<det>}} ciudad{{fadetag|<noun>}} grande{{fadetag|<adj>}}
 
|}
 
 
;Ambiguity classes
 
 
* verb / noun
 
* det / prn
 
* det / prn / verb
 
 
;Transition and lexical counts


From the tagged examples we can extract the following transition counts (left table) and counts of each word with each part of speech (right table); a counting sketch in code is given after the tables:
 
<div style="float:left">
 
{|class=wikitable
 
! !!colspan=6|Second tag
 
|-
 
! First tag !! verb !! noun !! det !! prn !! pr !! adj
 
|-
 
| '''verb''' || 0 || 1 || 1 || 0 || 2 || 1
 
|-
 
| '''noun''' || 1 || 0 || 0 || 0 || 1 || 1
 
|-
 
| '''det''' || 0 || 4 || 0 || 0 || 0 || 0
 
|-
 
| '''prn''' || 0 || 0 || 0 || 0 || 0 || 0
 
|-
 
| '''pr''' || 0 || 1 || 2 || 0 || 0 || 0
 
|-
 
| '''adj''' || 0 || 0 || 0 || 0 || 0 || 0
 
|-
 
|}
 
</div>
 
<div style="float: right">
 
{|class=wikitable
 
! !!colspan=6|Part-of-speech
 
|-
 
! Word !! verb !! noun !! det !! prn !! pr !! adj
 
|-
 
| vino || 1 || 1 || 0 || 0 || 0 || 0
 
|-
 
| a || 0 || 0 || 0 || 0 || 2 || 0
 
|-
 
| la || 0 || 0 || 3 || 0 || 0 || 0
 
|-
 
| playa || 0 || 1 || 0 || 0 || 0 || 0
 
|-
 
| voy || 1 || 0 || 0 || 0 || 0 || 0
 
|-
 
| casa || 0 || 3 || 0 || 0 || 0 || 0
 
|-
 
| es || 2 || 0 || 0 || 0 || 0 || 0
 
|-
 
| grande || 0 || 0 || 0 || 0 || 0 || 2
 
|-
 
| una || 0 || 0 || 1 || 0 || 0 || 0
 
|-
 
| ciudad || 0 || 1 || 0 || 0 || 0 || 0
 
|-
 
| bebe || 1 || 0 || 0 || 0 || 0 || 0
 
|-
 
| en || 0 || 0 || 0 || 0 || 1 || 0
 
|-
 
 
|}
 
</div>
 
<br style="clear:both"/>
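
The counts in both tables can be reproduced with a single counting pass over the tagged corpus. The following Python sketch is only illustrative (the <code>apertium-tagger</code> does not use this representation); it assumes the tagged corpus is available as lists of (word, tag) pairs:

<pre>
from collections import Counter

# Tagged corpus as (word, tag) pairs, one list per sentence.
tagged_corpus = [
    [("vino", "verb"), ("a", "pr"), ("la", "det"), ("playa", "noun")],
    [("voy", "verb"), ("a", "pr"), ("la", "det"), ("casa", "noun")],
    [("bebe", "verb"), ("vino", "noun"), ("en", "pr"), ("casa", "noun")],
    [("la", "det"), ("casa", "noun"), ("es", "verb"), ("grande", "adj")],
    [("es", "verb"), ("una", "det"), ("ciudad", "noun"), ("grande", "adj")],
]

transition_counts = Counter()  # (first tag, second tag) -> count
lexical_counts = Counter()     # (word, tag) -> count

for sentence in tagged_corpus:
    tags = [tag for _word, tag in sentence]
    transition_counts.update(zip(tags, tags[1:]))  # tag bigrams
    lexical_counts.update(sentence)                # word/tag pairs

print(transition_counts[("det", "noun")])  # 4, as in the left table
print(lexical_counts[("casa", "noun")])    # 3, as in the right table
</pre>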
 
 
===Parameter estimation===
 
 
Estimating the parameters of the HMM means assigning a probability to each transition between tags (the syntactic model) and to the emission of each word from a tag (the lexical model). For example, from the transition counts above we expect a trained model to reflect that a determiner is more likely than a pronoun after a preposition:

<math>P(det|pr) > P(prn|pr)</math>


The <code>apertium-tagger</code> has two options for training (that is, estimating the parameters of) an HMM; the choice between them depends on whether a pre-disambiguated corpus is available. Maximum-likelihood estimation (MLE) requires a pre-tagged corpus, whereas Baum-Welch can also be trained on an untagged, morphologically analysed corpus. A small sketch of the MLE case is given below.
 
 
====Maximum likelihood estimation (MLE)====
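
A minimal sketch of maximum-likelihood estimation over the transition counts above, as plain relative frequencies with no smoothing (the function and variable names are invented for illustration; this is not the <code>apertium-tagger</code> code):

<pre>
from collections import Counter

# Non-zero transition counts from the table above.
transition_counts = Counter({
    ("verb", "noun"): 1, ("verb", "det"): 1, ("verb", "pr"): 2, ("verb", "adj"): 1,
    ("noun", "verb"): 1, ("noun", "pr"): 1, ("noun", "adj"): 1,
    ("det", "noun"): 4,
    ("pr", "noun"): 1, ("pr", "det"): 2,
})

def mle_transition_probs(counts):
    """Relative frequencies: P(second | first) = C(first, second) / C(first, *)."""
    totals = Counter()
    for (first, _second), n in counts.items():
        totals[first] += n
    return {(first, second): n / totals[first]
            for (first, second), n in counts.items()}

probs = mle_transition_probs(transition_counts)
print(probs[("pr", "det")])  # 2/3, while P(prn|pr) is 0: P(det|pr) > P(prn|pr)
</pre>

Unseen transitions get probability zero under this estimate, which is one reason smoothing or Baum-Welch re-estimation may be preferred in practice.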
 
 
====Baum-Welch====
 
 
==Tagging==
 
 
===Viterbi===
 
