Difference between revisions of "Talk:Part-of-speech tagging"
− | |||
− | ==Hidden Markov models== |
||
− | |||
− | A hidden Markov model (HMM) is a statistical model which consists of a number of hidden states, and a number of observable states. The hidden states correspond to the "correct" set of tags for a given ambiguous sentence, this would be {{sc|verb, pr, det, noun}} in the above example. |
||
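
The two state sets for this toy example can be sketched in Python. This is an illustrative sketch only; the names and data structures are not those used by <code>apertium-tagger</code>.

```python
# A toy HMM view of the sentence "Vino a la playa".
# Hidden states are part-of-speech tags; observable states are
# ambiguity classes (sets of possible tags for a word).

hidden_states = ["verb", "noun", "det", "prn", "pr", "adj"]

# Each word of the ambiguous sentence maps to its ambiguity class.
observations = [
    ("vino", frozenset({"verb", "noun"})),
    ("a", frozenset({"pr"})),
    ("la", frozenset({"det", "prn"})),
    ("playa", frozenset({"noun"})),
]

# The "correct" hidden-state sequence for this sentence:
correct_tags = ["verb", "pr", "det", "noun"]

# Every correct tag is a member of the corresponding ambiguity class.
assert all(t in cls for t, (_, cls) in zip(correct_tags, observations))
```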
− | |||
− | ===Ambiguity classes=== |
||
− | |||
− | In the <code>apertium-tagger</code>, and indeed in many HMM based part-of-speech taggers, the set of observable states corresponds to a set of '''ambiguity classes'''. The ambiguity classes of a model are the set of possible ambiguities (often denoted with <math>\Sigma</math>). For example, in the above example, these would be ({{sc|noun | verb}}) and ({{sc|det | prn}}). The preposition "a" and the noun "playa" are unambiguous therefore don't belong to an ambiguity class. These are calculated automatically from the corpus to be used for training. |
||
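
The extraction of ambiguity classes from an analysed corpus can be sketched as follows. A simplified <code>word/tag1/tag2</code> notation is assumed here for readability; the real Apertium stream format differs.

```python
# Sketch: collect the set of ambiguity classes from an analysed corpus.
# Notation is a simplified "word/tag1/tag2", not the Apertium stream format.

analysed = [
    "Vino/verb/noun a/pr la/det/prn playa/noun",
    "Es/verb una/det/prn/verb ciudad/noun grande/adj",
]

def ambiguity_classes(sentences):
    classes = set()
    for sentence in sentences:
        for token in sentence.split():
            tags = frozenset(token.lower().split("/")[1:])
            if len(tags) > 1:  # unambiguous words form no ambiguity class
                classes.add(tags)
    return classes

print(sorted(sorted(c) for c in ambiguity_classes(analysed)))
# → [['det', 'prn'], ['det', 'prn', 'verb'], ['noun', 'verb']]
```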
− | |||
− | ===Lexical model=== |
||
− | |||
− | ===Syntactic model=== |
||
− | |||
− | ==Training== |
||
− | |||
− | ===Preparation=== |
||
− | |||
− | ;Corpora types |
||
− | |||
− | {|class=wikitable |
||
− | ! Untagged !! Analysed !! Tagged |
||
− | |- |
||
− | | Vino a la playa || Vino{{fadetag|<verb>/<noun>}} a{{fadetag|<pr>}} la{{fadetag|<det>/<prn>}} playa{{fadetag|<noun>}} || Vino{{fadetag|<verb>}} a{{fadetag|<pr>}} la{{fadetag|<det>}} playa{{fadetag|<noun>}} |
||
− | |- |
||
− | | Voy a la casa || Voy{{fadetag|<verb>}} a{{fadetag|<pr>}} la{{fadetag|<det>/<prn>}} casa{{fadetag|<noun>/<verb>}} || Voy{{fadetag|<verb>}} a{{fadetag|<pr>}} la{{fadetag|<det>}} casa{{fadetag|<noun>}} |
||
− | |- |
||
− | | Bebe vino en casa || Bebe{{fadetag|<verb>}} vino{{fadetag|<noun>/<verb>}} en{{fadetag|<pr>}} casa{{fadetag|<noun>/<verb>}} || Bebe{{fadetag|<verb>}} vino{{fadetag|<noun>}} en{{fadetag|<pr>}} casa{{fadetag|<noun>}} |
||
− | |- |
||
− | | La casa es grande || La{{fadetag|<det>/<prn>}} casa{{fadetag|<noun>/<verb>}} es{{fadetag|<verb>}} grande{{fadetag|<adj>}} || La{{fadetag|<det>}} casa{{fadetag|<noun>}} es{{fadetag|<verb>}} grande{{fadetag|<adj>}} |
||
− | |- |
||
− | | Es una ciudad grande || Es{{fadetag|<verb>}} una{{fadetag|<det>/<prn>/<verb>}} ciudad{{fadetag|<noun>}} grande{{fadetag|<adj>}} || Es{{fadetag|<verb>}} una{{fadetag|<det>}} ciudad{{fadetag|<noun>}} grande{{fadetag|<adj>}} |
||
− | |} |
||
− | |||
− | ;Ambiguity classes |
||
− | |||
− | * verb / noun |
||
− | * det / prn |
||
− | * det / prn / verb |
||
− | |||
− | ;Transition counts |
||
− | |||
− | From the tagged examples we can extract the following transition counts: |
||
− | <div style="float:left"> |
||
− | {|class=wikitable |
||
− | ! !!colspan=6|Second tag |
||
− | |- |
||
− | ! First tag !! verb !! noun !! det !! prn !! pr !! adj |
||
− | |- |
||
− | | '''verb''' || 0 || 1 || 1 || 0 || 2 || 1 |
||
− | |- |
||
− | | '''noun''' || 1 || 0 || 0 || 0 || 1 || 1 |
||
− | |- |
||
− | | '''det''' || 0 || 4 || 0 || 0 || 0 || 0 |
||
− | |- |
||
− | | '''prn''' || 0 || 0 || 0 || 0 || 0 || 0 |
||
− | |- |
||
− | | '''pr''' || 0 || 1 || 2 || 0 || 0 || 0 |
||
− | |- |
||
− | | '''adj''' || 0 || 0 || 0 || 0 || 0 || 0 |
||
− | |- |
||
− | |} |
||
− | </div> |
||
− | <div style="float: right"> |
||
− | {|class=wikitable |
||
− | ! !!colspan=6|Part-of-speech |
||
− | |- |
||
− | ! Word !! verb !! noun !! det !! prn !! pr !! adj |
||
− | |- |
||
− | | vino || 1 || 1 || 0 || 0 || 0 || 0 |
||
− | |- |
||
− | | a || 0 || 0 || 0 || 0 || 2 || 0 |
||
− | |- |
||
− | | la || 0 || 0 || 3 || 0 || 0 || 0 |
||
− | |- |
||
− | | playa || 0 || 1 || 0 || 0 || 0 || 0 |
||
− | |- |
||
− | | voy || 1 || 0 || 0 || 0 || 0 || 0 |
||
− | |- |
||
− | | casa || 0 || 3 || 0 || 0 || 0 || 0 |
||
− | |- |
||
− | | es || 2 || 0 || 0 || 0 || 0 || 0 |
||
− | |- |
||
− | | grande || 0 || 0 || 0 || 0 || 0 || 2 |
||
− | |- |
||
− | | una || 0 || 0 || 1 || 0 || 0 || 0 |
||
− | |- |
||
− | | ciudad || 0 || 1 || 0 || 0 || 0 || 0 |
||
− | |- |
||
− | | bebo || 1 || 0 || 0 || 0 || 0 || 0 |
||
− | |- |
||
− | | en || 0 || 0 || 0 || 0 || 1 || 0 |
||
− | |- |
||
− | |||
− | |} |
||
− | </div> |
||
− | <br style="clear:both"/> |
||
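
Gathering these two count tables from the tagged column can be sketched as follows, again assuming the simplified <code>word/tag</code> notation rather than the real Apertium stream format:

```python
# Sketch: extract transition and lexical (emission) counts from a
# tagged corpus in a simplified "word/tag" notation.
from collections import Counter

tagged = [
    "Vino/verb a/pr la/det playa/noun",
    "Voy/verb a/pr la/det casa/noun",
    "Bebe/verb vino/noun en/pr casa/noun",
    "La/det casa/noun es/verb grande/adj",
    "Es/verb una/det ciudad/noun grande/adj",
]

transitions, emissions = Counter(), Counter()
for sentence in tagged:
    pairs = [token.lower().split("/") for token in sentence.split()]
    for word, tag in pairs:
        emissions[(word, tag)] += 1
    for (_, first), (_, second) in zip(pairs, pairs[1:]):
        transitions[(first, second)] += 1

print(transitions[("pr", "det")])   # → 2, matching the transition table
print(emissions[("casa", "noun")])  # → 3, matching the lexical table
```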
− | |||
− | ===Parameter estimation=== |
||
− | |||
− | <math>P(det|pr) > P(prn|pr)</math> |
||
− | |||
− | The <code>apertium-tagger</code> has two options for training (or estimating the parameters of) an HMM. The choice of either depends on the availability of a pre-disambiguated corpus. The maximum-likelihood estimation (ML) algorithm relies on having a pre-tagged corpus. |
||
− | |||

====Maximum likelihood estimation (MLE)====

====Baum-Welch====

==Tagging==

===Viterbi===