Difference between revisions of "Tagger"
Revision as of 07:20, 24 March 2010
Tagger is usually short for part-of-speech tagger, a program which takes an ambiguous sequence of morphologically analysed text and chooses the most probable analysis.
Given the following ambiguous input (from "tengo una idea"):
^tengo/tener<vblex><pri><p1><sg>$ ^una/uno<prn><tn><f><sg>/uno<det><ind><f><sg>/unir<vblex><prs><p3><sg>/unir<vblex><prs><p1><sg>/unir<vblex><imp><p3><sg>$ ^idea/idea<n><f><sg>/idear<vblex><pri><p3><sg>/idear<vblex><imp><p2><sg>$
a good tagger would end up with:
^tener<vblex><pri><p1><sg>$ ^uno<det><ind><f><sg>$ ^idea<n><f><sg>$
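To make the ambiguity in the input above concrete, the following minimal Python sketch splits the stream into lexical units and counts the analyses attached to each surface form. The function name parse_stream is hypothetical and not part of Apertium, and the sketch assumes the simple case shown here, where each unit is written as ^surface/analysis/analysis$ with no escaped characters or formatting blocks.

```python
import re

# One lexical unit: everything between '^' and the next '$'.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")

def parse_stream(text):
    """Return a list of (surface form, [analyses]) pairs.

    Hypothetical helper: assumes no escaped '/', '^' or '$' in the input.
    """
    units = []
    for match in LEXICAL_UNIT.finditer(text):
        surface, *analyses = match.group(1).split("/")
        units.append((surface, analyses))
    return units

ambiguous = (
    "^tengo/tener<vblex><pri><p1><sg>$ "
    "^una/uno<prn><tn><f><sg>/uno<det><ind><f><sg>"
    "/unir<vblex><prs><p3><sg>/unir<vblex><prs><p1><sg>"
    "/unir<vblex><imp><p3><sg>$ "
    "^idea/idea<n><f><sg>/idear<vblex><pri><p3><sg>"
    "/idear<vblex><imp><p2><sg>$"
)

for surface, analyses in parse_stream(ambiguous):
    # Prints the surface form and how many readings the tagger must choose from.
    print(surface, len(analyses))
```

For this input the loop reports one analysis for "tengo", five for "una" and three for "idea"; the tagger's job is to reduce each of these to a single analysis.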
The program apertium-tagger achieves this by using a Hidden Markov Model, a statistical model using bigrams (trigram training is also possible). Training of apertium-tagger can be supervised or unsupervised; there is also target-language tagger training, where training is based on how good the translations produced by the tagging are, as judged by a target-language language model. If a certain bigram sequence is impossible, one may tell the tagger so explicitly with FORBID or ENFORCE rules.
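The idea of choosing the most probable tag sequence under a bigram model can be sketched in Python as follows. This is a simplified illustration, not the apertium-tagger implementation: it works on coarse tags only, ignores emission probabilities, and every number in TRANS is invented for the example. The FORBID set is likewise a hypothetical stand-in for a forbid rule; it simply gives the listed bigram a probability of zero.

```python
import math

# Invented bigram probabilities P(tag | previous tag); "<s>" marks sentence start.
TRANS = {
    ("<s>", "vblex"): 0.3,
    ("vblex", "det"): 0.4,
    ("vblex", "prn"): 0.1,
    ("vblex", "vblex"): 0.05,
    ("vblex", "n"): 0.2,
    ("det", "n"): 0.8,
    ("det", "vblex"): 0.01,
    ("prn", "n"): 0.05,
    ("prn", "vblex"): 0.3,
}
FLOOR = 1e-4                  # assumed probability for unseen bigrams
FORBID = {("det", "vblex")}   # hypothetical forbidden bigram

def score(prev, cur):
    """Log-probability of the bigram, or -inf if it is forbidden."""
    if (prev, cur) in FORBID:
        return -math.inf
    return math.log(TRANS.get((prev, cur), FLOOR))

def viterbi(candidates):
    """Pick one tag per word so that the product of bigram scores is maximal."""
    best = {"<s>": (0.0, [])}  # tag -> (log-probability, best path ending in tag)
    for tags in candidates:
        step = {}
        for cur in tags:
            step[cur] = max(
                ((logp + score(prev, cur), path + [cur])
                 for prev, (logp, path) in best.items()),
                key=lambda item: item[0],
            )
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

# Coarse tags of the candidate analyses for "tengo una idea".
sentence = [["vblex"], ["prn", "det", "vblex"], ["n", "vblex"]]
print(viterbi(sentence))
```

With these invented numbers the sketch selects verb, determiner, noun for "tengo una idea". A real tagger would estimate the probabilities from training data rather than hard-code them.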
Some language pairs use Constraint Grammar (CG) to remove more readings before apertium-tagger runs; CG lets you write rule-based taggers, which allow more complex rules.