User:Gang Chen/GSoC 2013 Summary
A Sliding-Window Drop-in Replacement for the HMM Part-of-Speech Tagger in Apertium
The Full Paper/Documentation
The full paper/documentation describing the mechanism, implementation, and experimental results of the SW and LSW taggers can be found here:
A summary of the work in the form of a scientific paper is also available here:
- Gang Chen, Mikel L. Forcada. A Light Sliding-Window Part-of-Speech Tagger for the Apertium Free/Open-Source Machine Translation Platform, in http://arxiv.org/abs/1509.05517
Although the title describes the new tagger as a sliding-window drop-in replacement for the HMM tagger, the project went further in three ways:
1) Besides the Sliding-Window (SW) tagger, we also implemented the Light Sliding-Window (LSW) tagger, and successfully incorporated rules into the LSW tagger.
2) Instead of replacing the original HMM tagger, we implemented the LSW tagger as an extension, so that all the functionality of the original HMM tagger is preserved.
3) We wrote a paper about the implementation of the SW and LSW taggers, and conducted a series of experiments.
The SW tagger
The Main Idea
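Let $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_{|\Gamma|}\}$ be the tag set, and $W = \{w_1, w_2, \ldots, w_{|W|}\}$ be the words to be tagged.
A partition of $W$ is established so that $w_i \equiv w_j$ if and only if both are assigned the same subset of tags, where each class of the partition is called an ambiguity class.
Let $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|}\}$ be the collection of ambiguity classes, where each $\sigma_i$ is an ambiguity class.
Let $T : \Sigma \to 2^{\Gamma}$ be the function returning the collection of PoS tags for an ambiguity class $\sigma$.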
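The definitions above can be illustrated with a minimal Python sketch. The toy lexicon, the tag names, and the helper `ambiguity_classes` are hypothetical and chosen only for illustration; they are not taken from the Apertium code base.

```python
from collections import defaultdict

# Hypothetical toy lexicon: each word maps to the set of PoS tags it may take.
LEXICON = {
    "the": {"DET"},
    "book": {"NOUN", "VERB"},
    "walk": {"NOUN", "VERB"},
    "quickly": {"ADV"},
    "that": {"DET", "PRON", "CNJ"},
}


def ambiguity_classes(lexicon):
    """Partition the words so that two words share a class iff they have the same tag set."""
    classes = defaultdict(set)
    for word, tags in lexicon.items():
        # The frozenset of tags identifies the ambiguity class and doubles as T(sigma).
        classes[frozenset(tags)].add(word)
    return dict(classes)


CLASSES = ambiguity_classes(LEXICON)
```

Here each key of `CLASSES` plays the role of an ambiguity class together with its tag collection, and each value is the set of words in that class.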
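The PoS tagging problem may be formulated as follows: given a text $w[1]w[2]\ldots w[L] \in W^+$, each word $w[t]$ is assigned (using a lexicon and a morphological analyzer) an ambiguity class $\sigma[t] \in \Sigma$ to obtain the ambiguously tagged text $\sigma[1]\sigma[2]\ldots\sigma[L] \in \Sigma^+$; the task of a PoS tagger is to get a tag sequence $\gamma[1]\gamma[2]\ldots\gamma[L] \in \Gamma^+$ as correct as possible, that is, the one that maximizes the probability of that tag sequence given the word sequence:
$$\gamma^*[1]\ldots\gamma^*[L] = \arg\max_{\gamma[t] \in T(\sigma[t])} P(\gamma[1]\ldots\gamma[L] \mid \sigma[1]\ldots\sigma[L])$$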
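The core of the SW tagger is to use neighboring ambiguity classes to approximate the dependencies:
$$P(\gamma[1]\ldots\gamma[L] \mid \sigma[1]\ldots\sigma[L]) \approx \prod_{t=1}^{L} p(\gamma[t] \mid C_{(-)}[t]\,\sigma[t]\,C_{(+)}[t])$$
where $C_{(-)}[t] = \sigma[t-N_{(-)}]\ldots\sigma[t-1]$ is the left context of length $N_{(-)}$ (e.g. if $N_{(-)} = 1$, then $C_{(-)}[t] = \sigma[t-1]$), and $C_{(+)}[t] = \sigma[t+1]\ldots\sigma[t+N_{(+)}]$ is the right context of length $N_{(+)}$. Usually, specific sentence marks “ # ” are added at the leftmost and rightmost ends of the ambiguity class sequence so that the formula works in general.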
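As a rough illustration of the tagging step only (not of training), the following Python sketch picks, for each word, the tag of its ambiguity class with the highest count in the surrounding window of ambiguity classes. The function name `sw_tag`, the layout of the `counts` dictionary, and the representation of ambiguity classes as frozensets are assumptions made for this example, not the actual Apertium implementation.

```python
def sw_tag(sigma, counts, n_left=1, n_right=1, pad="#"):
    """Tag an ambiguity-class sequence with a sliding window (illustrative sketch).

    sigma:  list of ambiguity classes, each a frozenset of tags.
    counts: dict mapping (left_context, tag, right_context) to a count, where
            the contexts are tuples of ambiguity classes (or pad marks).
    """
    # Pad both ends with sentence marks so every position has a full window.
    padded = [pad] * n_left + list(sigma) + [pad] * n_right
    tags = []
    for t in range(n_left, n_left + len(sigma)):
        left = tuple(padded[t - n_left:t])
        right = tuple(padded[t + 1:t + 1 + n_right])
        # Only tags allowed by the word's own ambiguity class are candidates.
        candidates = sorted(padded[t])
        best = max(candidates, key=lambda g: counts.get((left, g, right), 0.0))
        tags.append(best)
    return tags
```

Unseen windows default to a count of zero, in which case the alphabetically first candidate is returned; a real tagger would need a more careful fallback.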
Please refer to the full documentation.
The LSW tagger
The Main Difference from the SW Tagger
The SW tagger tags a word by looking at the neighbouring ambiguity classes, and therefore has a number of parameters in $O(|\Sigma|^{N_{(-)}+N_{(+)}}\,|\Gamma|)$. The LSW tagger tags a word by looking at the neighbouring tags, and therefore has a number of parameters in $O(|\Gamma|^{N_{(-)}+N_{(+)}+1})$. Usually the tag set size $|\Gamma|$ is significantly smaller than the number of ambiguity classes $|\Sigma|$, which consists of combinations of tags. In this way, the number of parameters can be effectively reduced.
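To give a feel for the difference, the short Python sketch below evaluates the two bounds, assuming they take the forms $|\Sigma|^{N_{(-)}+N_{(+)}}|\Gamma|$ for SW and $|\Gamma|^{N_{(-)}+N_{(+)}+1}$ for LSW. The function names and the sizes used in the usage lines are hypothetical, not measurements from any Apertium language pair.

```python
def sw_param_bound(n_classes, n_tags, n_left=1, n_right=1):
    """Upper bound on SW parameters: one count per (class context, tag) combination."""
    return n_classes ** (n_left + n_right) * n_tags


def lsw_param_bound(n_tags, n_left=1, n_right=1):
    """Upper bound on LSW parameters: one count per tag sequence of window length."""
    return n_tags ** (n_left + n_right + 1)


# Hypothetical sizes: 300 ambiguity classes and 50 tags, one word of context on each side.
sw_bound = sw_param_bound(300, 50)
lsw_bound = lsw_param_bound(50)
```

With these made-up sizes, the SW bound is 4,500,000 and the LSW bound is 125,000.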
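The LSW tagger approximates the best tag as follows:
$$\gamma^*[t] = \arg\max_{\gamma \in T(\sigma[t])} \sum_{E_{(-)} \in T'(C_{(-)}[t])} \; \sum_{E_{(+)} \in T'(C_{(+)}[t])} p(E_{(-)}\,\gamma\,E_{(+)})$$
where $T' : \Sigma^* \to 2^{\Gamma^*}$ is the function returning the set of tag sequences for an ambiguity class sequence, and $E_{(-)}$ and $E_{(+)}$ are the left and right tag sequences, respectively.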
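The selection of a tag by summing over all tag sequences compatible with the neighbouring ambiguity classes can be sketched in Python as follows. The function name `lsw_tag_word` and the dictionary layout of the parameters `p` are assumptions made for illustration; the weights are treated as given, and training is not shown.

```python
from itertools import product


def lsw_tag_word(left_classes, sigma_t, right_classes, p):
    """Choose a tag for one word by summing weights over context tag sequences.

    left_classes / right_classes: lists of ambiguity classes (sets of tags)
        forming the left and right context windows.
    sigma_t: ambiguity class (set of tags) of the word being tagged.
    p: dict mapping (left_tag_seq, tag, right_tag_seq) to a weight, where the
       tag sequences are tuples of tags.
    """
    best_tag, best_score = None, float("-inf")
    for tag in sorted(sigma_t):
        score = 0.0
        # Enumerate every tag sequence the context classes allow.
        for e_left in product(*[sorted(c) for c in left_classes]):
            for e_right in product(*[sorted(c) for c in right_classes]):
                score += p.get((e_left, tag, e_right), 0.0)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag
```

Note that the parameter table is indexed by tags only, which is what keeps it small compared with a table indexed by ambiguity classes.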
Forbid and Enforce Rules for the LSW Tagger
The following is a fragment of the forbid and enforce rules.
<forbid>
  <label-sequence>
    <label-item label="VHAVEPP"/>
    <label-item label="INF"/>
  </label-sequence>
  ……
</forbid>
<enforce-rules>
  <enforce-after label="VHAVEINF">
    <label-set>
      <label-item label="VLEXPP"/>
      <label-item label="VSERPP"/>
      <label-item label="ADV"/>
    </label-set>
  </enforce-after>
  ……
</enforce-rules>
The rules cannot be used in an SW tagger, because the parameters for an SW tagger are $n(C_{(-)}\,\gamma\,C_{(+)})$, that is, the counts of a certain tag $\gamma$ appearing between the ambiguity class contexts $C_{(-)}$ and $C_{(+)}$, whereas the rules are stated in terms of tags. However, it is quite easy to incorporate the rules into an LSW tagger, because the parameters for an LSW tagger are $n(E_{(-)}\,\gamma\,E_{(+)})$, that is, the (effective) counts of a certain tag $\gamma$ appearing between the tag sequence contexts $E_{(-)}$ and $E_{(+)}$.
The rules for the LSW tagger can be introduced right after the initial step, in a similar way as in the HMM tagger. For a tag sequence in the parameter space, if any two consecutive tags match a forbid rule or fail to match an enforce rule, the underlying parameter is given a starting value of zero.
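A minimal Python sketch of this initialization step is given below. It assumes that forbid rules are stored as pairs of consecutive tags and enforce rules as a mapping from a tag to the set of tags allowed to follow it; the names `FORBID`, `ENFORCE_AFTER`, `violates_rules`, and `apply_rules` are hypothetical, and the rule values are the ones from the fragment above.

```python
# Forbid rule: the tag pair (VHAVEPP, INF) must not occur consecutively.
FORBID = {("VHAVEPP", "INF")}
# Enforce rule: VHAVEINF may only be followed by one of these tags.
ENFORCE_AFTER = {"VHAVEINF": {"VLEXPP", "VSERPP", "ADV"}}


def violates_rules(tag_seq, forbid, enforce_after):
    """Return True if any two consecutive tags break a forbid or an enforce rule."""
    for first, second in zip(tag_seq, tag_seq[1:]):
        if (first, second) in forbid:
            return True
        if first in enforce_after and second not in enforce_after[first]:
            return True
    return False


def apply_rules(params, forbid, enforce_after):
    """Set the starting value of every rule-violating tag sequence to zero."""
    return {
        seq: 0.0 if violates_rules(seq, forbid, enforce_after) else value
        for seq, value in params.items()
    }
```

Each key of `params` is assumed to be a full window of tags, that is, the left tag context, the tag itself, and the right tag context concatenated into one tuple.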
Please refer to the full documentation.