User:Gang Chen/GSoC 2013 Application: "Sliding Window PoS Tagger"


Here is how it works

Overview

The task of a PoS tagger is to choose the best PoS tag for each position of the input. According to the documentation, the input to the tagger is an "ambiguity class sequence", and the output is a normal fine-grained tag sequence.

Firstly, "categories" are generated by grouping similar fine-grained tags together, because the information provided by a morphological analyser is too detailed for PoS disambiguation.

For example, the morphological analysis (i.e. the normal fine-grained tags) of the input sentence "I have seen it" is:

^I/I<num><mf><sg>/prpers<prn><subj><p1><mf><sg>$ ^have/have<vbhaver><inf>/have<vbhaver><pres>/have<vblex><inf>/have<vblex><pres>$ ^seen/see<vblex><pp>$ ^it/prpers<prn><subj><p3><nt><sg>/prpers<prn><obj><p3><nt><sg>$

According to the tagger configuration file (or the TSX file), "num.*" will be grouped into a category called "NUM", "prn.subj.*" will be grouped into a category called "PRNSUBJ", etc. So we finally get a category sequence:

^I/NUM/PRNSUBJ$ ^have/VHAVEINF/VHAVEPRES/INF/PRES$ ^seen/VLEXPP$ ^it/PRNSUBJ/PRNOBJ$

Secondly, a partition of the vocabulary is established so that two words belong to the same class if and only if they are assigned the same subset of categories. Each class of the partition is called an ambiguity class.
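
To make the grouping concrete, here is a minimal sketch in Python (not the actual apertium-tagger code) of how fine-grained analyses could be mapped to coarse categories, and how a word's ambiguity class falls out of that mapping. The category patterns and function names are invented for illustration; the real definitions live in the TSX file.

 # Hypothetical category definitions: a category matches an analysis when all
 # of its required fine-grained tags are present.
 CATEGORY_PATTERNS = [
     ("NUM",       ["num"]),
     ("PRNSUBJ",   ["prn", "subj"]),
     ("PRNOBJ",    ["prn", "obj"]),
     ("VHAVEINF",  ["vbhaver", "inf"]),
     ("VHAVEPRES", ["vbhaver", "pres"]),
     ("INF",       ["vblex", "inf"]),
     ("PRES",      ["vblex", "pres"]),
     ("VLEXPP",    ["vblex", "pp"]),
 ]

 def category_of(analysis_tags):
     """Map one fine-grained analysis (a list of tags) to its coarse category."""
     for name, required in CATEGORY_PATTERNS:
         if all(tag in analysis_tags for tag in required):
             return name
     return "UNKNOWN"

 def ambiguity_class(analyses):
     """The ambiguity class of a word is the set of categories of its analyses."""
     return frozenset(category_of(tags) for tags in analyses)

 # "have" in the example sentence has four analyses, hence four categories.
 have_analyses = [
     ["vbhaver", "inf"], ["vbhaver", "pres"], ["vblex", "inf"], ["vblex", "pres"],
 ]
 print(ambiguity_class(have_analyses))
 # the four categories VHAVEINF, VHAVEPRES, INF, PRES (in some order)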

Problem Formulation

For the convenience of discussion, we formulate the PoS tagging problem in Apertium as follows:

Input: an ambiguity class sequence

σ = σ_1 σ_2 … σ_N

where each σ_i has a set of candidate PoS tags; let T(σ_i) denote this set, namely T(σ_i) = { γ : γ is a candidate tag of σ_i }.

Output: the best PoS tag sequence

γ = γ_1 γ_2 … γ_N

where each γ_i ∈ T(σ_i).
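
For illustration, the formulation can be written down in code roughly as follows; the type names and the example tagging of "I have seen it" are chosen here just to connect with the earlier example, not taken from the Apertium source.

 from typing import FrozenSet, List

 AmbiguityClass = FrozenSet[str]          # T(sigma_i): the candidate tags
 sentence: List[AmbiguityClass] = [
     frozenset({"NUM", "PRNSUBJ"}),                        # "I"
     frozenset({"VHAVEINF", "VHAVEPRES", "INF", "PRES"}),  # "have"
     frozenset({"VLEXPP"}),                                # "seen"
     frozenset({"PRNSUBJ", "PRNOBJ"}),                     # "it"
 ]
 # One possible output: a tag per position, drawn from the corresponding set.
 tagging: List[str] = ["PRNSUBJ", "VHAVEPRES", "VLEXPP", "PRNOBJ"]
 assert all(t in s for t, s in zip(tagging, sentence))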

Sliding Window PoS tagger

The core idea of the SWPoS tagger is easy to understand. For an input ambiguity class σ_i, the tagger predicts its PoS tag γ_i by looking at the left context and the right context of σ_i. In the paper, the algorithm works best when N(-) (the left context length) is set to 1, and N(+) (the right context length) is also set to 1.

For example, if we have the input as:

^A/a/b$ ^B/x/y$ ^C/m/n$

where A, B and C are the input ambiguity classes, while a, b, x, y, m and n are candidate tags. The context of B is then A _ C.

  • In the tagging procedure, the algorithm first gets the list of tags that may be assigned to B, i.e. x and y; it then consults the model for the context A _ C to see which tag that context supports most. For example, if p(x | A _ C) is bigger than p(y | A _ C), then the PoS tag for B should be x (see the sketch after this list).
  • In the training procedure, the key problem is to figure out the probability that a particular context supports a particular tag, based on unannotated data. The paper provides an iterative method to approximate that probability.
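
Below is a minimal sketch of the tagging step, assuming a left and right context of length 1 and a table of trained counts indexed by the (left, right) context; the table values and the function name are invented for illustration, a trained model would supply them.

 # Hypothetical trained counts for the context A _ C.
 counts = {
     ("A", "C"): {"x": 1.7, "y": 0.3},
 }

 def tag_word(left, right, candidates, counts):
     """Pick the candidate tag that the context (left _ right) supports most."""
     context = counts.get((left, right), {})
     return max(candidates, key=lambda t: context.get(t, 0.0))

 # For the input ^A/a/b$ ^B/x/y$ ^C/m/n$ the middle word has candidates
 # {x, y} and context A _ C, so the chosen tag is x.
 print(tag_word("A", "C", ["x", "y"], counts))   # -> x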

Unsupervised training for the SWPoS tagger

For computational convenience, the algorithm estimates ñ(t), the expected number of times that tag t would appear in a given context C_{-1} _ C_{+1}, instead of the probability p(t | C_{-1} _ C_{+1}). The two are interchangeable, since normalising the counts of a context gives the probabilities.
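
As a small illustration of that interchangeability, normalising the counts of one context yields the corresponding probabilities; the helper name and the numbers are made up.

 def probabilities(context_counts):
     """Turn the counts of one context into a probability distribution."""
     total = sum(context_counts.values())
     return {t: c / total for t, c in context_counts.items()}

 print(probabilities({"x": 3.0, "y": 1.0}))   # {'x': 0.75, 'y': 0.25}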

We use a very simple example to describe how the iterative procedure runs.

Suppose we had only two training cases:

(1) A ^B/x/y$   C
(2) A ^D/x/y/z$ C

where we only focus on the context A _ C. Whether A or C is itself ambiguous or not does not affect the context.

Firstly, for each training case, we assume that the probability of every candidate tag is equal, because we don't have any other information yet. That is, the first case contributes 1/2 to x and 1/2 to y, while the second case contributes 1/3 to each of x, y and z. This is the initial step (or 0-th step) of the algorithm, and we finally get:

ñ(x) = 1/2 + 1/3 = 5/6,   ñ(y) = 1/2 + 1/3 = 5/6,   ñ(z) = 1/3

Then, we can use these counts to estimate the following probabilities (the total count is 5/6 + 5/6 + 1/3 = 2):

p(x | A _ C) = (5/6) / 2 = 5/12,   p(y | A _ C) = 5/12,   p(z | A _ C) = 1/6

And then, using these probabilities, we can re-estimate the counts as expectations: the first case distributes its single occurrence over {x, y} in proportion to p(x) and p(y), and the second case distributes its occurrence over {x, y, z} in proportion to p(x), p(y) and p(z):

ñ(x) = (5/12) / (5/12 + 5/12) + (5/12) / (5/12 + 5/12 + 1/6) = 1/2 + 5/12 = 11/12
ñ(y) = 1/2 + 5/12 = 11/12
ñ(z) = (1/6) / (5/12 + 5/12 + 1/6) = 1/6

With this, we have finished one iteration, and all the counts have been updated. We can observe that ñ(x) and ñ(y) are growing bigger and ñ(z) is growing smaller, while the total count remains the same: 11/12 + 11/12 + 1/6 = 2.
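
The following sketch reproduces this first iteration numerically for the context A _ C, using exact fractions; the variable names are invented, and the two training cases are the ones above.

 from fractions import Fraction

 # The two training cases restrict the middle word to {x, y} and {x, y, z}.
 cases = [{"x", "y"}, {"x", "y", "z"}]
 tags = ["x", "y", "z"]

 # Step 0: distribute each case uniformly over its candidate tags.
 counts = {t: Fraction(0) for t in tags}
 for case in cases:
     for t in case:
         counts[t] += Fraction(1, len(case))
 print(counts)       # x: 5/6, y: 5/6, z: 1/3

 # Turn the counts into probabilities (the total count is 2) ...
 total = sum(counts.values())
 probs = {t: c / total for t, c in counts.items()}
 print(probs)        # x: 5/12, y: 5/12, z: 1/6

 # ... and re-estimate the counts as expectations over each training case.
 new_counts = {t: Fraction(0) for t in tags}
 for case in cases:
     mass = sum(probs[t] for t in case)
     for t in case:
         new_counts[t] += probs[t] / mass
 print(new_counts)   # x: 11/12, y: 11/12, z: 1/6 (total is still 2)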

Repeating the above procedure, the process will converge to a fixed point, with:

ñ(x) = 1,   ñ(y) = 1,   ñ(z) = 0

and the total count still equal to 2.

This is a simple example to show the iterative procedure of the unsupervised training algorithm. Real data may be more complex, but the core mechanism remains the same.
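
As a closing illustration, the same re-estimation can be repeated until the counts stop changing; under the same toy setup as the previous sketch, the counts approach ñ(x) = ñ(y) = 1 and ñ(z) = 0.

 from fractions import Fraction

 # Repeat the re-estimation for a fixed number of rounds (60 is plenty here).
 cases = [{"x", "y"}, {"x", "y", "z"}]
 tags = ["x", "y", "z"]

 counts = {t: sum(Fraction(1, len(c)) for c in cases if t in c) for t in tags}
 for _ in range(60):
     total = sum(counts.values())
     probs = {t: c / total for t, c in counts.items()}
     counts = {t: sum(probs[t] / sum(probs[u] for u in c)
                      for c in cases if t in c)
               for t in tags}

 print({t: float(c) for t, c in counts.items()})
 # roughly {'x': 1.0, 'y': 1.0, 'z': 0.0}, with the total count still 2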