Difference between revisions of "User:Gang Chen/GSoC 2013 Summary"

Revision as of 03:26, 26 September 2013

Project

A Sliding-Window Drop-in Replacement for the HMM Part-of-Speech Tagger in Apertium

Summary

Although the title describes the new tagger as a sliding-window drop-in replacement for the HMM tagger, the project went further in three ways:

1) Besides the Sliding-Window (SW) tagger, we also implemented the Light Sliding-Window (LSW) tagger, and successfully incorporated rules into the LSW tagger.

2) Instead of replacing the original HMM tagger, we implemented the LSW tagger as an extension, so that the functionality of the original HMM tagger is fully preserved.

3) We wrote a paper about the implementation of the SW and LSW taggers, and conducted a series of experiments.

The SW tagger

Let <math>\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_{|\Gamma|}\}</math> be the tag set, and <math>W = \{w_1, w_2, \ldots\}</math> be the words to be tagged.

A partition of <math>W</math> is established so that <math>w_i \equiv w_j</math> if and only if both are assigned the same subset of tags; each class of the partition is called an ambiguity class.

Let <math>\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|}\}</math> be the collection of ambiguity classes, where each <math>\sigma_i</math> is an ambiguity class.

Let <math>T : \Sigma \rightarrow 2^{\Gamma}</math> be the function returning the collection <math>T(\sigma)</math> of PoS tags for an ambiguity class <math>\sigma</math>.
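As a rough illustration of the partition described above, the following Python sketch groups the words of a toy lexicon by their tag sets. The lexicon contents and the names `build_ambiguity_classes` and `T` are assumptions made for this example and are not taken from the Apertium code.

```python
from collections import defaultdict


def build_ambiguity_classes(lexicon):
    """Partition words by the set of tags they can receive.

    lexicon: dict mapping word -> iterable of tags (hypothetical toy input).
    Returns (classes, word_to_class): classes maps each ambiguity class
    (a frozenset of tags) to the sorted list of words belonging to it.
    """
    classes = defaultdict(list)
    word_to_class = {}
    for word, tags in lexicon.items():
        sigma = frozenset(tags)  # two words are equivalent iff these sets match
        classes[sigma].append(word)
        word_to_class[word] = sigma
    return {s: sorted(ws) for s, ws in classes.items()}, word_to_class


def T(sigma):
    """Return the set of PoS tags of an ambiguity class."""
    return set(sigma)


# Toy lexicon (assumed data, for illustration only).
toy_lexicon = {
    "book": ["noun", "verb"],
    "walk": ["verb", "noun"],
    "the": ["det"],
    "red": ["adj"],
}
classes, word_to_class = build_ambiguity_classes(toy_lexicon)
```

Here "book" and "walk" fall into the same ambiguity class because both can be a noun or a verb, while "the" and "red" each form a class of their own.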


The PoS tagging problem may be formulated as follows: given a text <math>w[1]w[2] \ldots w[L] \in W^{+}</math>, each word <math>w[t]</math> is assigned (using a lexicon and a morphological analyzer) an ambiguity class <math>\sigma[t] \in \Sigma</math> to obtain the ambiguously tagged text <math>\sigma[1]\sigma[2] \ldots \sigma[L] \in \Sigma^{+}</math>; the task of a PoS tagger is to get a tag sequence <math>\gamma[1]\gamma[2] \ldots \gamma[L] \in \Gamma^{+}</math> that is as correct as possible, that is, the one that maximizes the probability of the tag sequence given the sequence of ambiguity classes:

<math>\gamma^{*}[1] \ldots \gamma^{*}[L] = \operatorname{argmax}_{\gamma[t] \in T(\sigma[t])} P(\gamma[1] \ldots \gamma[L] | \sigma[1] \ldots \sigma[L])</math>

The core of the SW tagger is to use neighboring ambiguity classes to approximate the dependencies:

<math>P(\gamma[1] \ldots \gamma[L] | \sigma[1] \ldots \sigma[L]) \approx \prod_{t = 1}^{L} p(\gamma[t] | C_{(-)}\sigma[t]C_{(+)})</math>

where <math>t = 1..L</math>, <math>C_{(-)}</math> is the left context of length <math>N_{(-)}</math> (e.g. if <math>N_{(-)} = 1</math>, then <math>C_{(-)} = \sigma[t - 1]</math>), and <math>C_{(+)}</math> is the right context of length <math>N_{(+)}</math>. Usually, special sentence marks "#" are added at the leftmost and rightmost ends of the ambiguity class sequence so that the formula also applies at the sentence boundaries.
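Because the approximation is a product of per-position factors, the best tag can be chosen independently at each position from its window of ambiguity classes. The following Python sketch shows this under stated assumptions: the probability table `prob`, its toy values, and the names `context` and `sw_tag` are hypothetical and do not come from the Apertium implementation; the table is assumed to have been estimated beforehand.

```python
SENTENCE_MARK = "#"


def context(sigmas, t, n_left, n_right):
    """Return (C_minus, C_plus) for 0-based position t, padded with '#'."""
    padded = [SENTENCE_MARK] * n_left + list(sigmas) + [SENTENCE_MARK] * n_right
    i = t + n_left  # index of sigma[t] inside the padded sequence
    return tuple(padded[i - n_left:i]), tuple(padded[i + 1:i + 1 + n_right])


def sw_tag(sigmas, prob, n_left=1, n_right=1):
    """Pick, at each position, the tag in sigma[t] with the highest
    p(tag | C_minus sigma[t] C_plus).

    prob: dict mapping (C_minus, sigma, C_plus) -> {tag: probability}
          (assumed to be given; unseen entries count as 0.0).
    """
    tags = []
    for t, sigma in enumerate(sigmas):
        c_minus, c_plus = context(sigmas, t, n_left, n_right)
        table = prob.get((c_minus, sigma, c_plus), {})
        # Candidates are restricted to the tags of the ambiguity class;
        # sorting makes tie-breaking deterministic.
        best = max(sorted(sigma), key=lambda g: table.get(g, 0.0))
        tags.append(best)
    return tags


# Toy example with N(-) = N(+) = 1 (all values are made up).
DET = frozenset({"det"})
NV = frozenset({"noun", "verb"})
sigmas = [DET, NV]
prob = {
    (("#",), DET, (NV,)): {"det": 1.0},
    ((DET,), NV, ("#",)): {"noun": 0.9, "verb": 0.1},
}
result = sw_tag(sigmas, prob)
```

With this toy table, the ambiguous second word is resolved to "noun" because its window (a determiner on the left, the sentence mark on the right) gives that tag the higher probability.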





The LSW tagger