User:Gang Chen/GSoC 2013 Summary


Project

A Sliding-Window Drop-in Replacement for the HMM Part-of-Speech Tagger in Apertium

The Full Paper/Documentation

The full paper/documentation describing the mechanism, implementation, and experimental results of the SW and LSW taggers can be found here:

A Light Sliding Window Part-of-Speech Tagger for the Apertium Free/Open-Source Machine Translation Platform

Summary

Although the title describes the new tagger as a sliding-window drop-in replacement for the HMM tagger, we actually went further in three ways:

1) Besides the Sliding-Window (SW) tagger, we also implemented the Light Sliding-Window (LSW) tagger, and successfully incorporated rules into the LSW tagger.

2) Instead of replacing the original HMM tagger, we implemented the LSW tagger as an extension, so that the functionality of the original HMM tagger is fully preserved.

3) We wrote a paper about the implementation of the SW and LSW taggers, and conducted a series of experiments.

The SW tagger

The Main Idea

Let $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_{|\Gamma|}\}$ be the tag set, and $W = \{w_1, w_2, \ldots\}$ be the words to be tagged.

A partition of $W$ is established so that $w_i \equiv w_j$ if and only if both are assigned the same subset of tags; each class of the partition is called an ambiguity class.

Let $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|}\}$ be the collection of ambiguity classes, where each $\sigma_i$ is an ambiguity class.

Let $T : \Sigma \rightarrow 2^{\Gamma}$ be the function returning the collection $T(\sigma)$ of PoS tags for an ambiguity class $\sigma$.
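The definitions above can be illustrated with a minimal Python sketch. The toy lexicon, the tag names, and the function names are assumptions made up for this example; they are not taken from the Apertium code or data.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> set of possible PoS tags.
LEXICON = {
    "the": {"DET"},
    "a": {"DET"},
    "book": {"NOUN", "VERB"},
    "run": {"NOUN", "VERB"},
    "fast": {"ADJ", "ADV"},
}


def ambiguity_classes(lexicon):
    """Partition the words: two words fall into the same class
    if and only if they are assigned the same subset of tags."""
    classes = defaultdict(set)
    for word, tags in lexicon.items():
        classes[frozenset(tags)].add(word)
    return dict(classes)


def T(sigma):
    """Return the set of tags of an ambiguity class.
    In this sketch a class is identified by its own tag subset."""
    return set(sigma)


classes = ambiguity_classes(LEXICON)
for sigma, words in classes.items():
    print(sorted(T(sigma)), "<-", sorted(words))
```

Identifying each ambiguity class with its tag subset keeps the sketch short; a real tagger would more likely map each class to an integer index.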


The PoS tagging problem may be formulated as follows: given a text $w[1]w[2] \ldots w[L] \in W^{+}$, each word $w[t]$ is assigned (using a lexicon and a morphological analyzer) an ambiguity class $\sigma[t] \in \Sigma$ to obtain the ambiguously tagged text $\sigma[1]\sigma[2] \ldots \sigma[L] \in \Sigma^{+}$; the task of a PoS tagger is to get a tag sequence $\gamma[1]\gamma[2] \ldots \gamma[L] \in \Gamma^{+}$ that is as correct as possible, that is, the one that maximizes the probability of that tag sequence given the ambiguously tagged text:


$$\gamma^{*}[1] \ldots \gamma^{*}[L] = \operatorname*{arg\,max}_{\gamma[t] \in T(\sigma[t])} P(\gamma[1] \ldots \gamma[L] \mid \sigma[1] \ldots \sigma[L])$$


The core of the SW tagger is to use neighboring ambiguity classes to approximate the dependencies:


$$P(\gamma[1] \ldots \gamma[L] \mid \sigma[1] \ldots \sigma[L]) \approx \prod_{t = 1}^{L} p(\gamma[t] \mid C_{(-)}\, \sigma[t]\, C_{(+)})$$


where $t = 1, \ldots, L$, $C_{(-)}$ is the left context of length $N_{(-)}$ (e.g. if $N_{(-)} = 1$, then $C_{(-)} = \sigma[t - 1]$), and $C_{(+)}$ is the right context of length $N_{(+)}$. Usually, a special sentence mark “#” is added at the left and right ends of the ambiguity class sequence so that the formula applies at every position.
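As a rough illustration of the decision rule, the following Python sketch tags each position independently with one ambiguity class of context on each side. The count table, its values, and the function name `sw_tag` are hypothetical; the sketch assumes the probability of a tag in a given context is proportional to a stored count, so comparing counts is enough to pick the best tag.

```python
# Ambiguity classes are represented by frozensets of tags (an assumption
# of this sketch). BOUNDARY plays the role of the "#" sentence mark.
BOUNDARY = frozenset({"#"})
DET = frozenset({"DET"})
NV = frozenset({"NOUN", "VERB"})

# Hypothetical counts n[(left class, tag, right class)].
COUNTS = {
    (BOUNDARY, "DET", NV): 3.0,
    (DET, "NOUN", BOUNDARY): 5.0,
    (DET, "VERB", BOUNDARY): 1.0,
}


def sw_tag(sigmas, counts):
    """Choose, for every position, the tag with the highest count
    between the neighbouring ambiguity classes."""
    padded = [BOUNDARY] + list(sigmas) + [BOUNDARY]
    result = []
    for t in range(1, len(padded) - 1):
        left, sigma, right = padded[t - 1], padded[t], padded[t + 1]
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(sigma),
                   key=lambda g: counts.get((left, g, right), 0.0))
        result.append(best)
    return result


print(sw_tag([DET, NV], COUNTS))
```

Because each position is decided on its own, no dynamic programming over the whole sentence is needed in this sketch.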

Unsupervised Training

Please refer to the full documentation.

The LSW tagger

The Main Difference from the SW Tagger

The SW tagger tags a word by looking at the neighbouring ambiguity classes, and therefore has a number of parameters in $O(|\Sigma|^{N_{(-)} + N_{(+)}} |\Gamma|)$. The LSW tagger tags a word by looking at the neighbouring tags, and therefore has a number of parameters in $O(|\Gamma|^{N_{(-)} + N_{(+)} + 1})$. Usually the tag set size $|\Gamma|$ is significantly smaller than the number of ambiguity classes $|\Sigma|$, which are combinations of tags, so the number of parameters is effectively reduced.
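To get a feeling for the difference, the two bounds can be evaluated for some sizes. The numbers below (50 tags, 300 ambiguity classes, one class or tag of context on each side) are invented for illustration and are not measurements from any Apertium language pair.

```python
def sw_param_count(n_classes, n_tags, n_left, n_right):
    """Order of the SW parameter count: |Sigma|^(N(-) + N(+)) * |Gamma|."""
    return n_classes ** (n_left + n_right) * n_tags


def lsw_param_count(n_tags, n_left, n_right):
    """Order of the LSW parameter count: |Gamma|^(N(-) + N(+) + 1)."""
    return n_tags ** (n_left + n_right + 1)


N_TAGS = 50      # hypothetical |Gamma|
N_CLASSES = 300  # hypothetical |Sigma|

sw = sw_param_count(N_CLASSES, N_TAGS, 1, 1)
lsw = lsw_param_count(N_TAGS, 1, 1)
print(f"SW: {sw:,}  LSW: {lsw:,}  ratio: {sw / lsw:.1f}")
```

With these made-up sizes the SW table is 36 times larger than the LSW table; the gap widens as the context grows.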

The LSW tagger approximates the best tag as follows:


$$\gamma^{*} = \operatorname*{arg\,max}_{\gamma \in T(\sigma[t])} \sum_{\substack{E_{(-)} \in T'(C_{(-)}[t]) \\ E_{(+)} \in T'(C_{(+)}[t])}} P(E_{(-)} \gamma E_{(+)} \mid C_{(-)}[t]\, \sigma[t]\, C_{(+)}[t])$$

where $T' : \Sigma^{*} \rightarrow \Gamma^{*}$ returns the set of tag sequences for an ambiguity class sequence, and $E_{(-)}$ and $E_{(+)}$ are the left and right tag sequences respectively.
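The summation can be sketched in Python for the simplest window, one tag on each side. The count table and the name `lsw_tag` are hypothetical. The sketch also assumes the probability of a tag sequence in a window is proportional to its stored count; since the normalizing term is the same for every candidate tag, summing raw counts gives the same argmax.

```python
from itertools import product

# Ambiguity classes are frozensets of tags; BOUNDARY stands for "#".
BOUNDARY = frozenset({"#"})
DET = frozenset({"DET"})
NV = frozenset({"NOUN", "VERB"})

# Hypothetical counts n[(left tag, tag, right tag)].
COUNTS = {
    ("#", "DET", "NOUN"): 3.0, ("#", "DET", "VERB"): 1.0,
    ("DET", "NOUN", "NOUN"): 1.0, ("DET", "NOUN", "VERB"): 4.0,
    ("DET", "VERB", "NOUN"): 2.0, ("DET", "VERB", "VERB"): 0.5,
    ("NOUN", "NOUN", "#"): 1.0, ("VERB", "NOUN", "#"): 2.0,
    ("NOUN", "VERB", "#"): 3.0, ("VERB", "VERB", "#"): 0.5,
}


def lsw_tag(sigmas, counts):
    """For each position, sum the counts over every tag the neighbouring
    classes allow, and keep the tag with the largest sum."""
    padded = [BOUNDARY] + list(sigmas) + [BOUNDARY]
    result = []
    for t in range(1, len(padded) - 1):
        left, sigma, right = padded[t - 1], padded[t], padded[t + 1]

        def score(gamma):
            # Enumerate all (left tag, right tag) pairs in the window.
            return sum(counts.get((e_l, gamma, e_r), 0.0)
                       for e_l, e_r in product(left, right))

        result.append(max(sorted(sigma), key=score))
    return result


print(lsw_tag([DET, NV, NV], COUNTS))
```

The table is indexed by tags only, so its size does not depend on how many ambiguity classes the lexicon produces.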

Forbid and Enforce Rules for the LSW Tagger

The following is a fragment of the forbid and enforce rules.

<forbid>
	<label-sequence> 
		<label-item label="VHAVEPP"/> 
		<label-item label="INF"/> 
	</label-sequence> 
	……
</forbid>
		
<enforce-rules>
	<enforce-after label="VHAVEINF"> 
		<label-set> 
			<label-item label="VLEXPP"/> 
			<label-item label="VSERPP"/> 
			<label-item label="ADV"/> 
		</label-set> 
	</enforce-after> 
	……
</enforce-rules>

The rules cannot be used in an SW tagger, because the parameters of an SW tagger are $\tilde{n}_{C_{(-)} \gamma C_{(+)}}$, that is, the counts of a certain tag $\gamma$ appearing between the ambiguity class contexts $C_{(-)}$ and $C_{(+)}$, while the rules constrain sequences of tags.
However, it is quite easy to incorporate the rules into an LSW tagger, because the parameters of an LSW tagger are $\tilde{n}_{E_{(-)} \gamma E_{(+)}}$, that is, the (effective) counts of a certain tag $\gamma$ appearing between the tag sequence contexts $E_{(-)}$ and $E_{(+)}$.

The rules for the LSW tagger can be introduced right after the initial step, in a similar way as in the HMM tagger. For a tag sequence in the parameter space, if any two consecutive tags match a forbid rule or fail to match an enforce rule, the underlying parameter $\tilde{n}_{E_{(-)} \gamma E_{(+)}}$ is given a starting value of zero.
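A minimal Python sketch of this zeroing step follows. It uses the two rules from the fragment above; the data structures, the function names, and the initial count values are assumptions for illustration, not the structures used in the actual implementation.

```python
# Rules taken from the fragment above, stored in hypothetical structures:
# forbidden pairs of consecutive tags, and for each enforcing tag the set
# of tags that are allowed to follow it.
FORBID = {("VHAVEPP", "INF")}
ENFORCE_AFTER = {"VHAVEINF": {"VLEXPP", "VSERPP", "ADV"}}


def violates_rules(tag_seq):
    """True if any two consecutive tags match a forbid rule
    or fail to match an enforce rule."""
    for prev, nxt in zip(tag_seq, tag_seq[1:]):
        if (prev, nxt) in FORBID:
            return True
        allowed = ENFORCE_AFTER.get(prev)
        if allowed is not None and nxt not in allowed:
            return True
    return False


def apply_rules(counts):
    """Return a copy of the parameters in which every tag sequence
    that violates a rule starts from zero."""
    return {seq: (0.0 if violates_rules(seq) else n)
            for seq, n in counts.items()}


# Hypothetical initial parameters, keyed by (left tag, tag, right tag).
initial = {
    ("VHAVEPP", "INF", "ADV"): 1.0,       # matches the forbid rule
    ("VHAVEINF", "VLEXPP", "ADV"): 1.0,   # satisfies the enforce rule
    ("VHAVEINF", "INF", "ADV"): 1.0,      # fails the enforce rule
    ("DET", "NOUN", "VERB"): 1.0,         # untouched by both rules
}
print(apply_rules(initial))
```

This works directly because the LSW parameters are indexed by tag sequences, which is exactly what the rules are written over.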

Unsupervised Training

Please refer to the full documentation.