User:Gang Chen/GSoC 2013 Application: "Sliding Window PoS Tagger"

== Title ==
'''A Sliding-Window Drop-in Replacement for the HMM Part-of-Speech Tagger in Apertium'''
== Why is it you are interested in machine translation? ==
I major in Natural Language Processing, and became interested in Machine Translation after taking a course on it. Two years ago, I started an internship on the statistical machine translation team of an Internet company. Since then, I have been in closer touch with the classic IBM models, n-gram language models, various decoding algorithms, and many excellent open-source toolkits. I was greatly attracted by the ideas and applications of machine translation. Reading the papers and getting the toolkits to work properly brought me a lot of fun, and I still remember a sleepless night spent thinking about the neural network language model. A language is a very complicated system, and translating a sentence from one language to another has always been a challenging task. Meanwhile, the need for translation is growing rapidly, so developing better machine translation systems can also benefit society. I'm interested in machine translation both as a challenging problem to study and for its promising real-life applications.
== Why is it you are interested in the Apertium project? ==
I came across the Apertium project a year ago, when I learned about last year's GSoC.

Firstly, its rule-based methodology attracted me most. I studied linguistics as an undergraduate, and during that time I formed an interest in analyzing language phenomena. The language system is organized in a structural way by nature, so studying its underlying rules should be the most direct way to unveil the mysteries of the system. Although the rules currently in Apertium are approximations to the real structural rules of a language system, many language pairs have demonstrated high translation quality, so hopefully we can expect even higher quality as more sophisticated rules are integrated.

Secondly, Apertium is open to using statistical methods. Although it is a rule-based project, some useful statistical models are among its components, and they do help to improve the translation quality. So I think many developers are willing to contribute to the project, whether they prefer rule-based or empirical methods.

Thirdly, Apertium is an open-source project, so people with the necessary skills can join in and contribute ideas to improve it, for more and better machine translation services. What's also important is that these services can be used by people all around the world.
== Which of the published tasks are you interested in? What do you plan to do? ==
The project '''"Sliding-Window Part-of-Speech Tagger (SWPoST)"'''.
   
=== Part I: How does a SWPoST work? ===
   
==== Overview ====
The task of a PoS tagger is to choose the best PoS tag for each position of the input. According to the Apertium documentation, the input to the tagger is an "ambiguity class sequence", and the output is a normal fine-grained tag sequence.

Firstly, a ''Category'' is generated by grouping similar fine-grained tags together, because the information provided by a morphological analyzer is too detailed for PoS disambiguation.

For example, the morphological analysis (i.e. the normal fine-grained tags) of the input sentence '''"I have seen it"''' is:
   
<pre>
^I/I<num><mf><sg>/prpers<prn><subj><p1><mf><sg>$ ^have/have<vbhaver><inf>/have<vbhaver><pres>/have<vblex><inf>/have<vblex><pres>$ \
^seen/see<vblex><pp>$ ^it/prpers<prn><subj><p3><nt><sg>/prpers<prn><obj><p3><nt><sg>$
</pre>
   
According to the tagger configuration file (or the TSX file), "num.*" will be grouped into a category called "NUM", "prn.subj.*" will be grouped into a category called "PRNSUBJ", etc. So we finally get a category sequence:
 
   
<pre>
^I/NUM/PRNSUBJ/$ ^have/VHAVEINF/VHAVEPRES/INF/PRES$ ^seen/VLEXPP$ ^it/PRNSUBJ/PRNOBJ/$
</pre>
Secondly, a partition of the vocabulary is established so that <math>w_i \equiv w_j</math> if and only if both are assigned the same subset of categories. Each class of the partition is called an '''ambiguity class''', so an ambiguity class is a subset of all categories. The sequence of ambiguity classes is the input to the PoS tagger.
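
To make the category and ambiguity-class mapping concrete, here is a minimal Python sketch (for illustration only; the category names and coarsening patterns are assumptions based on the example above, not the real TSX definitions):

<pre>
import re

# Illustrative coarsening rules; a real TSX file defines many more categories.
CATEGORY_PATTERNS = [
    (re.compile(r'<prn><subj>'), 'PRNSUBJ'),
    (re.compile(r'<prn><obj>'),  'PRNOBJ'),
    (re.compile(r'<num>'),       'NUM'),
    (re.compile(r'<vbhaver><inf>'),  'VHAVEINF'),
    (re.compile(r'<vbhaver><pres>'), 'VHAVEPRES'),
    (re.compile(r'<vblex><inf>'),    'INF'),
    (re.compile(r'<vblex><pres>'),   'PRES'),
    (re.compile(r'<vblex><pp>'),     'VLEXPP'),
]

def category(analysis):
    """Map one fine-grained analysis to its coarse category."""
    for pattern, cat in CATEGORY_PATTERNS:
        if pattern.search(analysis):
            return cat
    return 'UNK'

def ambiguity_class(analyses):
    """The ambiguity class of a word is the set of categories of its analyses."""
    return frozenset(category(a) for a in analyses)

# "have" from the example sentence:
print(ambiguity_class(['have<vbhaver><inf>', 'have<vbhaver><pres>',
                       'have<vblex><inf>', 'have<vblex><pres>']))
# frozenset({'VHAVEINF', 'VHAVEPRES', 'INF', 'PRES'})
</pre>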
==== Problem Formulation ====
For the convenience of discussion, we formulate the PoS tagging problem in Apertium through the following example:

Input: an ambiguity class sequence

<math>\sigma[1] \sigma[2] ... \sigma[L]</math>

where each <math>\sigma[i]</math> has a set of candidate PoS tags, denoted <math>T(\sigma[i])</math>.

Output: the best PoS tag sequence

<math>\gamma[1] \gamma[2] ... \gamma[L]</math>

where each <math>\gamma[i] \in T(\sigma[i])</math>.
==== Tagging procedure ====
The core idea of the SWPoST is very intuitive. For an input ambiguity class <math>\sigma[i]</math>, the tagger predicts the PoS tag by looking at its '''left context''' and '''right context'''. In the paper ([http://www.dlsi.ua.es/~mlf/docum/sanchezvillamil04p.pdf 2004 paper]), the algorithm works best when <math>N_{(-)}</math> (the left context length) is set to 1 and <math>N_{(+)}</math> (the right context length) is also set to 1.

For example, suppose we have the following input:
 
   
<pre>
^A/a/b$ ^B/x/y$ ^C/m/n$
</pre>
   
where <math>A</math>, <math>B</math> and <math>C</math> are the input ambiguity classes (<math>\sigma</math>), while <math>a</math>, <math>b</math>, <math>x</math>, <math>y</math>, <math>m</math> and <math>n</math> are candidate PoS tags (<math>\gamma</math>). The context of <math>B</math> is <math>A</math>_<math>C</math>.

In the tagging procedure, the algorithm first gets the list of tags that <math>B</math> may be assigned, i.e. <math>x</math> and <math>y</math>, and then consults the model to see which of them the context <math>A</math>_<math>C</math> supports most strongly; for example, if <math>P(x|A\_C)</math> is bigger than <math>P(y|A\_C)</math>, then the PoS tag chosen for <math>B</math> is <math>x</math>.
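
As an illustration of this decision rule, here is a minimal Python sketch (the data structure and names are assumptions for the example, not the planned C++ design):

<pre>
def tag_word(candidates, left, right, counts):
    """Pick the candidate tag best supported by the context left _ right.

    counts maps (left, tag, right) to the estimated count n~; the tag with
    the largest count also has the largest P(tag | context), because the
    normalising denominator is the same for every candidate.
    """
    return max(candidates, key=lambda t: counts.get((left, t, right), 0.0))

counts = {('A', 'x', 'C'): 0.9, ('A', 'y', 'C'): 0.1}
print(tag_word(['x', 'y'], 'A', 'C', counts))   # -> x
</pre>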
==== Unsupervised training procedure ====
In the training procedure, the key problem is to estimate the probability that a particular context supports a particular tag. The paper provides an iterative method to approximate that probability from unannotated data.

For computational convenience, the algorithm estimates <math>\tilde{n}_{A\_x\_C}</math>, the number of times that tag <math>x</math> would appear in the context <math>A</math>_<math>C</math>, instead of <math>P(x|A\_C)</math>; the two are interchangeable.
We use a very simple example to describe how the iterative procedure runs.

Suppose we have only two training cases:
 
   
<pre>
(1) A ^B/x/y$ C
(2) A ^D/x/y/z$ C
</pre>
   
where we focus only on the context <math>A</math>_<math>C</math>. Whether <math>A</math> or <math>C</math> is itself ambiguous does not affect this context.

Firstly, for each training case we assume that each candidate tag is equally probable, because we don't have any other information. That is, the first case contributes <math>1/2</math> to each of <math>\tilde{n}_{A\_x\_C}</math> and <math>\tilde{n}_{A\_y\_C}</math>, while the second case contributes <math>1/3</math> to each of <math>\tilde{n}_{A\_x\_C}</math>, <math>\tilde{n}_{A\_y\_C}</math> and <math>\tilde{n}_{A\_z\_C}</math>. This is the initial step (or '''0-th''' step) of the algorithm, and we finally get:

<math>\tilde{n}^{[0]}_{A\_x\_C} = 1 * 1/2 + 1 * 1/3 = 5/6</math>

<math>\tilde{n}^{[0]}_{A\_y\_C} = 1 * 1/2 + 1 * 1/3 = 5/6</math>

<math>\tilde{n}^{[0]}_{A\_z\_C} = 1 * 1/3 = 1/3</math>

Then, we can use these counts to estimate the following probabilities:

<math>p(x|A\_B\_C) = \frac{5/6}{5/6 + 5/6} = 1/2</math>

<math>p(y|A\_B\_C) = \frac{5/6}{5/6 + 5/6} = 1/2</math>

<math>p(x|A\_D\_C) = \frac{5/6}{5/6 + 5/6 + 1/3} = 5/12</math>

<math>p(y|A\_D\_C) = \frac{5/6}{5/6 + 5/6 + 1/3} = 5/12</math>

<math>p(z|A\_D\_C) = \frac{1/3}{5/6 + 5/6 + 1/3} = 1/6</math>

And then, using these probabilities, we can re-estimate the counts as expectations:

<math>\tilde{n}^{[1]}_{A\_x\_C} = 1 * 1/2 + 1 * 5/12 = 11/12</math>

<math>\tilde{n}^{[1]}_{A\_y\_C} = 1 * 1/2 + 1 * 5/12 = 11/12</math>

<math>\tilde{n}^{[1]}_{A\_z\_C} = 1 * 1/6 = 1/6</math>

At this point we have finished one iteration, with all the <math>\tilde{n}_{A\_\gamma\_C}</math> values updated. We can observe that <math>\tilde{n}^{[1]}_{A\_x\_C}</math> and <math>\tilde{n}^{[1]}_{A\_y\_C}</math> have grown, while <math>\tilde{n}^{[1]}_{A\_z\_C}</math> has shrunk, compared to <math>\tilde{n}^{[0]}_{A\_x\_C}</math>, <math>\tilde{n}^{[0]}_{A\_y\_C}</math> and <math>\tilde{n}^{[0]}_{A\_z\_C}</math> respectively, while the total count remains the same: <math>5/6 + 5/6 + 1/3 = 11/12 + 11/12 + 1/6 = 2</math>.

Repeating the above procedure, the iterative process converges to a point where:

<math>\tilde{n}_{A\_x\_C} = 1</math>

<math>\tilde{n}_{A\_y\_C} = 1</math>

<math>\tilde{n}_{A\_z\_C} = 0</math>

This is a simple example to show the iterative procedure of the unsupervised training algorithm. Real data may be more complex, but the core mechanism remains the same.
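
The toy example above can be reproduced with a few lines of Python (a sketch only; the real trainer iterates over a whole corpus and over all contexts at once):

<pre>
from collections import defaultdict

# Two training cases, both in the context A _ C:
# one word with candidate tags {x, y}, one with candidate tags {x, y, z}.
cases = [['x', 'y'], ['x', 'y', 'z']]

# Step 0: distribute one count per case uniformly over its candidate tags.
n = defaultdict(float)
for tags in cases:
    for t in tags:
        n[t] += 1.0 / len(tags)        # n = {x: 5/6, y: 5/6, z: 1/3}

for _ in range(50):                    # iterate until (near) convergence
    new_n = defaultdict(float)
    for tags in cases:
        total = sum(n[t] for t in tags)
        for t in tags:
            new_n[t] += n[t] / total   # expected count from this case
    n = new_n

print({t: round(v, 3) for t, v in n.items()})
# {'x': 1.0, 'y': 1.0, 'z': 0.0}
</pre>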
==== Store the probabilities in a clever way ====

The parameters of a SW PoS tagger are probabilities like:

<math>P(x|A\_C)</math>

<math>P(y|A\_C)</math>

where <math>x</math> and <math>y</math> are PoS tags, while <math>A\_C</math> is the context.

However, in the tagging procedure, when the tagger sees the context <math>A\_C</math>, it simply picks the tag that has the highest probability. So it is actually the '''ranking''' that makes the final decision.

"Store the probabilities in a clever way" means not storing the probabilities themselves, but storing the ranking of the tags for each context, for example:

<pre>
A_C x y
</pre>

This is more efficient, even compared with the double compression the HMM tagger uses for storing probabilities, and it also makes the data easier for linguists to read and edit.
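
A minimal sketch of the idea (the storage format here is just an assumption for illustration):

<pre>
def rankings(prob):
    """Turn P(tag | context) into a per-context list of tags, best first."""
    return {ctx: sorted(tags, key=tags.get, reverse=True)
            for ctx, tags in prob.items()}

prob = {('A', 'C'): {'x': 0.7, 'y': 0.3}}
print(rankings(prob))    # {('A', 'C'): ['x', 'y']}
</pre>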
=== Part II: FORBID and ENFORCE restrictions ===
==== Overview ====
FORBID and ENFORCE restrictions are two ways of putting linguistic knowledge into the part-of-speech tagger. They are described in the tagger configuration file (the TSX file). FORBID restrictions '''forbid certain tag sequences''' from appearing, while ENFORCE restrictions '''require that a given tag be followed only by tags from a specified set'''.

Here is a brief example showing how the TSX file describes FORBID and ENFORCE restrictions.
<pre>
<forbid>
  <label-sequence>
    <label-item label="VSER"/>
    <label-item label="PRNOBJ"/>
  </label-sequence>
  ...
</forbid>
</pre>

<pre>
<enforce-rules>
  <enforce-after label="VHAVEPAST">
    <label-set>
      <label-item label="VLEXPP"/>
      <label-item label="VSERPP"/>
      <label-item label="ADV"/>
      <label-item label="NOT"/>
      <label-item label="PRNSUBJ"/>
    </label-set>
  </enforce-after>
  ...
</enforce-rules>
</pre>
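
For illustration only, here is a Python sketch of how these restrictions could be collected from a TSX-like file (it only handles the elements shown above; the real TSX format is richer):

<pre>
import xml.etree.ElementTree as ET

def read_restrictions(tsx_path):
    """Collect forbidden label sequences and enforce-after sets
    from a TSX-like file containing the elements shown above."""
    root = ET.parse(tsx_path).getroot()

    forbidden = set()
    for seq in root.iter('label-sequence'):          # inside <forbid>
        forbidden.add(tuple(item.get('label')
                            for item in seq.findall('label-item')))

    enforce_after = {}
    for rule in root.iter('enforce-after'):
        enforce_after[rule.get('label')] = {
            item.get('label')
            for item in rule.findall('label-set/label-item')}

    return forbidden, enforce_after
</pre>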
==== The Light SWPoST (LSWPoST) ====
A SW tagger makes decisions according to the context, namely the left and right ambiguity class sequences. Because FORBID and ENFORCE restrictions are defined over tag sequences, a tagger that wants to integrate them has to know its previous decisions. However, the original SW tagger makes each decision independently of the previous tagging history.

A Light Sliding-Window (LSW) PoS tagger was proposed in the paper "Parameter reduction in unsupervisedly trained sliding-window part-of-speech taggers" (Enrique Sánchez-Villamil, Mikel L. Forcada and Rafael C. Carrasco, RANLP 2005) ([http://www.dlsi.ua.es/~mlf/docum/sanchezvillamil05p.pdf 2005 paper]), which drastically reduces the number of parameters of a SW tagger with a negligible performance drop. The main idea of the paper is the following approximation:

<math>P( \gamma |C_{(-)}[t] \sigma[t] C_{(+)}[t]) \approx \sum_{E_{(-)} \in T^{'}(C_{(-)}[t]) \atop E_{(+)} \in T^{'}(C_{(+)}[t])} P(E_{(-)} \gamma E_{(+)}|C_{(-)}[t] \sigma[t] C_{(+)}[t]) </math>

where <math>E_{(-)}</math> and <math>E_{(+)}</math> are the left and right context tag sequences respectively, and <math>T^{'}(.)</math> denotes a function returning the candidate tag sequences of a context. Note that the formula is an approximation, because it assumes that the different <math>E_{(-)} \gamma E_{(+)}</math> sequences are independent.

As an example, we again consider the training cases above, now also showing the candidate tags of the context words:

<pre>
^A/a/b$ ^B/x/y$   ^C/m/n$
^A/a/b$ ^D/x/y/z$ ^C/m/n$
</pre>

The LSW tagger makes the following approximation:

<math>P(x|ABC) \approx P(axm|ABC) + P(axn|ABC) + P(bxm|ABC) + P(bxn|ABC)</math>

<math>P(y|ABC) \approx P(aym|ABC) + P(ayn|ABC) + P(bym|ABC) + P(byn|ABC)</math>

...

Based on these probabilities, the LSW tagger uses the same iterative unsupervised training procedure as the SW tagger. Finally, the LSW training ends with a series of parameters like <math>\tilde{n}_{a\_x\_m}</math>, <math>\tilde{n}_{a\_x\_n}</math>, etc., in contrast to the SW parameters <math>\tilde{n}_{A\_x\_C}</math>, <math>\tilde{n}_{A\_y\_C}</math>, etc.

In this way, the number of parameters of the LSW tagger is reduced to <math>|\Gamma|^{N_{(-)} + N_{(+)} + 1}</math>, compared to <math>|\Sigma|^{N_{(-)} + N_{(+)}} |\Gamma|</math> for the SW tagger. Usually the size of the tag set (<math>|\Gamma|</math>) is relatively small, while the number of ambiguity classes (<math>|\Sigma|</math>) is relatively large.
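
For illustration, a small Python sketch of how an LSW tagger could score a candidate tag from its parameters (names and data structures are assumptions for the example):

<pre>
from itertools import product

def lsw_score(tag, left_tags, right_tags, counts):
    """Approximate the support for `tag` by summing the LSW parameters over
    all tag readings of the left and right neighbouring ambiguity classes."""
    return sum(counts.get((l, tag, r), 0.0)
               for l, r in product(left_tags, right_tags))

# ^A/a/b$ ^B/x/y$ ^C/m/n$ : score the candidates of B
counts = {('a', 'x', 'm'): 0.4, ('b', 'x', 'n'): 0.3, ('a', 'y', 'm'): 0.2}
scores = {t: lsw_score(t, ['a', 'b'], ['m', 'n'], counts) for t in ['x', 'y']}
print(max(scores, key=scores.get))   # -> x
</pre>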
==== How to integrate restrictions using an LSWPoST? ====
Because the LSW tagger works in the tag space, the FORBID and ENFORCE rules can be integrated '''during''' the training procedure.

For example, if the tag sequence

<pre>
a x
</pre>

is forbidden, then the parameters <math>\tilde{n}_{a\_x\_m}</math> and <math>\tilde{n}_{a\_x\_n}</math> will be zero.

Because the iteration is a process of '''successive multiplicative corrections''' to the empirical counts, those zeroes will be maintained during and after training.

ENFORCE restrictions can be integrated similarly to FORBID rules.

For example, if the tag sequence

<pre>
b z
</pre>

is enforced ("z" is enforced after "b"), then the parameters <math>\tilde{n}_{b\_x\_m}</math>, <math>\tilde{n}_{b\_x\_n}</math>, <math>\tilde{n}_{b\_y\_m}</math> and <math>\tilde{n}_{b\_y\_n}</math> will all be zero, and those zeroes will likewise be maintained.
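
A sketch of how such restrictions could be applied to the LSW parameters (the data structures are assumptions; the real implementation would read the restrictions from the TSX file):

<pre>
def apply_restrictions(counts, forbidden, enforce_after):
    """Zero the LSW parameters (left, tag, right) that violate restrictions.

    forbidden:     set of (first_tag, second_tag) pairs that must not occur
    enforce_after: dict mapping a tag to the set of tags allowed to follow it
    """
    for (l, t, r) in list(counts):
        if (l, t) in forbidden or (t, r) in forbidden:
            counts[(l, t, r)] = 0.0
        if l in enforce_after and t not in enforce_after[l]:
            counts[(l, t, r)] = 0.0
        if t in enforce_after and r not in enforce_after[t]:
            counts[(l, t, r)] = 0.0
    return counts

counts = {('a', 'x', 'm'): 0.4, ('a', 'x', 'n'): 0.3, ('b', 'x', 'm'): 0.2}
print(apply_restrictions(counts, forbidden={('a', 'x')}, enforce_after={}))
# the two parameters containing the forbidden pair "a x" become 0.0
</pre>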
==== Two possible solutions using the LSW tagger ====

Summarizing the features of the SW tagger and the LSW tagger, there are currently two promising ways to implement the FORBID and ENFORCE restrictions.

* 1) Training = LSW + SW, Tagging = SW

:* This solution trains an LSW tagger before running the SW training, and uses it to give the SW tagger a better starting point than the default one with evenly distributed probabilities. Then a standard SW tagger is trained.
:* However, in the tagging procedure, in order to satisfy FORBID and ENFORCE restrictions, the SW tagger has to know the previous tagging history.

* 2) Training = LSW, Tagging = LSW

:* This solution trains an LSW tagger and uses it in the tagging procedure. It can integrate FORBID and ENFORCE restrictions naturally.
:* However, the LSW tagger has a slight performance drop compared to the full SW tagger.
=== Part III: General plans for the whole project ===
* Unsupervised version implementation
* FORBID and ENFORCE restrictions implementation
* Minimized FST implementation
* Optionally, supervised version implementation
== Why Google and Apertium should sponsor it? How and who it will benefit in society? ==
The PoS tagger plays an important role in the Apertium translation pipeline. A sliding-window PoS tagger can improve the quality of tagging compared to the current HMM tagger. Besides, the algorithm is intuitive and easy to understand and modify. So its implementation will benefit both Apertium developers and users.
== Work plan ==
=== Coding Challenge ===
* Task description: [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Sliding-window_part-of-speech_tagger coding challenge for "Sliding-Window Part-of-Speech Tagger"]
:* Write a filter that reads the output of the Apertium morphological analyser and writes out either a random (-r) or the first (-f) lexical form for each surface form, in a new format, respecting superblanks.
:* Write a round-trip converter that reads the outputted first or random lexical forms and regenerates the input.
* Solution: [https://github.com/elephantgcc/CodingChallenge CodingChallenge]
* README: [https://github.com/elephantgcc/CodingChallenge/blob/master/README.md README]
=== Community bonding period ===
* Get in further contact with the community, and familiarize myself with the Apertium platform and the project.
* Make preparations and learn the background knowledge needed for the implementation.
=== Week Plan ===
{| class="wikitable" border="1"
|-
!week
!date
!plans
|-
|Week 01
|06.17-06.23
|Implement the unsupervised version of the algorithm.
|-
|Week 02
|06.24-06.30
|Implement the unsupervised version of the algorithm.
|-
|Week 03
|07.01-07.07
|Store the probability data in a clever way, allowing reading and editing with linguistic knowledge. Test with linguistic edits.
|-
|Week 04
|07.08-07.14
|Make tests, check for bugs, and write documentation.
|-
| '''Deliverable #1'''
|
|'''A SWPoST that works with probabilities.'''
|-
|Week 05
|07.15-07.21
|Implement FORBID restrictions using LSW, with the same options and TSX file as the HMM tagger.
|-
|Week 06
|07.22-07.28
|Implement FORBID restrictions using LSW, with the same options and TSX file as the HMM tagger.
|-
|Week 07
|07.29-08.04
|Implement ENFORCE restrictions using LSW, with the same options and TSX file as the HMM tagger.
|-
|Week 08
|08.05-08.11
|Make tests, check for bugs, and write documentation.
|-
|'''Deliverable #2'''
|
|'''A SWPoST that works with FORBID and ENFORCE restrictions.'''
|-
|Week 09
|08.12-08.18
|Implement the minimized FST version.
|-
|Week 10
|08.19-08.25
|Refine code. Optionally, implement the supervised version of the algorithm.
|-
|Week 11
|08.26-09.01
|Make tests, check the code and documentation. Optionally, study further possible improvements.
|-
|Week 12
|09.02-09.08
|Make tests, check the code and documentation.
|-
|'''Deliverable #3'''
|
|'''A full implementation of the SWPoST in Apertium.'''
|}
   
 
== List your skills and give evidence of your qualifications. ==
 
I am currently a second-year postgraduate student majoring in Natural Language Processing.

During the last two years, I have been interning on a statistical machine translation team. There, I brought two language pairs (Spanish-Chinese and Russian-Chinese) into online services. Through that work I explored the whole pipeline of a statistical machine translation system, including bilingual data crawling and filtering, word alignment, model extraction, parameter tuning, etc. Such experience may help me to understand Apertium more quickly.

I also used the Map-Reduce framework to process large amounts of text in order to improve the n-gram language models. This finally gained a +0.5 BLEU quality improvement and did well in a human blind test, with significantly more well-translated sentences for the Chinese-English pair. During the project, I examined and analysed a lot of raw text and n-gram data, read many papers on data selection and domain adaptation, and conducted various kinds of experiments on text filtering and cleaning. That project gave me the habit of analyzing bad cases and data, which may help me to debug and analyze cases more efficiently.

Besides, I also took part in a project for speeding up the translation decoder, and gained a significant 2x efficiency improvement at very little quality cost. We examined a lot of bad cases, analyzed them linguistically, and finally struck a good balance between translation quality and speed. Such experience may help in terms of implementation efficiency.

During my undergraduate period, I took courses on linguistics, mathematics and computer science, where I acquired the basics of linguistic analysis and programming. I also took part in the development of some NLP systems, in particular a supervised HMM PoS tagger written in C++. The linguistic background and the coding experience can help me to better understand the SWPoST and benefit the implementation task.

As for coding skills, I have 3 years' experience programming in Java and C++, and I am also familiar with basic Python and Bash for light-weight tasks.

This is the first time that I have taken part in an open source project, and I'm excited about it! I have been using open source toolkits for a long time, for example the Moses machine translation toolkit, KenLM, the OpenNLP toolkit, etc., and they have all been of great help to me. I'd be happy to make some contributions to the open source community.

With this knowledge and experience in natural language processing, I am confident that I can accomplish the task well.

== List any non-Summer-of-Code plans you have for the Summer ==

In parallel with the Apertium project, I will spare some time preparing for the coming job season starting in September.

However, the Apertium project will be the main focus.
[[Category:GSoC 2013 Student proposals|Gang_Chen]]
