User:Gang Chen/GSoC 2013 Application: "Sliding Window PoS Tagger"

Proposal Title

A sliding-window drop-in replacement for the HMM part-of-speech tagger in Apertium

Name

Name: Gang Chen

Contact Information

Email: pkuchengang@gmail.com

IRC: Gang

GitHub Repo: https://github.com/elephantgcc

Why is it you are interested in machine translation?

I major in Natural Language Processing. I became interested in Machine Translation after taking a course on it.

Two years ago, I started an internship on the statistical machine translation team of an Internet company. Since then, I have been in deeper touch with the classic IBM models, n-gram language models, various decoding algorithms, and many excellent open-source toolkits. I was greatly attracted by the ideas and applications of machine translation. Reading the papers and getting the toolkits to work properly brought me much fun, and I still remember the sleepless night spent thinking about the neural network language model.

A language is a very complicated system, and translating a sentence from one language to another has always been a challenging task. Meanwhile, the need for translation is growing rapidly, so developing better machine translation systems can also benefit society.

Why is it you are interested in the Apertium project?

I came across the Apertium project a year ago, when I learned about last year's GSoC.

Firstly, its rule-based methodology attracted me most. I studied linguistics as an undergraduate, and during that time I formed an interest in analysing language phenomena. A language system is structural in nature, so studying its underlying rules is the most direct way to unveil the mysteries of the system. Although the rules currently in Apertium are approximations to the real structural rules of a language system, many language pairs have demonstrated high translation quality. So hopefully, we can expect even higher quality as more sophisticated rules are integrated.

Secondly, Apertium is open to statistical methods. Although it is a rule-based project, some of its components are useful statistical models, and they do help to improve the translation quality. I think this makes more developers willing to contribute to the project.

Thirdly, Apertium is an open-source project, so people with the necessary skills can join in and contribute ideas to improve it, providing more and better machine translation services. What's also important is that these services can be used by people all around the world.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in the project "Sliding-window part-of-speech tagger".

Currently the PoS tagger in Apertium is an HMM tagger, while this project aims at implementing a sliding-window PoS tagger, for better quality and efficiency.

Sliding Window PoS Tagger

Overview

The task of a PoS tagger is to choose the best PoS tag for each position of the input. According to the documentation, the input to the tagger is an "ambiguity class sequence", and the output is a normal fine-grained tag sequence.

Firstly, a "Category" is generated by grouping similar fine-grained tags together, because the information provided by a morphological analyser is too detailed for the PoS disambiguation.

For example, the morphological analysis (i.e. the normal fine-grained tags) of the input sentence "I have seen it" is:

^I/I<num><mf><sg>/prpers<prn><subj><p1><mf><sg>$ ^have/have<vbhaver><inf>/have<vbhaver><pres>/have<vblex><inf>/have<vblex><pres>$ \
^seen/see<vblex><pp>$ ^it/prpers<prn><subj><p3><nt><sg>/prpers<prn><obj><p3><nt><sg>$

According to the tagger configuration file (or the TSX file), "num.*" will be grouped into a category called "NUM", "prn.subj.*" will be grouped into a category called "PRNSUBJ", etc. So we finally get a category sequence:

^I/NUM/PRNSUBJ/$ ^have/VHAVEINF/VHAVEPRES/INF/PRES$ ^seen/VLEXPP$ ^it/PRNSUBJ/PRNOBJ/$
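
To make the grouping step concrete, here is a minimal C++ sketch of mapping a fine-grained analysis to a coarse category. The pattern table below is hand-written and purely illustrative; in Apertium the real mapping is defined by the TSX file.

#include <iostream>
#include <string>
#include <vector>

// A pattern is a sequence of fine-grained tags that must appear, in order,
// in the analysis string for the category to apply.
struct CategoryPattern {
    std::string category;            // coarse category name, e.g. "PRNSUBJ"
    std::vector<std::string> tags;   // fine-grained tags, e.g. {"prn", "subj"}
};

// True if `analysis` (e.g. "prpers<prn><subj><p1><mf><sg>") contains all
// tags of `p` in order.
bool matches(const std::string& analysis, const CategoryPattern& p) {
    size_t pos = 0;
    for (const std::string& t : p.tags) {
        pos = analysis.find("<" + t + ">", pos);
        if (pos == std::string::npos) return false;
        ++pos;
    }
    return true;
}

std::string categorize(const std::string& analysis,
                       const std::vector<CategoryPattern>& table) {
    for (const CategoryPattern& p : table)
        if (matches(analysis, p)) return p.category;
    return "UNK";  // fallback for analyses no pattern covers
}

int main() {
    std::vector<CategoryPattern> table = {
        {"PRNSUBJ", {"prn", "subj"}},
        {"NUM",     {"num"}},
        {"VLEXPP",  {"vblex", "pp"}},
    };
    // Each analysis of an ambiguous word is categorized separately.
    std::cout << categorize("prpers<prn><subj><p1><mf><sg>", table) << "\n";  // PRNSUBJ
    std::cout << categorize("I<num><mf><sg>", table) << "\n";                 // NUM
}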

Secondly, a partition of the vocabulary is established, so that two words belong to the same class if and only if they are assigned the same subset of categories. Each class of the partition is called an ambiguity class. For example, all words that can be either NUM or PRNSUBJ, and nothing else, belong to the same ambiguity class.

Problem Formulation

For the convenience of discussion, we formulate the PoS tagging problem in Apertium as follows:

Input: an ambiguity class sequence $\sigma = \sigma_1 \sigma_2 \ldots \sigma_N$,

where each $\sigma_i$ has a set of candidate PoS tags; let $T(\sigma_i)$ denote this set, namely $T(\sigma_i) = \{ t : t \text{ is a candidate tag of } \sigma_i \}$.

Output: the best PoS tag sequence $t_1 t_2 \ldots t_N$,

where each $t_i \in T(\sigma_i)$.

Tagging procedure

The core idea of a SWPoST is easy to understand. For the input ambiguity class $\sigma_i$, the tagger predicts the PoS tag by looking at its left context $\sigma_{i-N_{-}} \ldots \sigma_{i-1}$ and right context $\sigma_{i+1} \ldots \sigma_{i+N_{+}}$. In the paper, the algorithm works best when $N_{-}$ (the left context length) is set to 1, and $N_{+}$ (the right context length) also to 1.

For example, if we have the input as:

^A/a/b$ ^B/x/y$ ^C/m/n$

where A, B, and C are the input ambiguity classes (the $\sigma_i$-s), while a, b, x, y, m, and n are the candidate tags (the $t_i$-s). The context of B is then A _ C.

In the tagging procedure, the algorithm first gets the list of tags that B may be assigned, i.e. x and y; it then consults the model for the context A _ C to see which tag the context supports most. For example, if $p(x \mid A\_C)$ is bigger than $p(y \mid A\_C)$, then the PoS tag chosen for B is x.
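
As an illustration, here is a minimal C++ sketch of this decision rule. The in-memory layout (a map keyed by the left/right context pair) and all names are assumptions of mine, not Apertium's actual API.

#include <iostream>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// (left, right) ambiguity classes surrounding the position to disambiguate.
using Context = std::tuple<std::string, std::string>;
// counts[{left, right}][tag] ~ estimated times `tag` appears in that context.
using CountTable = std::map<Context, std::map<std::string, double>>;

std::string disambiguate(const CountTable& counts,
                         const std::string& left, const std::string& right,
                         const std::vector<std::string>& candidates) {
    std::string best = candidates.front();
    double bestCount = -1.0;
    auto it = counts.find(Context{left, right});
    for (const std::string& tag : candidates) {
        double c = 0.0;
        if (it != counts.end()) {
            auto jt = it->second.find(tag);
            if (jt != it->second.end()) c = jt->second;
        }
        if (c > bestCount) { bestCount = c; best = tag; }
    }
    return best;
}

int main() {
    CountTable counts;
    counts[Context{"A", "C"}] = {{"x", 1.4}, {"y", 0.6}};
    // For ^B/x/y$ in context A _ C, x has the larger count, so x is chosen.
    std::cout << disambiguate(counts, "A", "C", {"x", "y"}) << "\n";  // x
}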

Unsupervised training procedure

In the training procedure, the key problem is to estimate, from unannotated data, the probability that a particular context supports a particular tag. The paper provides an iterative method to approximate that probability.

For computational convenience, the algorithm estimates $\tilde{n}_{C_{-}\_C_{+}}(t)$, the number of times that tag $t$ appears in the context $C_{-}\ \_\ C_{+}$, instead of $p(t \mid C_{-}\_C_{+})$. The two are interchangeable, since the probability is just the count normalized over all candidate tags: $p(t \mid C_{-}\_C_{+}) = \tilde{n}_{C_{-}\_C_{+}}(t) \,/\, \sum_{t'} \tilde{n}_{C_{-}\_C_{+}}(t')$.

We use a very simple example to show how the iterative procedure runs.

Suppose we have only two training cases:

(1) A ^B/x/y$   C
(2) A ^D/x/y/z$ C

where we only focus on the context A _ C. Whether A or C is itself ambiguous or not does not affect the context.

Firstly, for each training case, we assume that every candidate tag is equally possible, because we don't have any other information yet. That is, the first case contributes $\tilde{n}_{A\_C}(x) = 1/2$ and $\tilde{n}_{A\_C}(y) = 1/2$; the second case contributes $\tilde{n}_{A\_C}(x) = 1/3$, $\tilde{n}_{A\_C}(y) = 1/3$, and $\tilde{n}_{A\_C}(z) = 1/3$. This is the initial (or 0th) step of the algorithm, and summing the two cases we finally get:

$\tilde{n}^{(0)}_{A\_C}(x) = \tfrac{1}{2} + \tfrac{1}{3} = \tfrac{5}{6}, \qquad \tilde{n}^{(0)}_{A\_C}(y) = \tfrac{5}{6}, \qquad \tilde{n}^{(0)}_{A\_C}(z) = \tfrac{1}{3}$

Then, we can use these counts to estimate the following probabilities (the total count is $2$):

$p^{(0)}(x \mid A\_C) = \tfrac{5/6}{2} = \tfrac{5}{12}, \qquad p^{(0)}(y \mid A\_C) = \tfrac{5}{12}, \qquad p^{(0)}(z \mid A\_C) = \tfrac{1/3}{2} = \tfrac{1}{6}$

And then, using these probabilities we can re-estimate the counts as expectations: each training case again contributes one occurrence, now split among its candidate tags in proportion to these probabilities (renormalized within the case):

$\tilde{n}^{(1)}_{A\_C}(x) = \tfrac{5/12}{5/12 + 5/12} + \tfrac{5/12}{5/12 + 5/12 + 1/6} = \tfrac{1}{2} + \tfrac{5}{12} = \tfrac{11}{12}, \qquad \tilde{n}^{(1)}_{A\_C}(y) = \tfrac{11}{12}, \qquad \tilde{n}^{(1)}_{A\_C}(z) = \tfrac{1}{6}$

Until now, we have finished one iteration, with all the values updated. We can observe that $\tilde{n}_{A\_C}(x)$ and $\tilde{n}_{A\_C}(y)$ are growing bigger ($5/6 \to 11/12$) and $\tilde{n}_{A\_C}(z)$ is growing smaller ($1/3 \to 1/6$), while the total count still remains the same: $\tilde{n}_{A\_C}(x) + \tilde{n}_{A\_C}(y) + \tilde{n}_{A\_C}(z) = 2$.

Repeating the above procedure, the process will converge to a point with:

$\tilde{n}_{A\_C}(x) = 1, \qquad \tilde{n}_{A\_C}(y) = 1, \qquad \tilde{n}_{A\_C}(z) = 0$
This simple example shows how the iterative procedure of the unsupervised training algorithm runs. Real data may be more complex, but the core mechanism remains the same.
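
For concreteness, here is a minimal C++ sketch of the re-estimation loop on the two toy cases above; the data layout and names are mine, so treat it as a toy re-estimator rather than the real trainer.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
    // Each training case lists the candidate tags seen in context A _ C.
    std::vector<std::vector<std::string>> cases = {{"x", "y"}, {"x", "y", "z"}};

    // Step 0: split each case's one occurrence uniformly among its tags.
    std::map<std::string, double> n;
    for (const auto& c : cases)
        for (const auto& tag : c)
            n[tag] += 1.0 / c.size();   // x: 5/6, y: 5/6, z: 1/3

    for (int iter = 0; iter < 30; ++iter) {
        // Counts -> probabilities: normalize over all tags in the context.
        double total = 0.0;
        for (const auto& kv : n) total += kv.second;
        std::map<std::string, double> p;
        for (const auto& kv : n) p[kv.first] = kv.second / total;

        // Probabilities -> expected counts: each case again contributes one
        // occurrence, split among its candidates in proportion to p.
        std::map<std::string, double> next;
        for (const auto& c : cases) {
            double z = 0.0;
            for (const auto& tag : c) z += p[tag];
            for (const auto& tag : c) next[tag] += p[tag] / z;
        }
        n = next;
    }
    for (const auto& kv : n)
        std::printf("n(%s) = %.4f\n", kv.first.c_str(), kv.second);
    // Converges to n(x) = 1, n(y) = 1, n(z) = 0, as derived above.
}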

FORBID and ENFORCE restrictions

  • Using Parameter Reduction (paper); a small sketch of one possible FORBID check follows below.
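
To show one possible shape of such a restriction, here is a small C++ sketch that drops any candidate tag forming a forbidden pair with the tag already fixed on its left. The pair representation and function names are assumptions of mine; the actual mechanism would follow the parameter-reduction paper and read the pairs from the TSX file.

#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <utility>
#include <vector>

using TagPair = std::pair<std::string, std::string>;  // (left tag, right tag)

// Remove every candidate that would form a forbidden pair with `leftTag`.
std::vector<std::string> applyForbid(const std::string& leftTag,
                                     std::vector<std::string> candidates,
                                     const std::set<TagPair>& forbidden) {
    candidates.erase(
        std::remove_if(candidates.begin(), candidates.end(),
                       [&](const std::string& t) {
                           return forbidden.count(TagPair{leftTag, t}) > 0;
                       }),
        candidates.end());
    return candidates;
}

int main() {
    std::set<TagPair> forbidden = {{"PRNSUBJ", "VLEXPP"}};  // illustrative rule
    for (const auto& t : applyForbid("PRNSUBJ", {"VLEXPP", "VHAVEPRES"}, forbidden))
        std::cout << t << "\n";  // only VHAVEPRES survives
}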

Basic plans

My plans are basically as follows:

Step 1: Read the paper to get a full understanding of the algorithm, and familiarize myself with the Apertium pipeline.
Step 2: Implement an unsupervised version of the algorithm. Store the probability data in a clever way, which allows reading and editing using linguistic knowledge.
Step 3: Implement FORBID and ENFORCE restrictions, probably by introducing Parameter Reduction into the unsupervised algorithm. The SWPoST will use the same options and the same TSX file as the existing HMM tagger, and will be entirely parallel to it.
Step 4: Implement the minimized FST version.
Step 5: Optionally, implement a supervised version of the algorithm, and re-study the algorithm to explore possible improvements.

Why Google and Apertium should sponsor it? How and who it will benefit in society?

The PoS tagger plays an important role in the Apertium translation pipeline. A sliding-window PoS tagger can improve tagging quality over the current HMM tagger. Besides, the algorithm is intuitive, and easy to understand and modify. So its implementation will benefit both Apertium developers and users.

I would be really thankful if Google and Apertium would sponsor this project. I have understood the mathematical model behind the algorithm, with its main idea described above. I think I have the knowledge and skills to accomplish the task, plus the enthusiasm for machine translation. And Apertium is a very nice community that I would like to stay with for a long time.

Work plan

The outline of the development procedure is given above under basic plans. A detailed work plan is as follows:

Coding Challenge

  • Write a filter that reads the output of the Apertium morphological analyser and writes out either a random one (-r) or the first one (-f) of the lexical forms for each surface form, in a new format, respecting superblanks (a simplified sketch follows this list).
  • Write a roundtrip converter that reads the outputted first or random lexical form and regenerates the input.
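
Here is a simplified C++ sketch of the filter half of the challenge, assuming a lexical unit never spans lines and ignoring escaped characters ("\^", "\$", "\/"); the output format and flag handling are illustrative only. Text outside ^...$ units (including superblanks) is copied through untouched.

#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    bool random = (argc > 1 && std::strcmp(argv[1], "-r") == 0);  // else -f
    std::srand(static_cast<unsigned>(std::time(nullptr)));

    std::string line;
    while (std::getline(std::cin, line)) {
        size_t i = 0;
        while (i < line.size()) {
            if (line[i] != '^') { std::cout << line[i++]; continue; }
            size_t end = line.find('$', i);  // end of the lexical unit
            std::string unit = line.substr(i + 1, end - i - 1);
            // Split "surface/lf1/lf2/..." on '/'.
            std::vector<std::string> parts;
            size_t start = 0, slash;
            while ((slash = unit.find('/', start)) != std::string::npos) {
                parts.push_back(unit.substr(start, slash - start));
                start = slash + 1;
            }
            parts.push_back(unit.substr(start));
            // parts[0] is the surface form; parts[1..] are lexical forms.
            size_t pick = 1;
            if (random && parts.size() > 2)
                pick = 1 + std::rand() % (parts.size() - 1);
            std::cout << '^' << parts[0] << '/' << parts[pick] << '$';
            i = end + 1;
        }
        std::cout << '\n';
    }
}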

Community bonding period

  • Get in closer contact with the community, and familiarize myself with the Apertium platform and the project.
  • Make preparations and learn the necessary background that will be used in the implementation.

Week Plan

Week 01 (06.17-06.23): Implement the unsupervised version of the algorithm.
Week 02 (06.24-06.30): Implement the unsupervised version of the algorithm.
Week 03 (07.01-07.07): Store the probability data in a clever way, allowing reading and editing using linguistic knowledge.
Week 04 (07.08-07.14): Run tests, check for bugs, and write documentation.
Deliverable #1: A SWPoST that works with probabilities.
Week 05 (07.15-07.21): Implement FORBID restrictions, probably using Parameter Reduction.
Week 06 (07.22-07.28): Implement FORBID restrictions, probably using Parameter Reduction.
Week 07 (07.29-08.04): Implement ENFORCE restrictions, probably using Parameter Reduction.
Week 08 (08.05-08.11): Run tests, check for bugs, and write documentation.
Deliverable #2: A SWPoST that works with FORBID and ENFORCE restrictions.
Week 09 (08.12-08.18): Implement the minimized FST version.
Week 10 (08.19-08.25): Refine the code. Optionally, implement the supervised version of the algorithm.
Week 11 (08.26-09.01): Run tests, check the code and documentation. Optionally, study further possible improvements.
Week 12 (09.02-09.08): Run tests, check the code and documentation.
Deliverable #3: A full implementation of the SWPoST in Apertium.

Include time needed to think, to program, to document and to disseminate.

  • Overall, programming and thinking will take about 8 weeks.
  • Documentation, surveying, and dissemination will share the remaining 4 weeks.

List your skills and give evidence of your qualifications.

I am currently a second-year postgraduate student majoring in Natural Language Processing.

For the last two years, I have been interning on a statistical machine translation team at Youdao Inc. There, I brought two language pairs (Spanish-Chinese and Russian-Chinese) into online service [1]. With that chance, I explored the whole pipeline of a statistical machine translation system, including bilingual data crawling and filtering, word alignment, model extraction, and parameter tuning. Such experience can help me understand Apertium more quickly.

I also used the Map-Reduce framework to process large amounts of text to improve the n-gram language models. This gained a +0.5 BLEU quality improvement and did well in human judgements for Chinese-English translation. During the project, I inspected and analysed a lot of raw text and n-gram data, read many papers on data selection and domain adaptation, and conducted various kinds of experiments on text filtering and cleaning. That project got me into the habit of analysing bad cases and data, which may help me debug and analyse cases more efficiently.

Besides, I also took part in a project for speeding up the translation decoder, which gained a significant 2x efficiency improvement at very little quality cost. We examined a lot of bad cases, analysed them linguistically, and finally struck a good balance between translation quality and speed. Such experience may help with implementation efficiency.

During my undergraduate period, I took courses on linguistics, mathematics, and computer science, where I acquired basic knowledge of linguistic analysis and programming. I also took part in the development of some NLP systems, such as the construction of the Chinese Concept Dictionary (a WordNet-like Chinese ontology), a lexical similarity computation program written in Java, and especially a supervised HMM PoS tagger written in C++. This linguistic background and coding experience can help me better understand the SWPoST and benefit the implementation.

As for coding skills, I have 3 years' experience programming in Java and C++, and I am also familiar with basic Python and Bash for light-weight tasks.

This is the first time I have taken part in an open-source project, and I'm excited about it! I have been using open-source toolkits for a long time, for example the Moses machine translation toolkit, KenLM, and the OpenNLP toolkit, and they have all been a great help to me. I'd be happy to make some contributions to the open-source community, and will try my best to adapt to the open-source development environment.

With this knowledge of and experience in natural language processing, I am confident that I can accomplish the task well.

List any non-Summer-of-Code plans you have for the Summer

Making some preparations for job seeking, because I am going to graduate next year.