Difference between revisions of "User:Nikita Medyankin/GSoC 2016 WTR Proposal"

From Apertium
m (Why Google and Apertium should sponsor it)
(Edited in some feedback from Francis: added choice of pairs, reworked the schedule)
Line 7: Line 7:
*'''cell phone:''' +79260366763
*'''cell phone:''' +79260366763
*'''github:''' [https://github.com/tiefling-cat/ https://github.com/tiefling-cat/]
*'''github:''' [https://github.com/tiefling-cat/ https://github.com/tiefling-cat/]
*'''sourceforge:''' tiefling-cat


== Questions from Apertium ==
== Questions from Apertium ==
Line 47: Line 48:
As I understand the task now, the deliverables should be as follows:
As I understand the task now, the deliverables should be as follows:


*(1) Some piece(s) of code to accompany the process of computing the weights given the corpus and the rules so anyone can compute the weights for a given language pair.
*'''(1) Standalone training script.''' That one will be used for computing the weights given the corpus and the rules. The design must be pretty straightforward, so anyone would be able to compute the weights for a given language pair.


*(2) A piece of code in C++ integrated into Apertium that chooses the rules given the weights.
*'''(2) The code in C++ integrated into Apertium''' to choose the rules given the weights.


We agreed with Francis on that the weights are ought to be put into a separate file (think xml or maybe yaml, it's a topic to discuss).
In order to make it, I first have to do the weight obtaining process manually on some language pair, experiment, check if it works at all, then implement (1) and (2).

In order to make it, I first have to obtain the weights manually on some language pair, experiment, check if it works at all, then implement (1) and (2).


The rough outline of the weight obtaining process is meant to be as follows:
The rough outline of the weight obtaining process is meant to be as follows:
Line 60: Line 63:
# Run supervised training, get the weights.
# Run supervised training, get the weights.


==== Example pairs ====
I think of en-es and rus-ukr pairs as the first is a released one, the second is in the [[Nursery|nursery]], and both have at least one language I know. Those are also the languages I can easily get a corpus of if need be.
'''Toy pair'''
First, a toy language pair will be created for the purposes of coding, debugging and experimenting in a small controlled environment. It would be a simple language pair with a few conflicting rules, and a small corpus of maybe twenty sentences or even pieces of sentences. For example, it might be a rus-eng pair with rules mostly regarding translating Genitive.

'''Real pairs'''
For experiments on real language material, some pairs with rich collection of rules should be chosen, e.g., en-es, eu-es, eng-kaz. Francis told me that the rules are now designed so that the ambiguity is not present (I've encountered at least two pairs of conflicting rules in apertium-en-es.en-es.t1x file while doing the coding challenge, although that is probably because I've been calculating the coverages on ambiguous output of lt-proc). Thus, I will need the help of the mentors to deliberately add conflicting rules to the existing set.


==== Documentation ====
==== Documentation ====
Besides the documentation for the code itself, I think а fully documented process of weight obtaining for one or two language pairs will come in handy.
Besides the documentation for the code itself, I think а fully documented process of weight obtaining will come in handy. As per Francis' advice, the code itself will be documented alongside the coding process, because as he put it, if the documentation is left for the last part, it's not done at all.


==== Schedule ====
==== Schedule ====
'''Community Bonding Period (22th April—22th May)'''
'''Community Bonding Period (22th April—22th May)'''


*Coming to know about corpora available for training, obtaining some by myself if needed.
*Getting to better know the workflow of how the rules are applied as I only worked with .t1x file while doing the coding challenge.
*Getting to better know the workflow of how the rules are applied as I only worked with .t1x file while doing the coding challenge.
*Coming to know the core Apertium code.
*Getting acquainted with kenlm.
*Producing the toy pair.
*Introducing ambiguity into chosen real language pairs.


'''Week 1 (23th—29th May)'''
'''Week 1 (23th—29th May)'''


Obtaining the weights by hand for toy pair.
Improving the coding challenge script to take all transfer rules stages into account.


'''Week 2 (30th May—5th June)'''
'''Week 2 (30th May—5th June)'''


Obtaining the rules by hand for en-es.
Obtaining the weights by hand for en-es. Experimenting on the process.


'''Week 3 (6th—12th June)'''
'''Week 3 (6th—12th June)'''
Line 86: Line 95:
'''Week 4 (13th—19th June)'''
'''Week 4 (13th—19th June)'''


Implementation of deliverable (1).
Documenting the rule obtaining process.

Testing and bug fixing.


'''Mid-term evaluation'''
'''Mid-term evaluation'''
Line 94: Line 105:
'''Week 5 (20th—26th June)'''
'''Week 5 (20th—26th June)'''


Start of implementation of deliverable (2).
Additional experiments on weight obtaining process.

Obtaining the rules for rus-ukr.


'''Week 6 (27th June—3th July)'''
'''Week 6 (27th June—3th July)'''


Implementation of deliverable (2).
Coming to know the core Apertium code.


'''Week 7 (4th—10th July)'''
'''Week 7 (4th—10th July)'''
Line 126: Line 135:
'''Week 12 (8th—14th August)'''
'''Week 12 (8th—14th August)'''


Implementation of deliverable (2).
Testing, bug fixing, documentation.

Testing and bug fixing.


'''Final week (15th—23th August)'''
'''Final week (15th—23th August)'''


Testing, bug fixing, documentation.
Final testing and bug fixing.


=== Skills and qualification ===
=== Skills and qualification ===

Revision as of 16:57, 23 March 2016

Name

Nikita Medyankin

Contacts

  • e-mail: nikita.medyankin@gmail.com
  • IRC: tiefling-cat on #apertium at irc.freenode.net
  • cell phone: +79260366763
  • github: https://github.com/tiefling-cat/
  • sourceforge: tiefling-cat

Questions from Apertium

Why is it you are interested in machine translation?

As I understand, modern rule-based machine translation is a fine mix of linguistics, various NLP tools, machine learning, coding, and maths. I love all those things, and I also love when things from assorted fields are put together to produce something complex. RBMT is a very good field to find a use for the skills and knowledge I obtained while studying as a linguist.

Why is it that you are interested in Apertium?

I first heard about Apertium from the information letter of the RBMT summer school in Alacant. Honestly, I was a bit confused, because I had previously thought that rule-based MT had been rendered somewhat obsolete by statistical methods. But then I had a quick chat with Francis at HSE. Francis showed me a ukr-rus translation, and I was amazed at the quality of the resulting Russian text.

Then we had a meeting about GSoC with Francis and Ekaterina, and afterwards Francis told me that the main idea was to translate between pairs of related languages. Google Translate only has huge parallel corpora of English paired with something else, so it translates twice, through English, when asked to translate from non-English to non-English, whereas Apertium can provide superior results by translating directly. I also see that there are pairs of not-so-closely-related languages, but I imagine they might also be useful because of the direct nature of the translation, just a bit trickier to develop.

All of the above made me very excited about the project. I also can see that Apertium is done by a band of very inspired people, and it's always great to work with people who really care about the results.

Which of the published tasks are you interested in?

I have chosen the Weighted transfer rules.

I asked Francis which ideas for Alacant and GSoC I should look at, given that I'm more of a code monkey than a linguist (and I don't know any languages besides Russian and English anyway). He pointed to Weighted transfer rules, and I really liked the idea because, personally, I hate it when ambiguity has to be resolved by just choosing the first alternative from the list. I have some history of battling case/number ambiguity of Russian nouns myself, in a syntax parsing website project done in collaboration with Kira Droganova (check it out at http://web-corpora.net/wsgi3/ru-syntax/).

I also love machine learning and related tasks and did quite a few ML tasks while studying as a linguist at HSE, and the opportunity to learn new tools or to put my skills to good use is always exciting.

What do you plan to do?

Code 'em all.

The proposal

Apertium Weighted Transfer Rules

Why Google and Apertium should sponsor it

Well, regarding the task: at present the rules are devised so as to prevent any conflicts, and that limits the system's capabilities. Anyone who devised rules without regard to possible conflicts would basically be relying on good luck, and maybe some limited foresight, for the right rule to be chosen, which is only slightly better than choosing a rule at random.

Ambiguity is an inevitable and natural feature of any natural language. Resolving it is the most difficult thing in NLP, but also the most important. I regard this task as interesting both for the quality improvements it is meant to bring over the current state and as a sort of case study.

How and who it will benefit in society

Well, as long as there is a variety of languages in the world, people will always need translations. While I personally still think that translating fine fiction is work for a human artist, machine translation is very useful when you need to read documentation, wikis, websites, and the like in a foreign language. Any improvement to translation quality we can think of will make translations clearer and eventually make the world better and bring people closer to each other.

Work plan

Preliminaries

I have installed all the Apertium tools and a couple of language pairs for experimenting, gone through the tutorial for the transfer rules, and done the coding challenge. You can check it out at https://github.com/tiefling-cat/apertium-rules.

Deliverables

As I understand the task now, the deliverables should be as follows:

  • (1) Standalone training script. It will be used for computing the weights given the corpus and the rules. The design must be straightforward, so that anyone is able to compute the weights for a given language pair.
  • (2) The code in C++ integrated into Apertium to choose the rules given the weights.

Francis and I agreed that the weights ought to be put into a separate file (think XML or maybe YAML; it's a topic to discuss).
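Since the file format is still undecided, the following is only a sketch of what an XML variant could look like and how it might be consumed. The element and attribute names (`rule-weights`, `rule`, `id`, `weight`), the rule ids, and the helper functions `load_weights` and `choose_rule` are all hypothetical, and Python is used for brevity even though deliverable (2) is meant to be C++.

```python
import xml.etree.ElementTree as ET

# Hypothetical weights file: element and attribute names are assumptions,
# not an existing Apertium format.
WEIGHTS_XML = """
<rule-weights>
  <rule id="gen-of" weight="0.7"/>
  <rule id="gen-poss" weight="0.3"/>
</rule-weights>
"""


def load_weights(xml_text):
    """Parse the hypothetical weights XML into a {rule_id: weight} dict."""
    root = ET.fromstring(xml_text)
    return {r.get("id"): float(r.get("weight")) for r in root.iter("rule")}


def choose_rule(matching_rule_ids, weights, default_weight=0.0):
    """Among rules matching the same input, pick the one with the highest weight.

    Rules missing from the weights file get default_weight; ties go to the
    rule listed first, which mimics a "first rule wins" fallback.
    """
    return max(matching_rule_ids, key=lambda rid: weights.get(rid, default_weight))


weights = load_weights(WEIGHTS_XML)
best = choose_rule(["gen-poss", "gen-of"], weights)
print(best)
```

Keeping the weights outside the rule files in this way would let the same rules be retrained on a different corpus without touching the rules themselves.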

To get there, I first have to obtain the weights manually for some language pair, experiment, and check whether it works at all, and then implement (1) and (2).

The rough outline of the weight obtaining process is meant to be as follows:

  1. Choose a language pair, get a corpus of texts in the source language.
  2. For each sentence in the corpus, get all translations allowed by the rules present, or a subset of them (LRLM, optimal coverage, whatever).
  3. Score the variants using a target language model, like kenlm.
  4. Run supervised training, get the weights.
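The outline above can be illustrated with a toy sketch. It is not the planned training procedure: plain fractional counting stands in for the supervised training step, the language-model log-scores are hard-coded instead of coming from kenlm, and the rule ids and function names are made up for the example.

```python
import math
from collections import defaultdict


def normalize(log_scores):
    """Turn language-model log-scores into probabilities (softmax)."""
    top = max(log_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


def estimate_weights(corpus):
    """Accumulate fractional counts per rule.

    corpus: list of sentences; each sentence is a list of candidate
    translations given as (rules_used, lm_log_score) pairs.
    Each candidate's normalized probability is credited to every rule
    that was applied to produce it.
    """
    counts = defaultdict(float)
    for candidates in corpus:
        probs = normalize([score for _, score in candidates])
        for (rules, _), p in zip(candidates, probs):
            for rule in rules:
                counts[rule] += p
    return dict(counts)


# Two toy sentences, each with two competing translations (made-up scores).
toy_corpus = [
    [(["gen-of"], -10.0), (["gen-poss"], -12.0)],
    [(["gen-of"], -8.0), (["gen-poss"], -8.0)],
]
weights = estimate_weights(toy_corpus)
print(weights)
```

In this sketch a rule that tends to produce translations the language model prefers ends up with a larger count, which is the intuition behind scoring the variants before training.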

Example pairs

Toy pair. First, a toy language pair will be created for the purposes of coding, debugging and experimenting in a small controlled environment. It would be a simple language pair with a few conflicting rules and a small corpus of maybe twenty sentences or even sentence fragments. For example, it might be a rus-eng pair with rules mostly concerned with translating the genitive.

Real pairs. For experiments on real language material, pairs with a rich collection of rules should be chosen, e.g., en-es, eu-es, eng-kaz. Francis told me that the rules are currently designed so that no ambiguity is present. (I did encounter at least two pairs of conflicting rules in the apertium-en-es.en-es.t1x file while doing the coding challenge, although that is probably because I was calculating the coverages on the ambiguous output of lt-proc.) Thus, I will need the help of the mentors to deliberately add conflicting rules to the existing set.
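As a rough illustration of what "conflicting rules" means here, the sketch below groups rules by their pattern and reports patterns shared by more than one rule. The XML is a made-up, stripped-down fragment in the style of a .t1x file (no category definitions or actions), the rule comments and category names are invented, and `find_conflicts` is a hypothetical helper. Real conflicts also depend on which lexical units the categories match, which this check ignores.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up fragment in the style of a .t1x file, for illustration only.
T1X_SAMPLE = """
<transfer>
  <section-rules>
    <rule comment="RULE: noun noun-gen (of)">
      <pattern><pattern-item n="nom"/><pattern-item n="nom-gen"/></pattern>
    </rule>
    <rule comment="RULE: noun noun-gen (possessive)">
      <pattern><pattern-item n="nom"/><pattern-item n="nom-gen"/></pattern>
    </rule>
    <rule comment="RULE: adj noun">
      <pattern><pattern-item n="adj"/><pattern-item n="nom"/></pattern>
    </rule>
  </section-rules>
</transfer>
"""


def find_conflicts(t1x_text):
    """Return {pattern: [rule comments]} for patterns used by several rules."""
    root = ET.fromstring(t1x_text)
    by_pattern = defaultdict(list)
    for rule in root.iter("rule"):
        # A pattern is the sequence of category names of its pattern items.
        pattern = tuple(item.get("n") for item in rule.iter("pattern-item"))
        by_pattern[pattern].append(rule.get("comment"))
    return {p: names for p, names in by_pattern.items() if len(names) > 1}


conflicts = find_conflicts(T1X_SAMPLE)
for pattern, names in conflicts.items():
    print(" ".join(pattern), "->", names)
```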

Documentation

Besides the documentation for the code itself, I think a fully documented process of obtaining the weights will come in handy. As per Francis' advice, the code itself will be documented alongside the coding process because, as he put it, if the documentation is left for last, it doesn't get done at all.

Schedule

Community Bonding Period (22nd April–22nd May)

  • Getting to know better how the rules are applied, as I only worked with the .t1x file while doing the coding challenge.
  • Getting to know the core Apertium code.
  • Producing the toy pair.
  • Introducing ambiguity into chosen real language pairs.

Week 1 (23rd–29th May)

Obtaining the weights by hand for the toy pair.

Week 2 (30th May–5th June)

Obtaining the weights by hand for en-es. Experimenting with the process.

Week 3 (6th–12th June)

Implementation of deliverable (1).

Week 4 (13th–19th June)

Implementation of deliverable (1).

Testing and bug fixing.

Mid-term evaluation

Deliverable (1).

Week 5 (20th–26th June)

Start of implementation of deliverable (2).

Week 6 (27th June–3rd July)

Implementation of deliverable (2).

Week 7 (4th–10th July)

Implementation of deliverable (2).

Week 8 (11th–17th July)

RBMT Summer School in Alacant: working on Apertium User-friendly lexical selection training (see Non-GSoC plans below).

Week 9 (18th–24th July)

RBMT Summer School in Alacant: working on Apertium User-friendly lexical selection training (see Non-GSoC plans below).

Week 10 (25th–31st July)

Implementation of deliverable (2).

Week 11 (1st–7th August)

Implementation of deliverable (2).

Testing and bug fixing.

Week 12 (8th–14th August)

Implementation of deliverable (2).

Testing and bug fixing.

Final week (15th–23rd August)

Final testing and bug fixing.

Skills and qualification

I graduated from MSU, Faculty of Mechanics and Mathematics, Department of Computational Mathematics, where I gained some proficiency in math, algorithms, databases, C/C++ coding, and sound reasoning in general.

After that, I entered the master's program in Computational Linguistics at HSE, which is quite heavy on programming, so there I gained some experience in Python and machine learning, spent some time coding text and data processing tools, and picked up a bit of web development skill to present the results or build simple web interfaces for my tools.

Check out my most recent NLP project, which I developed in close collaboration with Kira Droganova, at http://web-corpora.net/wsgi3/ru-syntax/. A collection of my other projects and tasks for the HSE program can be found at https://bitbucket.org/namelessone/ but please keep in mind that some of them are old, and I have come to understand NLP and Python better since then.

I mostly code in Python, but C, C++, and sh are also readily available if need be. I can easily adapt to using external tools and integrating with existing code. I also have a penchant for keeping code tidy and neat (and simple).

Non-GSoC plans

I am going to apply to the RBMT Summer School in Alacant, Spain, to work there on another Apertium task, namely User-friendly lexical selection training. If I am accepted, I will need a vacation from my GSoC assignment 11th–22nd July. As advised by Tommi Pirinen, I plan to spread those missing hours over the other weeks.

I will also need some time for my graduate work, but I'll do my best to finish the largest chunk of it by the 22nd of April.