Difference between revisions of "User:Nikita Medyankin/GSoC 2016 WTR Proposal"

From Apertium

Revision as of 16:52, 23 March 2016

Name

Nikita Medyankin

Contacts

Questions from Apertium

Why is it you are interested in machine translation?

As I understand it, modern rule-based machine translation is a fine mix of linguistics, various NLP tools, machine learning, coding, and maths. I love all those things, and I also love it when things from assorted fields are put together to produce something complex. RBMT is a very good field in which to use the skills and knowledge I obtained while studying as a linguist.

Why is it that you are interested in Apertium?

I first heard about Apertium from the information letter of the RBMT summer school in Alacant. Honestly, I was a bit confused, because I had previously thought that rule-based MT had been rendered somewhat obsolete by statistical methods. But after that I had a quick chat with Francis at HSE. Francis showed me the ukr-rus translation, and I was amazed at the quality of the resulting Russian text.

Then we had a meeting about GSoC with Francis and Ekaterina, and after that Francis told me that the main idea was to translate between pairs of related languages. Google Translate only has huge parallel corpora pairing English with other languages, so when asked to translate from one non-English language to another it translates twice, pivoting through English, whereas Apertium can provide superior results by translating directly. I also see that there are pairs of not-so-closely-related languages, but I imagine they might also be useful because of the direct nature of the translation, just a bit trickier to develop.

All of the above made me very excited about the project. I can also see that Apertium is made by a band of very inspired people, and it's always great to work with people who really care about the results.

Which of the published tasks are you interested in?

I have chosen Weighted transfer rules.

I asked Francis which ideas for Alacant and GSoC I should look at, given that I'm more of a code monkey than a linguist (and I don't know any languages besides Russian and English anyway). He pointed at Weighted transfer rules, and I really liked it because, personally, I hate it when ambiguity has to be resolved by just choosing the first alternative from the list. I have some history of battling the case/number ambiguity of Russian nouns myself, for a syntax parsing website done in collaboration with Kira Droganova (check it out at http://web-corpora.net/wsgi3/ru-syntax/).

I also love machine learning and related tasks, and I did quite a few ML tasks while studying as a linguist at HSE. The opportunity to learn new instruments or to put my skills to good use is always exciting.

What do you plan to do?

Code 'em all.

The proposal

Apertium Weighted Transfer Rules

Why Google and Apertium should sponsor it

Well, regarding the task: right now the rules are devised so as to prevent any conflicts, and that limits the system's capabilities. If someone devised rules without regard for possible conflicts, they would basically be relying on good luck and maybe some limited foresight to choose the right rule, which looks only slightly better than choosing a rule at random.

Ambiguity is an inevitable and natural feature of any natural language. Resolving it is the most difficult thing in NLP, but also the most important. I find this task interesting both for the quality improvements it is meant to bring over the current state of things and as a sort of case study.

How and who it will benefit in society

Well, as long as there is a variety of languages in the world, people will always need translations. While I personally still think that translating fine fiction is work for a human artist, machine translation is very useful when you need to read documentation, wikis, websites, and the like in a foreign language. Any improvement to translation quality we can think of will make translations clearer and eventually make the world better and bring people closer to each other.

Work plan

Preliminaries

I have installed all the Apertium tools and a couple of language pairs for experimenting, gone through the tutorial for the transfer rules, and done the coding challenge. You can check it out at https://github.com/tiefling-cat/apertium-rules.

Deliverables

As I understand the task now, the deliverables should be as follows:

  • (1) Some piece(s) of code to accompany the process of computing the weights given the corpus and the rules so anyone can compute the weights for a given language pair.
  • (2) A piece of code in C++ integrated into Apertium that chooses the rules given the weights.
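
To make deliverable (2) a bit more concrete, here is a minimal sketch of the selection logic only, written in Python for brevity although the real thing is meant to be C++ inside Apertium. The rule identifiers, the `weights` dictionary, and the `choose_rule` function are all hypothetical names made up for this illustration; nothing here reflects existing Apertium code.

```python
def choose_rule(matching_rules, weights):
    """Pick the rule with the highest weight among those matching a pattern.

    matching_rules: rule ids in the order they appear in the rules file
                    (hypothetical ids, for illustration only).
    weights:        dict mapping rule id -> weight; rules without a weight
                    are treated as having weight 0.0 (an assumption).
    max() returns the first maximal element, so ties and the no-weights
    case fall back to the first rule in file order.
    """
    return max(matching_rules, key=lambda rule: weights.get(rule, 0.0))


# Two hypothetical rules that both match the same "adjective + noun" pattern.
conflicting = ["adj_noun", "noun_adj"]

picked_with_weights = choose_rule(conflicting, {"adj_noun": 0.1, "noun_adj": 0.9})
picked_without_weights = choose_rule(conflicting, {})
```

With no weights available the sketch simply returns the first matching rule, so it degrades to first-rule-wins rather than failing.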

To get there, I first have to carry out the weight-obtaining process manually on some language pair, experiment, and check whether it works at all, and then implement (1) and (2).

The rough outline of the weight obtaining process is meant to be as follows:

  1. Choose a language pair, get a corpus of texts in source language.
  2. For each sentence in the corpus, get all translations allowed by the rules present, or a subset of them (LRLM, optimal coverage, whatever).
  3. Score the variants using a target language model, like kenlm.
  4. Run supervised training, get the weights.
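
The outline above can be sketched roughly as follows. This is only an illustration under loud assumptions: a toy bigram scorer stands in for a real target language model such as kenlm, the candidate translations and rule ids are invented, and step 4 is replaced by plain normalised fractional counts rather than actual supervised training.

```python
import math
from collections import defaultdict

# Hypothetical toy bigram log-probabilities standing in for a real target LM.
TOY_BIGRAMS = {("la", "casa"): -1.0, ("casa", "roja"): -1.0}
UNSEEN = -5.0  # assumed penalty for bigrams the toy model has never seen


def toy_logprob(sentence):
    """Step 3 stand-in: sum of bigram log-probabilities of a translation."""
    tokens = sentence.split()
    return sum(TOY_BIGRAMS.get(pair, UNSEEN) for pair in zip(tokens, tokens[1:]))


def shares(scored):
    """Softmax over log-scores, so the candidates of one sentence sum to 1."""
    top = max(score for _, score in scored)
    exps = [(rule, math.exp(score - top)) for rule, score in scored]
    total = sum(value for _, value in exps)
    return [(rule, value / total) for rule, value in exps]


def estimate_weights(corpus_candidates, logprob=toy_logprob):
    """Accumulate each rule's share over the corpus and normalise to sum to 1.

    corpus_candidates: for every source sentence, a list of
    (rule_id, translation) pairs, i.e. the output of step 2.
    """
    totals = defaultdict(float)
    for candidates in corpus_candidates:
        scored = [(rule, logprob(text)) for rule, text in candidates]
        for rule, share in shares(scored):
            totals[rule] += share
    norm = sum(totals.values())
    return {rule: value / norm for rule, value in totals.items()}


# One source sentence with two candidate translations from conflicting rules.
corpus_candidates = [
    [("noun_adj", "la casa roja"), ("adj_noun", "la roja casa")],
]
weights = estimate_weights(corpus_candidates)
```

In this toy run the rule producing "la casa roja" ends up with nearly all of the weight, because the stand-in model has seen its bigrams and not the other candidate's.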

I am considering the en-es and rus-ukr pairs: the first is a released pair, the second is in the nursery, and each includes at least one language I know. Those are also languages for which I can easily get a corpus if need be.

Documentation

Besides the documentation for the code itself, I think a fully documented weight-obtaining process for one or two language pairs will come in handy.

Schedule

Community Bonding Period (22nd April–22nd May)

  • Finding out which corpora are available for training, and obtaining some myself if needed.
  • Getting to know better how the rules are applied, as I only worked with the .t1x file while doing the coding challenge.
  • Getting acquainted with kenlm.

Week 1 (23rd–29th May)

Improving the coding challenge script to take all transfer rules stages into account.

Week 2 (30th May—5th June)

Obtaining the rule weights by hand for en-es.

Week 3 (6th—12th June)

Implementation of deliverable (1).

Week 4 (13th—19th June)

Documenting the weight-obtaining process.

Mid-term evaluation

Deliverable (1).

Week 5 (20th—26th June)

Additional experiments on weight obtaining process.

Obtaining the rule weights for rus-ukr.

Week 6 (27th June–3rd July)

Getting to know the core Apertium code.

Week 7 (4th—10th July)

Implementation of deliverable (2).

Week 8 (11th—17th July)

RBMT Summer School in Alacant: working on Apertium User-friendly lexical selection training (see Non-GSoC plans below).

Week 9 (18th—24th July)

RBMT Summer School in Alacant: working on Apertium User-friendly lexical selection training (see Non-GSoC plans below).

Week 10 (25th–31st July)

Implementation of deliverable (2).

Week 11 (1st–7th August)

Implementation of deliverable (2).

Testing and bug fixing.

Week 12 (8th—14th August)

Testing, bug fixing, documentation.

Final week (15th–23rd August)

Testing, bug fixing, documentation.

Skills and qualification

I graduated from MSU, Faculty of Mechanics and Mathematics, Department of Computational Mathematics, where I got some proficiency in math, algorithms, databases, C/C++ coding, and overall good thinking.

After that, I entered the master's program in Computational Linguistics at HSE, which is quite heavy on programming. There I got some experience in Python and machine learning, spent some time coding text and data processing tools, and picked up a bit of web development to present the results or build simple web interfaces for my tools.

Check out my most recent NLP project, which I developed in close collaboration with Kira Droganova, at http://web-corpora.net/wsgi3/ru-syntax/. A collection of my other projects and tasks for the HSE program can be found at https://bitbucket.org/namelessone/, but please keep in mind that some of them are old, and I have come to understand NLP and Python better since then.

I mostly code in Python, but C, C++, and sh are also readily available if need be. I can also easily adapt to using external tools and integrating with existing code. I also have a penchant for keeping code tidy and neat (and simple).

Non-GSoC plans

I am going to apply to the RBMT Summer School in Alacant, Spain, to work there on another Apertium task, namely User-friendly lexical selection training. If I am accepted, I will need a break from my GSoC assignment from 11th to 22nd July. As advised by Tommi Pirinen, I can spread the missing hours over the other weeks.

I will also need some time for my graduate work, but I'll do my best to finish the largest chunk of it by the 22nd of April.