Difference between revisions of "User:Nikita Medyankin/GSoC 2016 WTR Proposal"

Revision as of 16:57, 23 March 2016

Name

Nikita Medyankin

Contacts

e-mail: nikita.medyankin@gmail.com
IRC: tiefling-cat on #apertium at irc.freenode.net
cell phone: +79260366763
github: https://github.com/tiefling-cat/
sourceforge: tiefling-cat

Questions from Apertium

Why is it you are interested in machine translation?

As I understand, modern rule-based machine translation is a fine mix of linguistics, various NLP tools, machine learning, coding, and maths. I love all those things, and I also love when things from assorted fields are put together to produce something complex. RBMT is a very good field to find a use for the skills and knowledge I obtained while studying as a linguist.

Why is it that you are interested in Apertium?

I first heard about Apertium from the info letter of RBMT summer school in Alacant. Honestly saying, I was a bit confused, because I previously thought that rule-based MT was rendered somewhat obsolete by statistical methods. But after that I had a quick chat with Francis in HSE. Francis showed me ukr-rus translation, and I was amazed at the quality of the resulting Russian text.

Then we had that meeting about GSoC with Francis and Ekaterina, and after that Francis told me that the main idea was to translate in pairs of related languages. Google Translate only has huge parallel corpora of English to smth, so it does double translation through English when asked to translate from non-English to non-English whereas Apertium can provide superior results translating directly. I also see that there are pairs of not-so-closely-related languages, but I imagine they might also be useful because of the direct nature of the translation, just a bit more tricky to elaborate.

All of the above made me very excited about the project. I also can see that Apertium is done by a band of very inspired people, and it's always great to work with people who really care about the results.

Which of the published tasks are you interested in?

I have chosen the Weighted transfer rules.

I asked Francis what ideas for Alacant and GSoC I should look at supposing I'm more of a code monkey than a linguist (and I don't know any languages besides Russian and English anyway). He pointed at the Weighted transfer rules and I really liked it because, personally, I hate when ambiguity has to be solved by just choosing the first alternative from the list. I myself have some history of battling case/number ambiguity of Russian nouns for the project of syntax parsing website done in collaboration with Kira Droganova, check it out at http://web-corpora.net/wsgi3/ru-syntax/).

I also love machine learning and related tasks and did quite a few of ML tasks studying as a linguist at HSE, and the opportunity to learn about the new instruments or to put my skills to a good use is always exciting.

What do you plan to do?

Сode 'em all.

The proposal

Apertium Weighted Transfer Rules

Why Google and Apertium should sponsor it

Well, regarding the task, now the rules are devised so as to prevent any conflicts, and that limits the system's capabilities. If someone would come to devise the rules disregarding the possible conflicts, that would mean he is basically relying on good luck and maybe a limited foresight to choose the right rule, which looks slightly better than just choosing the random rule.

The ambiguity is inevitable and natural feature of any natural language. Resolving it is the most difficult thing in NLP, but also the most important. I regard this task as both interesting for the quality improvements it is meant to bring over the current status, and as some sort of case study.

How and who it will benefit in society

Well, as long as there is a variety of languages in the World, people will always need translations and while I personally still think that translating fine fiction is a work for human artist, machine translation is very useful when you need to read some documentation, wikis, sites, and the like in a foreign language. Any improvements to the translation quality we can think of, will make the translations clearer and eventually make the World better, and people closer to each other.

Work plan

Preliminaries

I have installed all the Apertium tools, a couple of language pairs for experimenting, went through the tutorial for the transfer rules, and done the coding challenge. You can check it out at https://github.com/tiefling-cat/apertium-rules.

Deliverables

As I understand the task now, the deliverables should be as follows:

(1) Standalone training script. That one will be used for computing the weights given the corpus and the rules. The design must be pretty straightforward, so anyone would be able to compute the weights for a given language pair.

(2) The code in C++ integrated into Apertium to choose the rules given the weights.

We agreed with Francis on that the weights are ought to be put into a separate file (think xml or maybe yaml, it's a topic to discuss).

In order to make it, I first have to obtain the weights manually on some language pair, experiment, check if it works at all, then implement (1) and (2).

The rough outline of the weight obtaining process is meant to be as follows:

Choose a language pair, get a corpus of texts in source language.
For each sentence in the corpus, get all translations allowed by the rules present, or a subset of them (LRLM, optimal coverage, whatever).
Score the variants using a target language model, like kenlm.
Run supervised training, get the weights.

Example pairs

Toy pair First, a toy language pair will be created for the purposes of coding, debugging and experimenting in a small controlled environment. It would be a simple language pair with a few conflicting rules, and a small corpus of maybe twenty sentences or even pieces of sentences. For example, it might be a rus-eng pair with rules mostly regarding translating Genitive.

Real pairs For experiments on real language material, some pairs with rich collection of rules should be chosen, e.g., en-es, eu-es, eng-kaz. Francis told me that the rules are now designed so that the ambiguity is not present (I've encountered at least two pairs of conflicting rules in apertium-en-es.en-es.t1x file while doing the coding challenge, although that is probably because I've been calculating the coverages on ambiguous output of lt-proc). Thus, I will need the help of the mentors to deliberately add conflicting rules to the existing set.

Documentation

Besides the documentation for the code itself, I think а fully documented process of weight obtaining will come in handy. As per Francis' advice, the code itself will be documented alongside the coding process, because as he put it, if the documentation is left for the last part, it's not done at all.

Schedule

Community Bonding Period (22th April—22th May)

Getting to better know the workflow of how the rules are applied as I only worked with .t1x file while doing the coding challenge.
Coming to know the core Apertium code.
Producing the toy pair.
Introducing ambiguity into chosen real language pairs.

Week 1 (23th—29th May)

Obtaining the weights by hand for toy pair.

Week 2 (30th May—5th June)

Obtaining the weights by hand for en-es. Experimenting on the process.

Week 3 (6th—12th June)

Implementation of deliverable (1).

Week 4 (13th—19th June)

Implementation of deliverable (1).

Testing and bug fixing.

Mid-term evaluation

Deliverable (1).

Week 5 (20th—26th June)

Start of implementation of deliverable (2).

Week 6 (27th June—3th July)