User:Nikita Medyankin/GSoC 2016 WTR Proposal

From Apertium
< User:Nikita Medyankin
Revision as of 18:37, 24 March 2016 by Nikita Medyankin (talk | contribs) (→‎Non-GSoC plans)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Name[edit]

Nikita Medyankin

Contacts[edit]

  • e-mail: nikita.medyankin@gmail.com
  • IRC: tiefling-cat on #apertium at irc.freenode.net
  • cell phone: +79260366763
  • github: https://github.com/tiefling-cat/
  • sourceforge: tiefling-cat

Questions from Apertium[edit]

Why is it you are interested in machine translation?[edit]

As I understand, modern rule-based machine translation is a fine mix of linguistics, various NLP tools, machine learning, coding, and maths. I love all those things, and I also love when things from assorted fields are put together to produce something complex. RBMT looks like an excellent field for me to use the skills and knowledge I obtained while studying as a linguist, to improve them, and to learn more.

Why is it that you are interested in Apertium?[edit]

I first heard about Apertium from the info letter of RBMT summer school in Alacant. Honestly saying, I was a bit confused, because I previously thought that rule-based MT was rendered somewhat obsolete by statistical methods. But after that I had a quick chat with Francis in HSE. Francis showed me ukr-rus translation, and I was amazed at the quality of the resulting Russian text.

Then we had that meeting about GSoC with Francis and Ekaterina, and after that Francis told me that the main idea was to translate in pairs of related languages. Google Translate only has huge parallel corpora of English to smth, so it does double translation through English when asked to translate from non-English to non-English whereas Apertium can provide superior results translating directly. I also see that there are pairs of not-so-closely-related languages, but I imagine they might also be useful because of the direct nature of the translation, just a bit more tricky to elaborate.

All of the above made me very excited about the project. I also can see that Apertium is done by a band of very inspired people, and it's always great to work with people who really care about the results.

Which of the published tasks are you interested in?[edit]

I have chosen the Weighted transfer rules.

I asked Francis what ideas for Alacant and GSoC I should look at supposing I'm more of a code monkey than a linguist (and I don't know any languages besides Russian and English anyway). Francis pointed at the Weighted transfer rules and I really liked it, because, personally, I hate when ambiguity has to be resolved by just choosing the first alternative from the list. I myself have some history of battling case/number ambiguity of Russian nouns for the project of syntax parsing website done in collaboration with Kira Droganova, check it out at http://web-corpora.net/wsgi3/ru-syntax/). And the more I learned about the task, the more I liked it.

I also love machine learning and related tasks and did quite a few of assorted ML tasks studying as a linguist at HSE, and the opportunity to learn about the new instruments or to put my skills to a good use is always exciting.

What do you plan to do?[edit]

Сode 'em all.

The proposal[edit]

Apertium Weighted Transfer Rules

Why Google and Apertium should sponsor it[edit]

Currently, transfer rules have to be devised so as to prevent any conflicts, and that limits the system's capabilities, because ambiguity is an essential part of any natural language, and that fact should be reflected by the rules. Introducing of weighted rules would have a positive impact on both how the transfer rules represent the language pair and the quality of the resulting translation. Moreover, the task is meant to be an improvement to the architecture of Apertium and would benefit all language pairs.

How and who it will benefit in society[edit]

As long as there is a variety of languages in the World, people will always need translations and while I personally still think that translating fine fiction is a work for human artist, machine translation is extremely useful when you need to read some documentation, wikis, sites, and the like in a foreign language. Any improvements to the translation quality we can think of will make the translations clearer and eventually make the World better, and people closer to each other.

Work plan[edit]

Preliminaries[edit]

I have svn-installed all the Apertium tools, a couple of language pairs for experimenting, went through the tutorial for the transfer rules, and done the coding challenge. You can check it out at https://github.com/tiefling-cat/apertium-rules.

Deliverables[edit]

As I understand the task now, the deliverables should be as follows:

  • (1) Standalone training script. That one will be used for computing the weights given the corpus and the rules. The design must be pretty straightforward, so anyone would be able to compute the weights for a given language pair.
  • (2) The code in C++ integrated into Apertium to choose the rules given the weights.

We agreed with Francis on that the weights are ought to be put into a separate file (think xml or maybe yaml, it's a topic to discuss).

In order to make it, I first have to obtain the weights manually on some language pair, experiment, check if it works at all, then implement (1) and (2).

The rough outline of the weight obtaining process is meant to be as follows:

  1. Choose a language pair, get a corpus of texts in source language.
  2. For each sentence in the corpus, get all translations allowed by the rules present, or a subset of them (LRLM, optimal coverage, whatever).
  3. Score the variants using a target language model, like kenlm.
  4. Run supervised training, get the weights.

Example pairs[edit]

Toy pair First, a toy language pair will be created for the purposes of coding, debugging and experimenting in a small controlled environment. It would be a simple language pair with a few conflicting rules, and a small corpus of maybe twenty sentences or even pieces of sentences. For example, it might be a rus-eng pair with rules mostly regarding translating Genitive.

Real pairs For experiments on real language material, some pairs with rich collection of rules should be chosen, e.g., en-es, eu-es, eng-kaz. Francis told me that the rules are now designed so that the ambiguity is not present (I've encountered at least two pairs of conflicting rules in apertium-en-es.en-es.t1x file while doing the coding challenge, although that is probably because I've been calculating the coverages on ambiguous output of lt-proc). Thus, I will need the help of the mentors to deliberately add conflicting rules to the existing set.

Documentation[edit]

Besides the documentation for the code itself, I think а fully documented process of weight obtaining will come in handy. As per Francis' advice, the code itself will be documented alongside the coding process, because as he put it, if the documentation is left for the last part, it's not done at all.

Schedule[edit]

Community Bonding Period (22th April—22th May)

  • Getting to better know the workflow of how the rules are applied as I only worked with .t1x file while doing the coding challenge.
  • Coming to know the core Apertium code.
  • Producing the toy pair.
  • Introducing ambiguity into chosen real language pairs.

Week 1 (23th—29th May)

Obtaining the weights by hand for toy pair.

Week 2 (30th May—5th June)

Obtaining the weights by hand for en-es. Experimenting on the process.

Week 3 (6th—12th June)

Implementation of deliverable (1).

Week 4 (13th—19th June)

Implementation of deliverable (1).

Testing and bug fixing.

Mid-term evaluation

Deliverable (1).

Week 5 (20th—26th June)

Start of implementation of deliverable (2).

Week 6 (27th June—3th July)

Implementation of deliverable (2).

Week 7 (4th—10th July)

Implementation of deliverable (2).

Week 8 (11th—17th July)

RBMT Summer School in Alacant: working on Apertium User-friendly lexical selection training (see Non-GSoC plans below).

Week 9 (18th—24th July)

RBMT Summer School in Alacant: working on Apertium User-friendly lexical selection training (see Non-GSoC plans below).

Week 10 (25th—31th July)

Implementation of deliverable (2).

Week 11 (1th—7th August)

Implementation of deliverable (2).

Testing and bug fixing.

Week 12 (8th—14th August)

Implementation of deliverable (2).

Testing and bug fixing.

Final week (15th—23th August)

Final testing and bug fixing.

Skills and qualification[edit]

I've graduated from MSU, Faculty of Mechanics and Mathematics, Department of computational mathematics, where I got some proficiency in math, algorithms, databases, C/C++ coding, and overall good thinking.

After that, I entered HSE for master's program in Computational Linguistics, which is quite heavy on programming, so there I got some experience in python, machine learning, spent some time coding text and data processing tools, and grabbed a bit of web-development skills to present the results or build simple web interfaces for my tools.

Check out my most fresh NLP project that I developed in close collaboration with Kira Droganova, it's at http://web-corpora.net/wsgi3/ru-syntax/ Collection of my other projects and tasks for HSE program can be found at https://bitbucket.org/namelessone/ but please keep in mind that some of them are old, and I grew to better understand things regarding NLP and python since that time.

I mostly code in python, but C, C++, and sh are also readily available if need be. I also can easily adapt to using external tools and integrating with previously made code. I also have a penchant for coding things tidy and neat (and keeping them simple).

Non-GSoC plans[edit]

I am going to apply for RBMT Summer School in Alacant, Spain, to work there on another Apertium task, namely User-friendly lexical selection training. If it will be accepted, I will need a vacation from my GSoC assignment 11th–22nd July. As advised by Tommi Pirinen, I can think of spreading those missing hours over other weeks.

I will also need some time for my graduate work, which is due the 10th of June, but I'll do my best to do the largest chunk of it by the 22th of April.