Difference between revisions of "User:Nikita Medyankin/GSoC 2016 WTR Proposal"

From Apertium
Latest revision as of 18:37, 24 March 2016

Name

Nikita Medyankin

Contacts

  • e-mail: nikita.medyankin@gmail.com
  • IRC: tiefling-cat on #apertium at irc.freenode.net
  • cell phone: +79260366763
  • github: https://github.com/tiefling-cat/
  • sourceforge: tiefling-cat

Questions from Apertium

Why is it you are interested in machine translation?

As I understand it, modern rule-based machine translation is a fine mix of linguistics, various NLP tools, machine learning, coding, and maths. I love all those things, and I also love it when things from assorted fields are put together to produce something complex. RBMT looks like an excellent field for me to use the skills and knowledge I obtained while studying as a linguist, to improve them, and to learn more.

Why is it that you are interested in Apertium?

I first heard about Apertium from the info letter of the RBMT summer school in Alacant. Honestly, I was a bit confused, because I previously thought that rule-based MT had been rendered somewhat obsolete by statistical methods. But after that I had a quick chat with Francis at HSE. Francis showed me a ukr-rus translation, and I was amazed at the quality of the resulting Russian text.

Then we had that meeting about GSoC with Francis and Ekaterina, and after that Francis told me that the main idea was to translate between pairs of related languages. Google Translate only has huge parallel corpora of English to something, so it does double translation through English when asked to translate from non-English to non-English, whereas Apertium can provide superior results by translating directly. I also see that there are pairs of not-so-closely-related languages, but I imagine they might also be useful because of the direct nature of the translation, just a bit trickier to develop.

All of the above made me very excited about the project. I can also see that Apertium is made by a band of very inspired people, and it's always great to work with people who really care about the results.

Which of the published tasks are you interested in?

I have chosen the Weighted transfer rules.

I asked Francis what ideas for Alacant and GSoC I should look at, supposing I'm more of a code monkey than a linguist (and I don't know any languages besides Russian and English anyway). Francis pointed at the Weighted transfer rules and I really liked the idea, because, personally, I hate it when ambiguity has to be resolved by just choosing the first alternative from the list. I myself have some history of battling case/number ambiguity of Russian nouns for a syntax parsing website project done in collaboration with Kira Droganova (check it out at http://web-corpora.net/wsgi3/ru-syntax/). And the more I learned about the task, the more I liked it.

I also love machine learning and related tasks and did quite a few assorted ML tasks while studying as a linguist at HSE, and the opportunity to learn about new tools or to put my skills to good use is always exciting.

What do you plan to do?

Code 'em all.

The proposal

Apertium Weighted Transfer Rules

Why Google and Apertium should sponsor it

Currently, transfer rules have to be devised so as to prevent any conflicts, and that limits the system's capabilities, because ambiguity is an essential part of any natural language, and that fact should be reflected by the rules. Introducing weighted rules would have a positive impact both on how the transfer rules represent the language pair and on the quality of the resulting translation. Moreover, the task is meant to be an improvement to the architecture of Apertium and would benefit all language pairs.

How and who it will benefit in society

As long as there is a variety of languages in the world, people will always need translations, and while I personally still think that translating fine fiction is work for a human artist, machine translation is extremely useful when you need to read some documentation, wikis, sites, and the like in a foreign language. Any improvements to the translation quality we can think of will make the translations clearer and eventually make the world better and bring people closer to each other.

Work plan

Preliminaries

I have svn-installed all the Apertium tools and a couple of language pairs for experimenting, gone through the tutorial for the transfer rules, and done the coding challenge. You can check it out at https://github.com/tiefling-cat/apertium-rules.

Deliverables

As I understand the task now, the deliverables should be as follows:

  • (1) Standalone training script. It will be used for computing the weights given the corpus and the rules. The design must be pretty straightforward, so that anyone can compute the weights for a given language pair.
  • (2) The code in C++ integrated into Apertium to choose the rules given the weights.

Francis and I agreed that the weights ought to be put into a separate file (think XML or maybe YAML; it's a topic to discuss).
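
The format of the weights file is still open. As a minimal sketch only, assuming a purely hypothetical XML layout in which each `rule` element carries an `id` and a `weight` attribute (neither these names nor the rule ids come from Apertium), such a file could be read like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: element names, attribute names, and rule ids are
# assumptions made for illustration, not an existing Apertium format.
SAMPLE_WEIGHTS = """\
<weights>
  <rule id="gen-of" weight="0.7"/>
  <rule id="gen-poss" weight="0.3"/>
</weights>
"""


def load_weights(xml_text):
    """Return a dict mapping each rule id to its weight as a float."""
    root = ET.fromstring(xml_text)
    return {rule.get("id"): float(rule.get("weight")) for rule in root.iter("rule")}


weights = load_weights(SAMPLE_WEIGHTS)
```

Keeping the weights apart from the rules themselves means the rule files stay untouched and the weights can be retrained without editing them.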

To get there, I first have to obtain the weights manually on some language pair, experiment, and check whether it works at all, and then implement (1) and (2).

The rough outline of the weight obtaining process is meant to be as follows:

  1. Choose a language pair, get a corpus of texts in source language.
  2. For each sentence in the corpus, get all translations allowed by the rules present, or a subset of them (LRLM, optimal coverage, whatever).
  3. Score the variants using a target language model, like kenlm.
  4. Run supervised training, get the weights.
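
The outline above can be sketched roughly as follows. This is an illustration under stated assumptions, not the planned training procedure: the language model scores are hard-coded stand-ins for what a model such as kenlm would return, the "training" step is replaced by simple fractional counting, and all rule ids and function names are hypothetical.

```python
import math
from collections import defaultdict


def estimate_weights(corpus, conflict_groups):
    """Estimate rule weights from LM-scored translation variants.

    corpus: list of sentences; each sentence is a list of
        (rules_used, lm_log_score) candidates.
    conflict_groups: list of lists of rule ids competing for one pattern.
    """
    counts = defaultdict(float)
    for candidates in corpus:
        # Softmax over the LM log-scores of this sentence's candidates.
        top = max(score for _, score in candidates)
        exps = [math.exp(score - top) for _, score in candidates]
        total = sum(exps)
        # Each rule gets the probability mass of the candidates it produced.
        for (rules_used, _), e in zip(candidates, exps):
            for rule in rules_used:
                counts[rule] += e / total
    weights = {}
    for group in conflict_groups:
        group_total = sum(counts[r] for r in group)
        for r in group:
            # Fall back to uniform weights if no rule of the group was seen.
            weights[r] = counts[r] / group_total if group_total else 1.0 / len(group)
    return weights


# Toy corpus: two sentences, two competing variants each (scores are made up).
toy_corpus = [
    [(["gen-of"], -10.0), (["gen-poss"], -12.0)],
    [(["gen-of"], -8.0), (["gen-poss"], -8.5)],
]
toy_weights = estimate_weights(toy_corpus, [["gen-of", "gen-poss"]])
```

Normalizing within each group of competing rules keeps the weights comparable only where a choice actually has to be made.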

Example pairs

Toy pair. First, a toy language pair will be created for coding, debugging, and experimenting in a small controlled environment. It would be a simple language pair with a few conflicting rules and a small corpus of maybe twenty sentences or even pieces of sentences. For example, it might be a rus-eng pair with rules mostly concerning the translation of the genitive.

Real pairs. For experiments on real language material, some pairs with a rich collection of rules should be chosen, e.g., en-es, eu-es, eng-kaz. Francis told me that the rules are currently designed so that no ambiguity is present (I encountered at least two pairs of conflicting rules in the apertium-en-es.en-es.t1x file while doing the coding challenge, although that is probably because I was calculating the coverages on the ambiguous output of lt-proc). Thus, I will need the help of the mentors to deliberately add conflicting rules to the existing set.
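
To make "conflicting rules" concrete, here is a minimal sketch. It assumes, as a simplification, that a rule can be reduced to an id plus a tuple of category names for its pattern, and that two rules conflict when their patterns are identical. The rule ids, patterns, weights, and function names are all hypothetical.

```python
from collections import defaultdict


def find_conflicts(rules):
    """Group rule ids that share a pattern.

    rules: list of (rule_id, pattern) in file order, where pattern is a
    tuple of category names. Returns only groups with more than one rule.
    """
    by_pattern = defaultdict(list)
    for rule_id, pattern in rules:
        by_pattern[pattern].append(rule_id)
    return [ids for ids in by_pattern.values() if len(ids) > 1]


def choose_rule(candidates, weights):
    """Pick the heaviest rule; ties and unweighted rules resolve to the
    earliest one in the list, i.e. the first alternative."""
    return max(candidates, key=lambda rule_id: weights.get(rule_id, 0.0))


toy_rules = [
    ("gen-of", ("n", "n-gen")),
    ("adj-noun", ("adj", "n")),
    ("gen-poss", ("n", "n-gen")),
]
conflicts = find_conflicts(toy_rules)
chosen = choose_rule(conflicts[0], {"gen-of": 0.3, "gen-poss": 0.7})
unweighted = choose_rule(conflicts[0], {})
```

With no weights available, the choice degrades to picking the first alternative, which is the behaviour the weights are meant to improve on.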

Documentation

Besides the documentation for the code itself, I think a fully documented process of obtaining the weights will come in handy. As per Francis' advice, the code itself will be documented alongside the coding process, because, as he put it, if the documentation is left for last, it doesn't get done at all.

Schedule

Community Bonding Period (22nd April–22nd May)

  • Getting to know better how the rules are applied, as I only worked with the .t1x file while doing the coding challenge.
  • Getting to know the core Apertium code.
  • Producing the toy pair.
  • Introducing ambiguity into chosen real language pairs.

Week 1 (23rd–29th May)

Obtaining the weights by hand for the toy pair.

Week 2 (30th May–5th June)

Obtaining the weights by hand for en-es. Experimenting with the process.

Week 3 (6th–12th June)

Implementation of deliverable (1).

Week 4 (13th–19th June)

Implementation of deliverable (1).

Testing and bug fixing.

Mid-term evaluation

Deliverable (1).

Week 5 (20th–26th June)

Start of implementation of deliverable (2).

Week 6 (27th June–3rd July)

Implementation of deliverable (2).

Week 7 (4th–10th July)

Implementation of deliverable (2).

Week 8 (11th–17th July)

RBMT Summer School in Alacant: working on Apertium User-friendly lexical selection training (see Non-GSoC plans below).

Week 9 (18th–24th July)

RBMT Summer School in Alacant: working on Apertium User-friendly lexical selection training (see Non-GSoC plans below).

Week 10 (25th–31st July)

Implementation of deliverable (2).

Week 11 (1st–7th August)

Implementation of deliverable (2).

Testing and bug fixing.

Week 12 (8th–14th August)

Implementation of deliverable (2).

Testing and bug fixing.

Final week (15th–23rd August)

Final testing and bug fixing.

Skills and qualification

I graduated from MSU, Faculty of Mechanics and Mathematics, Department of Computational Mathematics, where I gained some proficiency in math, algorithms, databases, C/C++ coding, and overall good thinking.

After that, I entered the master's program in Computational Linguistics at HSE, which is quite heavy on programming, so there I got some experience in Python and machine learning, spent some time coding text and data processing tools, and picked up a bit of web development to present the results or build simple web interfaces for my tools.

Check out my most recent NLP project, which I developed in close collaboration with Kira Droganova, at http://web-corpora.net/wsgi3/ru-syntax/. A collection of my other projects and tasks for the HSE program can be found at https://bitbucket.org/namelessone/, but please keep in mind that some of them are old, and I have come to understand NLP and Python better since then.

I mostly code in Python, but C, C++, and sh are also readily available if need be. I can also easily adapt to using external tools and integrating with existing code. I also have a penchant for keeping my code tidy and neat (and simple).

Non-GSoC plans

I am going to apply for the RBMT Summer School in Alacant, Spain, to work there on another Apertium task, namely User-friendly lexical selection training. If my application is accepted, I will need a vacation from my GSoC assignment on 11th–22nd July. As advised by Tommi Pirinen, I can spread those missing hours over other weeks.

I will also need some time for my graduate work, which is due on the 10th of June, but I'll do my best to finish the largest chunk of it by the 22nd of April.