Difference between revisions of "User:Deltamachine/proposal2018"

Revision as of 17:46, 22 March 2018

Contact information

Name: Anna Kondrateva

Location: Moscow, Russia

E-mail: an-an-kondratjeva@yandex.ru

Phone number: +79250374221

IRC: deltamachine

SourceForge: deltamachine

Timezone: UTC+3

Skills and experience

I am a third-year bachelor's student at the Faculty of Linguistics of the National Research University «Higher School of Economics» (NRU HSE).

Main university courses:

  • Programming (Python, R)
  • Computer Tools for Linguistic Research
  • Theory of Language (Phonetics, Morphology, Syntax, Semantics)
  • Language Diversity and Typology
  • Machine Learning
  • Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)
  • Theory of Algorithms
  • Databases

Technical skills:

  • Programming languages: Python, R, Javascript
  • Web design: HTML, CSS
  • Frameworks: Flask, Django
  • Databases: SQLite, PostgreSQL, MySQL

Projects and experience: http://github.com/deltamachine

Languages: Russian (native), English, German

Why is it you are interested in machine translation?

I'm a computational linguist and I'm in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines them in the most interesting way. Working with a machine translation system will allow me to learn more about different languages, their structures and modern approaches to machine translation, and to see what results such systems can achieve. This is very exciting!

Why is it that you are interested in Apertium?

I participated in GSoC 2017 with Apertium and it was a great experience. I successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills improved a lot and I want to develop them further. I also participated in GCI 2017 as a mentor for Apertium, which was great too. So I am very interested in contributing further to Apertium.

This organisation works on things which are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because this work is very important and yet not mainstream.

Also, the Apertium community is very friendly and open to new members, and people here are always ready to help. This encourages me to work with them.

Which of the published tasks are you interested in? What do you plan to do?

I would like to work on improving language pairs by mining MediaWiki Content Translation postedits.

The purpose of this proposal is to create a toolbox for automatically improving the lexical component of a language pair.

Definitions

  • S: source sentence
  • MT: machine translation system (Apertium in our case)
  • MT(S): machine translation of S
  • PE(MT(S)): post-editing of the machine translation of S
  • O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S))
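
As a minimal sketch of the notation above, a postediting operation can be represented as a triplet of segments. The class name `Triplet`, its field names and the placeholder strings are hypothetical and chosen only for illustration; they are not taken from the existing tools.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """One postediting operation (s, mt, pe); the names are illustrative only."""
    s: str   # source segment
    mt: str  # machine translation of s
    pe: str  # postedited version of mt


# O is then simply a set of such triplets (toy placeholder strings).
O = {
    Triplet(s="a b", mt="x y", pe="x z"),
    Triplet(s="c", mt="c", pe="w"),
}
```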

Work stages

Choosing a language pair(s) to experiment with and collecting/processing data.

About language pair

I would like to work with languages I know or can at least more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. But the methods I'm going to develop won't be tied to a particular language pair.

About collecting and processing data

There can be two approaches.

  • Using MediaWiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/.

    + the target (postedited) side is very close to the given machine translation, because it is based on it.

    - MediaWiki articles often contain very long and very specific sentences, and these factors can affect the quality of the extracted triplets, because the current extraction method is built on an edit distance algorithm.

    - there will surely also be words and sentences in other languages (translations, links and other typical Wikipedia content), which will add noise to our data.

  • Using parallel corpora and the Apertium translation of the source side.

    + in parallel corpora, especially those other than Europarl, sentences are usually not very long or complicated and contain fairly common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than MediaWiki articles.

    + parallel corpora are likely to contain less noise.

    - the target side might be very different from the Apertium translation, especially for long and complicated sentences.

The choice between these approaches is open to discussion. I think that we might experiment with both approaches, or even mix the different types of data, and see whether there is any difference.
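
To make the trade-off between the two data sources more concrete, here is a minimal sketch of a word-level edit distance between a machine translation and its target side. It is a plain Levenshtein distance over tokens, not necessarily the exact variant used by the existing tools, and the function names are hypothetical.

```python
from typing import List


def word_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance over tokens (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def normalised_distance(mt: str, pe: str) -> float:
    """Edit distance divided by the length of the longer sentence (0.0 to 1.0)."""
    mt_tokens, pe_tokens = mt.split(), pe.split()
    longest = max(len(mt_tokens), len(pe_tokens))
    return word_edit_distance(mt_tokens, pe_tokens) / longest if longest else 0.0
```

A low normalised distance is what one would expect from postedited data, while a freely translated parallel corpus may give much higher values; a threshold on this value could be one way to filter sentence pairs before extraction.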

Improving existing methods

Some tools for learning and applying postediting operations have already been created (https://github.com/mlforcada/naive-automatic-postediting). However, they need to be improved in some ways.

1. The main problem is the low speed of learn_postedits.py and apply_postedits.py. Repeated Apertium calls and large maximum source/target segment lengths make the process slow. This makes extracting triplets from a big corpus very hard, while a small corpus is of little use.

I have implemented a caching function for learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): instead of calling Apertium every time the program needs to translate or analyze a subsegment, it first checks whether this subsegment is already stored in a database. If it is, the result is taken from there; if not, the program calls Apertium and then adds the new information to the database.
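
The idea can be sketched as follows. This is not the code from explain2_cache.py: the class name `TranslationCache`, the table layout and the stand-in `fake_translate` function are all assumptions made for illustration, and a real version would call Apertium instead of the stand-in.

```python
import sqlite3
from typing import Callable, List


class TranslationCache:
    """Look a segment up in SQLite first; call the translator only on a miss."""

    def __init__(self, translate: Callable[[str], str], path: str = ":memory:"):
        self._translate = translate
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(segment TEXT PRIMARY KEY, output TEXT)"
        )

    def get(self, segment: str) -> str:
        row = self._db.execute(
            "SELECT output FROM cache WHERE segment = ?", (segment,)
        ).fetchone()
        if row is not None:
            return row[0]                      # cache hit: no translator call
        output = self._translate(segment)      # cache miss: expensive call
        self._db.execute("INSERT INTO cache VALUES (?, ?)", (segment, output))
        self._db.commit()
        return output


calls: List[str] = []


def fake_translate(segment: str) -> str:
    """Hypothetical stand-in for an Apertium call; records each invocation."""
    calls.append(segment)
    return segment.upper()


cache = TranslationCache(fake_translate)
cache.get("hello")
cache.get("hello")  # served from the database, fake_translate is not called
```

Passing a file path instead of `":memory:"` would keep the cache between runs, which matters most when the same subsegments recur across a large corpus.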

Results of running the two versions of learn_postedits.py on a 100-sentence corpus with different parameters:

Parameters    With caching    Without caching
-m 1 -M 1     2m55s           3m41s
-m 2 -M 2     6m50s           7m08s
-m 3 -M 3     8m25s           10m25s
-m 4 -M 4     11m45s          13m18s

Caching clearly saves time. For 100 sentences the difference is not very large, because not much information is stored in the database yet, but it is likely to become much more important when the code is run on, for example, 50000 sentences.
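
The relative savings can be worked out directly from the timings reported above. The helper name `to_seconds` is hypothetical; the numbers are the ones from the table.

```python
def to_seconds(t: str) -> int:
    """Convert a string such as '2m55s' to a number of seconds."""
    minutes, seconds = t.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


# (with caching, without caching) for each parameter setting
timings = {
    "-m 1 -M 1": ("2m55s", "3m41s"),
    "-m 2 -M 2": ("6m50s", "7m08s"),
    "-m 3 -M 3": ("8m25s", "10m25s"),
    "-m 4 -M 4": ("11m45s", "13m18s"),
}

savings = {}
for params, (cached, uncached) in timings.items():
    c, u = to_seconds(cached), to_seconds(uncached)
    savings[params] = round(100 * (u - c) / u, 1)  # percent of time saved
    print(f"{params}: {u - c}s saved ({savings[params]}%)")
```

On this small corpus the saving ranges from about 4% to about 21% of the running time.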

What else needs to be done:

  • Caching function for apply_postedits.py
  • Search for other ways we can improve the speed.

2. Some code refactoring needs to be done.

Searching for extracted postediting operations that improve translation quality

The next step is to process a big training set with different parameters, extract potential postediting operations, apply them to the test set and find those that consistently improve the quality of the original Apertium translation.

A language model score might serve as a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.
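
As an illustration of the language model criterion, here is a toy sketch: a bigram model with add-one smoothing, and a check whether the postedited sentence scores higher than the original translation. The class `BigramLM`, the function `improves` and the English toy sentences are assumptions for illustration only; a real experiment would use a properly trained target-language model.

```python
import math
from collections import Counter
from typing import List


class BigramLM:
    """Toy bigram language model with add-one smoothing."""

    def __init__(self, sentences: List[str]):
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, sentence: str) -> float:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def improves(lm: BigramLM, original: str, postedited: str) -> bool:
    """True if the postedited sentence gets a higher LM score than the original."""
    return lm.log_prob(postedited) > lm.log_prob(original)


lm = BigramLM(["the cat sleeps", "the dog sleeps"])
result = improves(lm, "the cat sleep", "the cat sleeps")
```

Raw log-probabilities favour shorter sentences, so when an operation changes the sentence length a per-token normalisation may be a fairer comparison.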

Classifying successful postediting operations

After finding safe postediting operations, we should classify them in order to find out how this information might be inserted into a language pair.

There might be several types:

  • Monodix/bidix entries
  • Lexical selection rules
  • Transfer rules (?)
  • and so on

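For example, to identify potential bidix entries, we might choose the subset of triplets (s, mt, pe) from O for which s = mt. In such cases Apertium has probably left the source segment untranslated, so the pair (s, pe) is a candidate bidix entry.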
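
A minimal sketch of this filter, with triplets represented as plain (s, mt, pe) tuples. The function name is hypothetical, and the handling of a leading asterisk is an assumption: Apertium usually marks unknown words this way in its output, but the real data may need different normalisation.

```python
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (s, mt, pe)


def bidix_candidates(operations: Iterable[Triplet]) -> List[Tuple[str, str]]:
    """Return (s, pe) pairs where the MT output is just the untranslated source."""
    candidates = set()
    for s, mt, pe in operations:
        # Assumption: an unknown word may come back prefixed with "*".
        if mt.lstrip("*") == s and pe != s:
            candidates.add((s, pe))
    return sorted(candidates)


# Toy operations for illustration only.
toy_operations = [
    ("кіт", "*кіт", "кот"),  # untranslated, fixed by the posteditor
    ("a", "a", "b"),         # untranslated, fixed by the posteditor
    ("x", "y", "z"),         # translated, so not a bidix candidate here
    ("q", "q", "q"),         # left unchanged by the posteditor
]
found = bidix_candidates(toy_operations)
```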

Creating tools for inserting useful information into a language pair

The last step is to create tools for inserting useful information into a language pair. These might be scripts that automatically create monodix/bidix entries or write rules, based on the given data and its type, and insert them into the dictionaries.
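
One such script could be sketched like this. It builds a single bidix-style entry of the form `<e><p><l>…</l><r>…</r></p></e>` with one part-of-speech symbol on each side. The function name, the example word pair and the choice of the `n` tag are assumptions for illustration; real entries often need more symbols (gender, animacy and so on) and should follow the conventions of the particular language pair.

```python
import xml.etree.ElementTree as ET


def make_bidix_entry(left: str, right: str, pos: str) -> str:
    """Build one bidix-style entry with a single POS symbol on each side."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for tag, lemma in (("l", left), ("r", right)):
        side = ET.SubElement(pair, tag)
        side.text = lemma                    # lemma comes before the symbols
        ET.SubElement(side, "s", {"n": pos})
    return ET.tostring(entry, encoding="unicode")


# Illustrative ukr-rus noun pair; "n" is assumed to be the noun tag.
entry_xml = make_bidix_entry("кіт", "кот", "n")
print(entry_xml)
```

Generating the XML through a library rather than by string concatenation keeps the output well-formed even when a lemma contains characters such as `&` or `<`.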

Reasons why Google and Apertium should sponsor it

A description of how and who it will benefit in society

Firstly, the methods developed during this project will help to improve translation quality for many language pairs and reduce the amount of human work.

Secondly, there are currently very few papers on using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.

Work plan

Post application period

Community bonding period

Work period

    Part 1, weeks 1-4:

  • Week 1:
  • Week 2:
  • Week 3:
  • Week 4:
  • Deliverable #1, June 26 - 30

    Part 2, weeks 5-8:

  • Week 5:
  • Week 6:
  • Week 7:
  • Week 8:
  • Deliverable #2, July 24 - 28

    Part 3, weeks 9-12:

  • Week 9:
  • Week 10:
  • Week 11: testing, fixing bugs
  • Week 12: cleaning up the code, writing documentation
  • Project completed:

I am also going to write short notes about the work process on my project page throughout the summer.

Non-Summer-of-Code plans you have for the Summer

I have exams at the university until the third week of June, so until then I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can start working on the project during the community bonding period. After the exams I will be able to work full time and spend 45-50 hours per week on the task.

Coding challenge

https://github.com/deltamachine/naive-automatic-postediting

  • parse_ct_json.py: a script that parses a MediaWiki JSON file and splits the whole corpus into train and test sets of a given size.
  • estimate_changes.py: a script that takes a file generated by apply_postedits.py and uses a language model to score the sentences that were processed with postediting rules.
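
The splitting step can be sketched roughly as follows. This is not the logic of parse_ct_json.py itself: the function name, the shuffling and the fixed seed are assumptions made for illustration, and the input is assumed to be an already parsed list of sentence pairs.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_corpus(pairs: Sequence[Pair], train_size: int, test_size: int,
                 seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle the corpus and cut non-overlapping train and test sets."""
    if train_size + test_size > len(pairs):
        raise ValueError("corpus is too small for the requested split")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    train = shuffled[:train_size]
    test = shuffled[train_size:train_size + test_size]
    return train, test


# Toy corpus of placeholder sentence pairs.
corpus = [(f"mt {i}", f"pe {i}") for i in range(10)]
train, test = split_corpus(corpus, train_size=7, test_size=3)
```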

I have also refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. The new version is stored in the repository linked above as cleaned_learn_postedits.py.