- 1 Contact information
- 2 Skills and experience
- 3 Why is it you are interested in machine translation?
- 4 Why is it that you are interested in Apertium?
- 5 Which of the published tasks are you interested in? What do you plan to do?
- 5.1 Definitions
- 5.2 Work stages
- 5.2.1 Choosing a language pair(s) to experiment with and collecting/processing data.
- 5.2.2 Improving existing methods
- 5.2.3 Searching for extracted postediting operations that improve the quality of translation
- 5.2.4 Classifying successful postediting operations
- 5.2.5 Creating tools for inserting useful information into a language pair
- 6 Reasons why Google and Apertium should sponsor it
- 7 A description of how and who it will benefit in society
- 8 Work plan
- 9 Non-Summer-of-Code plans you have for the Summer
- 10 Coding challenge
Name: Anna Kondrateva
Location: Moscow/Yekaterinburg, Russia
Phone number: +79250374221
Timezone: UTC+3 (Moscow) / UTC+5 (Yekaterinburg)
Skills and experience
I am a third-year bachelor's student at the Faculty of Linguistics of the National Research University «Higher School of Economics» (NRU HSE).
Main university courses:
- Programming (Python, R)
- Computer Tools for Linguistic Research
- Theory of Language (Phonetics, Morphology, Syntax, Semantics)
- Language Diversity and Typology
- Machine Learning
- Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)
- Theory of Algorithms
- Web design: HTML, CSS
- Frameworks: Flask, Django
- Databases: SQLite, PostgreSQL, MySQL
Projects and experience: http://github.com/deltamachine
Languages: Russian (native), English, German
Why is it you are interested in machine translation?
I am a computational linguist, and I am in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines them in the most interesting way. Working on a machine translation system will let me learn more about different languages and their structures, explore modern approaches to machine translation, and see what results such systems can achieve. This is very exciting!
Why is it that you are interested in Apertium?
I participated in Google Summer of Code 2017 with Apertium, and it was a great experience. I successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better, and I want to develop them further. I also participated in Google Code-in 2017 as an Apertium mentor, which was great too. So I am very interested in continuing to contribute to Apertium.
This organisation works on things which are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because supporting those languages is very important.
Also, the Apertium community is very friendly and open to new members; people here are always ready to help you. That encourages me to work with these people.
Which of the published tasks are you interested in? What do you plan to do?
I would like to work on improving language pairs by mining MediaWiki Content Translation postedits.
The purpose of this proposal is to create a toolbox for automatic improvement of the lexical component of a language pair.
- S: source sentence
- MT: machine translation system (Apertium in our case)
- MT(S): machine translation of S
- PE(MT(S)): post-editing of the machine translation of S
- O(s, mt, pe): the set of extracted postediting operations, where s ∈ substr(S), mt ∈ substr(MT(S)) and pe ∈ substr(PE(MT(S)))
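As a minimal sketch of the definitions above (the class and function names here are my own illustration, not part of any existing code):

```python
from typing import NamedTuple

class PosteditOp(NamedTuple):
    """One extracted postediting operation (s, mt, pe)."""
    s: str   # subsegment of the source sentence S
    mt: str  # subsegment of the machine translation MT(S)
    pe: str  # subsegment of the postedited translation PE(MT(S))

def is_valid(op: PosteditOp, S: str, MTS: str, PES: str) -> bool:
    """Check the substring conditions from the definitions above."""
    return op.s in S and op.mt in MTS and op.pe in PES

# Toy illustration (not real Apertium output):
op = PosteditOp("кубак", "кубок", "чашку")
print(is_valid(op, "Аня выпіла кубак малака",
               "Аня выпила кубок молока",
               "Аня выпила чашку молока"))  # True
```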
Choosing a language pair(s) to experiment with and collecting/processing data.
About collecting and processing data
There are two possible approaches.
- Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/.
+ the target (postedited) side is very close to the given machine translation because it is basically based on it.
- Mediawiki articles often contain very long and very specific sentences, which can affect the quality of extracted triplets, because the current method of extracting them is built on an edit-distance algorithm.
- there will surely be words and sentences in other languages (translations, links and other typical Wikipedia content), which will make our data noisy.
- sometimes posteditors change the contents of a paragraph to make the article better: they split original sentences, add new information, etc. However, these cases could probably be filtered out.
- Using parallel corpora and Apertium translation of the source side.
+ in parallel corpora, especially those other than Europarl, sentences are usually not very long or complicated and contain fairly common words and phrases (Tatoeba is a good example). This is not a rule, but, in my opinion, parallel corpora are still less domain-specific than Mediawiki articles.
+ parallel corpora are more likely to contain less noise.
- the target side might be very different from the Apertium-translated one, especially for long and complicated sentences.
The choice of approach is open for discussion. I think we might experiment with both approaches, or even mix different types of data and see whether there is any difference.
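One possible filter for the noise problems mentioned above is to discard pairs where the postedit has drifted too far from the MT output, measured by an edit-similarity ratio. This is only a sketch; the 0.5 threshold is an arbitrary placeholder that would need tuning:

```python
import difflib

def keep_pair(mt_sentence: str, pe_sentence: str,
              min_ratio: float = 0.5) -> bool:
    """Keep only pairs where the postedit stays reasonably close to
    the MT output; heavily rewritten sentences (split, extended with
    new information, etc.) are discarded as noise."""
    ratio = difflib.SequenceMatcher(None, mt_sentence, pe_sentence).ratio()
    return ratio >= min_ratio

print(keep_pair("Кто-то здесь размаўляе по-русски?",
                "Кто-то здесь говорит по-русски?"))   # small edit: kept
print(keep_pair("Кто-то здесь размаўляе по-русски?",
                "Совершенно другое предложение о футболе."))  # heavy rewrite: filtered
```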
About language pairs
I would like to work with languages I know, or at least can more or less understand. Since I am a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. Also, these are closely related languages, so there will be fewer problems with alignment.
The problem with bel-rus and ukr-rus is the comparatively small amount of postedited data and of more or less suitable parallel corpora. I am still looking for data, but the current situation looks like this:
- Russian -> Belarusian Mediawiki corpus of Apertium-translated and postedited data = 1895 sentences.
- Tatoeba parallel corpus = about 1800 sentences
- A few specific parallel corpora like KDE and GNOME
- Russian -> Ukrainian Mediawiki corpus of Apertium-translated and postedited data = 60 sentences.
- Tatoeba parallel corpus = about 6500 sentences
- OpenSubtitles2016 parallel corpus = about 400000 sentences (might contain free translations)
- OpenSubtitles2018 parallel corpus = about 600000 sentences (might contain free translations)
- A few specific parallel corpora like KDE and GNOME
However, the methods I am going to develop will not be tied to one language pair. We might choose another language pair, or start our experiments with a small amount of data; this question is open for discussion.
Improving existing methods
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.
1. The main problem is the low speed of learn_postedits.py and apply_postedits.py. Repeated Apertium calls and a large maximum source/target segment length make the process slow. This makes extracting triplets from a big corpus very hard, while using a small corpus is simply not useful.
I have added a caching function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now, instead of calling Apertium every time the program needs to translate/analyse a subsegment, it first checks whether the subsegment is already stored in a database. If it is, the cached result is used; if not, the program calls Apertium and then adds the new information to the database.
Results of running the two versions of learn_postedits.py on a 100-sentence corpus with different parameters:
| parameters | with caching | without caching |
|------------|--------------|-----------------|
| -m 1 -M 1  | 2m55s        | 3m41s           |
| -m 2 -M 2  | 6m50s        | 7m08s           |
| -m 3 -M 3  | 8m25s        | 10m25s          |
| -m 4 -M 4  | 11m45s       | 13m18s          |
It is clear that caching saves time. For 100 sentences the difference is not huge, because the database does not contain much information yet, but it should become substantial when running the code on 50,000 sentences.
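The caching idea can be sketched roughly as follows. This is a simplified illustration, not the actual explain2_cache.py code; `TranslationCache` and `fake_apertium` are hypothetical names:

```python
import sqlite3

class TranslationCache:
    """Memoize (subsegment -> translation) results in SQLite so
    repeated Apertium calls for the same subsegment are avoided."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(segment TEXT PRIMARY KEY, translation TEXT)")

    def translate(self, segment, call_apertium):
        row = self.conn.execute(
            "SELECT translation FROM cache WHERE segment = ?",
            (segment,)).fetchone()
        if row is not None:                      # cache hit: no Apertium call
            return row[0]
        translation = call_apertium(segment)     # cache miss: call Apertium once
        self.conn.execute("INSERT INTO cache VALUES (?, ?)",
                          (segment, translation))
        self.conn.commit()
        return translation

# Usage with a stand-in for the real Apertium call:
calls = []
def fake_apertium(seg):
    calls.append(seg)
    return seg.upper()

cache = TranslationCache()
cache.translate("кубак", fake_apertium)
cache.translate("кубак", fake_apertium)   # served from the cache
print(len(calls))  # 1
```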
What else needs to be done:
- Caching function for apply_postedits.py
- Searching for other ways of improving the speed.
2. There might be problems with the current alignment method, because extracted postediting operations currently look pretty strange even with a high fuzzy-match threshold. The alignment implementation should be checked carefully.
3. Some refactoring of the old code needs to be done.
Searching for extracted postediting operations that improve the quality of translation
The next step is to process a big training set with different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of the original Apertium translation on a regular basis.
A language model score might serve as the criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.
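As an illustration of the language-model criterion, here is a toy add-one-smoothed bigram model; a real setup would train a proper LM (e.g. with an existing toolkit) on a large target-language corpus:

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Train a toy add-one-smoothed bigram LM on a list of sentences."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    V = len(vocab)
    def logprob(sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
                   for a, b in zip(toks, toks[1:]))
    return logprob

# Tiny target-language "corpus" for illustration only:
lm = train_bigram_lm(["кто-то здесь говорит по-русски",
                      "она говорит по-русски"])
mt = "кто-то здесь размаўляе по-русски"   # raw Apertium output
pe = "кто-то здесь говорит по-русски"     # postedited version
print(lm(pe) > lm(mt))  # True: the postedited sentence scores higher
```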
Classifying successful postediting operations
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.
There might be a few types:
- Monodix/bidix entries
- Lexical selection rules
- Transfer rules (?)
- and so on
For example, to identify potential bidix entries, we might choose the triplets (s, mt, pe) from O such that s = mt, i.e. Apertium left the subsegment untranslated.
However, this obviously will not be enough to identify potential lexical selection rules: for those, we should look carefully at the context.
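The bidix-entry condition above can be sketched as a simple filter (function and variable names here are hypothetical):

```python
def bidix_candidates(operations):
    """Select operations where the MT output equals the source
    subsegment, i.e. Apertium left the word untranslated, so the
    postedit supplies a candidate bidix translation (s -> pe)."""
    return {(s, pe) for (s, mt, pe) in operations if s == mt}

ops = [
    ("размаўляе", "размаўляе", "говорит"),  # untranslated word: bidix candidate
    ("кубак", "кубок", "чашку"),            # translated but wrong: needs context
]
print(bidix_candidates(ops))  # {('размаўляе', 'говорит')}
```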
Creating tools for inserting useful information into a language pair
The last step is to create tools for inserting the useful information into a language pair. These might be scripts which automatically create monodix/bidix entries, or write rules based on the given data and its type, and insert them into the dictionaries. It might also be a new module in the pipeline. The final decision will depend on the results of the previous stage.
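For instance, a script for generating bidix entries might look roughly like this. This is only a sketch: `make_bidix_entry` is a hypothetical name, and real entries would also need correct paradigm/tag information inferred from the analysers:

```python
import xml.etree.ElementTree as ET

def make_bidix_entry(left_lemma, right_lemma, pos="n"):
    """Build an Apertium bidix <e> entry pairing two lemmas
    with a single part-of-speech tag."""
    e = ET.Element("e")
    p = ET.SubElement(e, "p")
    for side, lemma in (("l", left_lemma), ("r", right_lemma)):
        node = ET.SubElement(p, side)
        node.text = lemma
        ET.SubElement(node, "s", n=pos)   # part-of-speech symbol
    return ET.tostring(e, encoding="unicode")

print(make_bidix_entry("кубак", "чашка"))
# <e><p><l>кубак<s n="n" /></l><r>чашка<s n="n" /></r></p></e>
```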
Reasons why Google and Apertium should sponsor it
This toolbox might become a great way of improving language pairs, filling gaps in dictionaries while reducing the amount of human work. Even the released Apertium pairs are not perfect and sometimes make mistakes that can be easily fixed.
For example, Apertium translates the Belarusian sentence "Нехта тут размаўляе па-руску?" ("Does somebody here speak Russian?") into Russian as "Кто-то здесь размаўляе по-русски?", when a correct translation would be "Кто-то здесь говорит по-русски?". The problem with this example is obvious: Apertium doesn't know the word "размаўляць". But this can easily be fixed with the methods described above (sections 5.2.3, 5.2.4).
Another example: in Belarusian, the word "кубак" behaves the same way as English "cup": it appears in contexts like "Аня выпіла кубак малака" ("Anya has drunk a cup of milk") as well as in contexts like "Кубак Нямеччыны па футболе" ("German football cup"). Apertium translates "Аня выпіла кубак малака" into Russian as "Аня выпила кубок молока" and "Кубак Нямеччыны па футболе" as "Кубок Нямеччыны па футболе" (the latter translation has many mistakes, but we are looking only at "кубок" now).
The second translation of "кубак" is correct (though the correct translation of the whole sentence would be "Кубок Германии по футболу"), but the first one looks strange: it should be "Аня выпила чашку/кружку молока" instead. In this case we could find sentences which are improved by applying the postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the words these sentences have in common and find out that the word "выпiть" appears right before "кубак" in every one of them) and then extract and write a lexical selection rule like this:
<rule>
  <match lemma="выпiть" tags="*"/>
  <match lemma="кубак" tags="n.*">
    <select lemma="чашка" tags="n.*"/>
  </match>
</rule>
These are just a few examples. I believe there are many more ways to use postediting information to improve a language pair.
A description of how and who it will benefit in society
Firstly, the methods developed during this project will help to improve translation quality for many language pairs and reduce the amount of human work.
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.
Post application period
- Learning more about the structure of Apertium dictionaries and tools
- Taking an online statistics course to refresh my knowledge
- Working on the old code
Community bonding period
- Learning more about the structure of Apertium dictionaries and tools
- Discussing questions about data types and language pairs to work with
- Looking for suitable data
- Week 1: collecting and parsing the data, doing preprocessing, if needed, improving the existing code
- Week 2: improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)
- Week 3: making experiments with data, extracting triplets
- Week 4: searching for extracted postediting operations that actually improve the quality of translation
- Deliverable #1, June 11 - 15
- Week 5: studying and classifying successful postediting operations
- Week 6: studying and classifying successful postediting operations
- Week 7: studying and classifying successful postediting operations
- Week 8: studying and classifying successful postediting operations
- Deliverable #2, July 9 - 13
- Week 9: writing tools for inserting extracted information in a language pair
- Week 10: writing tools for inserting extracted information in a language pair
- Week 11: testing, fixing bugs
- Week 12: cleaning up the code, writing documentation
- Final evaluation, August 6 - 14
- Project completed: a toolbox for automatic improvement of the lexical component of a language pair.
Part 1, weeks 1-4:
Part 2, weeks 5-8:
Part 3, weeks 9-12:
Also, I am going to write short notes about the work process on my project page throughout the summer.
Non-Summer-of-Code plans you have for the Summer
I have exams at the university until the third week of June, so until then I will only be able to work 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After my exams I will be able to work full time and spend 45-50 hours per week on the task.
- parse_ct_json.py: a script that parses a Mediawiki JSON file and splits the whole corpus into training and test sets of a given size.
- estimate_changes.py: a script that takes a file generated by apply_postedits.py and scores the sentences which were processed with postediting rules using a language model.
In addition, I have added a caching function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py):
I have also refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. The new version is stored in the same folder, in cleaned_learn_postedits.py.