User:Deltamachine/proposal2018

From Apertium

Contact information

Name: Anna Kondrateva

Location: Moscow, Russia

E-mail: an-an-kondratjeva@yandex.ru

Phone number: +79250374221

IRC: deltamachine

SourceForge: deltamachine

Timezone: UTC+3

Skills and experience

I am a third-year bachelor's student at the Faculty of Linguistics of the National Research University Higher School of Economics (NRU HSE).

Main university courses:

  • Programming (Python, R)
  • Computer Tools for Linguistic Research
  • Theory of Language (Phonetics, Morphology, Syntax, Semantics)
  • Language Diversity and Typology
  • Machine Learning
  • Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)
  • Theory of Algorithms
  • Databases

Technical skills:

  • Programming languages: Python, R, Javascript
  • Web design: HTML, CSS
  • Frameworks: Flask, Django
  • Databases: SQLite, PostgreSQL, MySQL

Projects and experience: http://github.com/deltamachine

Languages: Russian (native), English, German

Why is it you are interested in machine translation?

I'm a computational linguist and I'm in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines them in the most interesting way. Working with a machine translation system will allow me to learn more about different languages and their structures, about modern approaches to machine translation, and about what results such systems can achieve. This is very exciting!

Why is it that you are interested in Apertium?

I participated in GSoC 2017 with Apertium and it was a great experience. I successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills improved a lot and I want to develop them further. I also participated in GCI 2017 as a mentor for Apertium, which was great too. So I am very interested in continuing to contribute to Apertium.

This organisation works on things that are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because this work is very important and, at the same time, far from mainstream.

Also, the Apertium community is very friendly and open to new members; people here are always ready to help you. That encourages me to keep working with these people.

Which of the published tasks are you interested in? What do you plan to do?

I would like to work on improving language pairs by mining MediaWiki Content Translation postedits.

Reasons why Google and Apertium should sponsor it

A description of how and who it will benefit in society

Firstly, the methods developed during this project will help to improve translation quality for many language pairs and reduce the amount of human work.

Secondly, there are currently very few papers on using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.

Work plan

Post application period

Community bonding period

Work period

    Part 1, weeks 1-4:

  • Week 1:
  • Week 2:
  • Week 3:
  • Week 4:
  • Deliverable #1, June 26 - 30

    Part 2, weeks 5-8:

  • Week 5:
  • Week 6:
  • Week 7:
  • Week 8:
  • Deliverable #2, July 24 - 28

    Part 3, weeks 9-12:

  • Week 9:
  • Week 10:
  • Week 11: testing, fixing bugs
  • Week 12: cleaning up the code, writing documentation
  • Project completed:

I am also going to write short notes about the work process on my project page throughout the summer.

Non-Summer-of-Code plans you have for the Summer

I have exams at university until the third week of June, so until then I will only be able to work 20-25 hours per week. But since I am already familiar with the Apertium system, I can start working on the project during the community bonding period. After my exams I will be able to work full time and spend 45-50 hours per week on the task.

Coding challenge

https://github.com/deltamachine/naive-automatic-postediting

  • parse_ct_json.py: A script that parses a MediaWiki Content Translation JSON file and splits the corpus into train and test sets of a given size.
  • estimate_changes.py: A script that takes a file generated by apply_postedits.py and scores the sentences that were processed with postediting rules using a language model.
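The train/test split performed by parse_ct_json.py can be sketched roughly as follows (a minimal illustration only; the actual field names and structure of the Content Translation dump are not shown here, so the toy corpus below stands in for the parsed data):

```python
import random

def split_corpus(pairs, test_size, seed=42):
    """Shuffle (source, target) sentence pairs and split them into
    a train set and a test set of the requested size."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Toy corpus standing in for parsed Content Translation data.
corpus = [("src %d" % i, "tgt %d" % i) for i in range(10)]
train, test = split_corpus(corpus, test_size=2)
print(len(train), len(test))  # 8 2
```

The fixed random seed makes the split reproducible, which matters when postediting operations learned from the train set are later evaluated on the same test set.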

I have also refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. The new version is stored in the repository as cleaned_learn_postedits.py.
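In the simplest case, a learned postediting operation can be thought of as a fragment replacement applied to the raw MT output. The sketch below is a deliberately simplified illustration, not the actual rule format used by learn_postedits.py, and the example operation is hypothetical:

```python
def apply_operations(sentence, operations):
    """Apply a list of (mt_fragment, postedited_fragment) replacements
    to an MT output sentence."""
    for mt_fragment, fixed in operations:
        sentence = sentence.replace(mt_fragment, fixed)
    return sentence

# Hypothetical Spanish postediting operation: contract "de el" to "del".
ops = [("casa de el", "casa del")]
print(apply_operations("la casa de el hombre", ops))  # la casa del hombre
```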

cleaned_learn_postedits.py was run on an English - Spanish train set of 500 sentences. The list of learned potential postediting operations is stored in postediting_operations.txt. Then I applied these operations to the test set of 100 sentences; the results are stored in pe_sentences.txt. After that I scored these results with a language model using estimate_changes.py; the scores are stored in scores.txt.
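The scoring step can be illustrated with a toy add-one-smoothed unigram language model (an illustrative stand-in only; estimate_changes.py uses a real language model, and the sentences and corpus below are made up):

```python
import math
from collections import Counter

def train_unigram_lm(corpus_tokens, vocab_size):
    """Count unigrams and return a Laplace-smoothed log-probability function."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab_size))
    return logprob

def score_sentence(sentence, logprob):
    """Sum per-token log-probabilities: a higher score means the
    sentence looks more fluent under the language model."""
    return sum(logprob(tok) for tok in sentence.split())

# Tiny training sample; a real setup would use a large monolingual corpus.
tokens = "the cat sat on the mat the dog sat".split()
lp = train_unigram_lm(tokens, vocab_size=1000)

original = "the cat zzz on the mat"      # raw MT output (made up)
postedited = "the cat sat on the mat"    # after a postediting operation
# Keep the postedited version only if the LM prefers it.
better = score_sentence(postedited, lp) > score_sentence(original, lp)
print(better)  # True
```

Comparing the score of a sentence before and after applying an operation gives a simple filter: operations that consistently lower the language-model score can be discarded.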