User:Deltamachine/proposal

1 Contact information
2 Skills and experience
3 Why is it you are interested in machine translation?
4 Why is it that you are interested in Apertium?
5 Which of the published tasks are you interested in? What do you plan to do?
6 Reasons why Google and Apertium should sponsor it
7 A description of how and who it will benefit in society
8 Work plan
9 Non-Summer-of-Code plans you have for the Summer
10 Coding challenge

Contact information

Name: Anna Kondratjeva

Location: Moscow, Russia

E-mail: an-an-kondratjeva@yandex.ru

Phone number: +79250374221

Github: http://github.com/deltamachine

IRC: deltamachine

SourceForge: deltamachine

Timezone: UTC+3

Skills and experience

Education: Bachelor's Degree in Fundamental and Computational Linguistics (2015 - expected 2019), National Research University «Higher School of Economics» (NRU HSE)

Main university courses:

Programming (Python)
Computer Tools for Linguistic Research
Theory of Language (Phonetics, Morphology, Syntax, Semantics)
Language Diversity and Typology
Introduction to Data Analysis
Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)

Technical skills: Python (advanced), HTML, CSS, Flask, Django, SQLite (familiar)

Projects and experience: http://github.com/deltamachine

Languages: Russian (native), English, German

Why is it you are interested in machine translation?

I am deeply interested in machine translation because it combines my two favourite fields of study, linguistics and programming. As a computational linguist, I would like to know how machine translation systems are built, how they work with language material and how their results can be improved. On the one hand, working with a machine translation system like Apertium lets me learn a lot about the structures of different languages. On the other hand, I can significantly improve my coding skills, learn more about natural language processing and create something great and useful.

Why is it that you are interested in Apertium?

There are three main reasons why I want to work with Apertium:

1. Apertium works with a lot of minority languages, which is great, because that is quite unusual for a machine translation system. There are many systems that translate from English to German well enough, but very few that can translate, for example, from Kazakh to Tatar. Apertium is one of those few, and I believe it does a very important job.

2. Apertium does rule-based machine translation, which is unusual too. As a linguist, I am very curious to learn more about this approach, because rule-based translation requires working closely with language structure and a large amount of language data.

3. The Apertium community is very friendly, helpful, responsive and open to new members, which is very attractive.

Which of the published tasks are you interested in? What do you plan to do?

I would like to implement a shallow syntactic function labeller.

The first idea was to take an annotated corpus (a dependency treebank in UD format) and build a table of "surface form - label - frequency", then take a test corpus, assign each token its most frequent label from the table and calculate the accuracy score. All scripts, with descriptions, are available in the "Coding challenge" section.

This approach gives acceptable results (the accuracy score was 0.76 for Russian, 0.62 for English and 0.68 for Spanish), but we can definitely reach higher results.

So the next idea is to use machine learning methods to create a better prototype of the shallow syntactic function labeller.

A brief concept: the shallow syntactic function labeller takes a string in Apertium stream format, parses it into a sequence of morphological tags and passes it to a classifier. The classifier is a sequence-to-sequence model trained on datasets prepared from parsed UD treebanks. The dataset for the encoder contains sequences of morphological tags and the dataset for the decoder contains sequences of labels; in both cases one sequence is one sentence. The classifier analyzes the given sequence of morphological tags and outputs a sequence of labels, which the labeller then applies to the original string.
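The first step of this concept, turning an Apertium stream format string into a sequence of morphological tags, could look roughly like the sketch below. It is only an illustration under simplifying assumptions: the function name tags_from_stream is hypothetical, only the first analysis of an ambiguous lexical unit is used, and escaped characters inside lexical units are not handled.

```python
import re

# One lexical unit looks like ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")
# One morphological tag looks like <n> or <pl>
TAG = re.compile(r"<([^<>]+)>")


def tags_from_stream(stream):
    """Return one list of tags per lexical unit (hypothetical helper).

    Assumption: only the first analysis of each unit is used, so the
    input is expected to be mostly disambiguated.
    """
    sequence = []
    for unit in LEXICAL_UNIT.findall(stream):
        parts = unit.split("/")
        # parts[0] is the surface form, parts[1:] are the analyses.
        analysis = parts[1] if len(parts) > 1 else ""
        sequence.append(TAG.findall(analysis))
    return sequence
```

An unknown word such as ^foo/*foo$ carries no tags, so it yields an empty list. How the classifier should treat such tokens is left open here.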

Reasons why Google and Apertium should sponsor it

Firstly, the shallow syntactic function labeller will help to improve the quality of Apertium's translation. Secondly, there are currently not many projects that use machine learning methods for rule-based machine translation and shallow syntactic function labelling, so my work will contribute to learning more about this approach.

A description of how and who it will benefit in society

In many languages (especially ergative ones) knowing the syntactic function of a word is very useful for producing an adequate translation. So the shallow syntactic function labeller, as a part of the Apertium system, will help to improve the quality of translation for many language pairs.

Work plan

Post application period

Getting more familiar with Apertium and its tools, reading documentation
Setting up Linux and getting used to it
Learning more about machine learning, looking for more research on sequence-to-sequence models
Learning more about UD treebanks

Community bonding period

Choosing the language pairs the shallow function labeller will work with
Choosing the most suitable Python ML library (the task can be done with TensorFlow, but we may need a library that is less complex and has a simple runtime)
Thinking about how to integrate the classifier into Apertium
Learning more about possible problems

Work period

Part 1, weeks 1-4: preparing the data (includes a lot of thinking)

Week 1: writing scripts for parsing UD-treebanks
Week 2: writing scripts for parsing UD-treebanks, creating datasets
Week 3: creating datasets (in a few possible variants)
Week 4: writing a script for parsing a string in Apertium stream format into a sequence of morphological tags
Deliverable #1, June 26 - 30

Part 2, weeks 5-8: building the classifier

Week 5: building the model
Week 6: training the classifier, evaluating the quality of the prototype
Week 7: further training, working on improvements of the model
Week 8: final testing
Deliverable #2, July 24 - 28

Part 3, weeks 9-12: integrating the labeller into Apertium

Week 9: writing a script that applies labels to the original string in Apertium stream format, putting all parts of the labeller together
Week 10: integrating the labeller into Apertium
Week 11: integrating the labeller into Apertium, testing, fixing bugs
Week 12: cleaning up the code, writing documentation
Project completed: a prototype shallow syntactic function labeller that labels sentences well enough and works with several languages.

I am also going to write short notes about the work process on my project page throughout the summer.

Non-Summer-of-Code plans you have for the Summer

I have exams at the university until the third week of June, so until then I will be able to work only 20-25 hours per week. However, I will try to take as many exams as possible in advance, in May, so this may change. After the exams I will be able to work full time and spend 45-50 hours per week on the task.

Coding challenge

https://github.com/deltamachine/wannabe_hackerman

flatten_conllu.py: A script that takes a dependency treebank in UD format and "flattens" it, that is, applies the following transformations:

Words with the @conj relation take the label of their head
Words with the @parataxis relation take the label of their head
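The two transformations above can be sketched as follows. This is not the code from the repository, just a minimal illustration: the function name flatten_sentence is hypothetical, a sentence is assumed to be a list of 10-column CoNLL-U rows, and relation subtypes (anything after a colon) are assumed to inherit like their base relation.

```python
def flatten_sentence(rows):
    """Return a copy of the rows with conj/parataxis labels replaced
    by the label of the head (hypothetical helper).

    rows: list of 10-column CoNLL-U token rows (lists of strings);
    column 0 is ID, column 6 is HEAD, column 7 is DEPREL.
    """
    by_id = {row[0]: row for row in rows}
    inherited = {"conj", "parataxis"}

    def resolve(row, seen=()):
        label = row[7]
        head = by_id.get(row[6])
        # Follow the chain upwards while the relation is inherited;
        # `seen` guards against cycles in malformed input.
        if (label.split(":")[0] in inherited
                and head is not None
                and row[0] not in seen):
            return resolve(head, seen + (row[0],))
        return label

    # Labels are computed from the original rows before any row changes.
    labels = [resolve(row) for row in rows]
    return [row[:7] + [label] + row[8:] for row, label in zip(rows, labels)]
```

For "Cats and dogs sleep", where "dogs" is attached to "cats" with conj and "cats" is the nsubj of "sleep", the flattened label of "dogs" becomes nsubj.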

calculate_accuracy_index.py: A script that does the following:

Takes a -train.conllu file and builds the table: surface_form - label - frequency
Takes a -dev.conllu file and assigns each token the most frequent label from the table
Calculates the accuracy index
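A minimal sketch of these three steps is shown below. It is an illustration only, not the repository script: the function names are hypothetical, surface forms are lowercased (an assumption), and tokens unseen in training receive a placeholder label that never matches the gold label.

```python
from collections import Counter, defaultdict


def read_tokens(lines):
    """Yield (surface_form, label) pairs from CoNLL-U lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        yield cols[1].lower(), cols[7]


def build_table(train_lines):
    """Map each surface form to a Counter of its labels."""
    table = defaultdict(Counter)
    for form, label in read_tokens(train_lines):
        table[form][label] += 1
    return table


def accuracy(table, dev_lines, fallback="_"):
    """Share of dev tokens whose most frequent label equals the gold label."""
    correct = total = 0
    for form, gold in read_tokens(dev_lines):
        counts = table.get(form)
        predicted = counts.most_common(1)[0][0] if counts else fallback
        correct += predicted == gold
        total += 1
    return correct / total if total else 0.0
```

With real treebanks, the two arguments would be the opened -train.conllu and -dev.conllu files, since both functions accept any iterable of lines.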

label_asf: A script that takes a sentence in Apertium stream format and applies to each surface form its most frequent label from the labelled corpus.
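The labelling step could be sketched as below. Several things here are assumptions and not taken from the repository: the function name label_stream, the mapping most_frequent (lowercased surface form to label), and in particular the output convention of appending the label as a <@label> tag to every analysis. Unknown surface forms are left unchanged.

```python
import re

# One lexical unit looks like ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def label_stream(stream, most_frequent):
    """Append the most frequent label of each surface form as a tag.

    most_frequent: dict from lowercased surface form to label
    (hypothetical; it would come from the frequency table).
    """
    def add_label(match):
        surface, *analyses = match.group(1).split("/")
        label = most_frequent.get(surface.lower())
        if label is None:
            return match.group(0)  # unknown form: keep the unit as it is
        tagged = [analysis + "<@" + label + ">" for analysis in analyses]
        return "^" + "/".join([surface] + tagged) + "$"

    return LEXICAL_UNIT.sub(add_label, stream)
```

Text between lexical units, such as spaces, passes through untouched because only the matched units are rewritten.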
