User:Deltamachine/proposal
Revision as of 14:16, 19 March 2017

Contact information

Name: Anna Kondratjeva

Location: Moscow, Russia

E-mail: an-an-kondratjeva@yandex.ru

Phone number: +79250374221

Github: http://github.com/deltamachine

IRC: deltamachine

SourceForge: deltamachine

Timezone: UTC+3

Skills and experience

Education: Bachelor's Degree in Fundamental and Computational Linguistics (2015 - expected 2019), National Research University «Higher School of Economics» (NRU HSE)

Main university courses:

  • Theory of Language (Phonetics, Morphology, Syntax, Semantics)
  • Programming (Python)
  • Computer Tools for Linguistic Research
  • Language Diversity and Typology
  • Introduction to Data Analysis
  • Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)

Technical skills: Python (advanced), HTML, CSS, Flask, Django, SQLite (familiar)

Projects and experience: http://github.com/deltamachine

Languages: Russian (native), English, German

Why is it you are interested in machine translation?

I am truly interested in machine translation because it combines my two favourite fields of study - linguistics and programming. As a computational linguist, I would like to know how machine translation systems are built, how they work with language material and how we can improve the results of their work. So, on the one hand, I can learn a lot of new things about the structures of different languages while working with a machine translation system like Apertium; on the other hand, I can significantly improve my coding skills, learn more about natural language processing and create something great and useful.

Why is it that you are interested in Apertium?

There are three main reasons why I want to work with Apertium:

1. Apertium works with a lot of minority languages, which is great, because it is pretty unusual for a machine translation system: there are a lot of systems which can translate from English to German pretty well, but very few which can translate, for example, from Kazakh to Tatar. Apertium is one of those few systems, and I believe it does a very important job.

2. Apertium does rule-based machine translation, which is unusual too. As a linguist, I am very curious to learn more about this approach, because rule-based translation requires working with language structure and a large amount of language data.

3. The Apertium community is very friendly, helpful, responsive and open to new members, which is very attractive.

Which of the published tasks are you interested in? What do you plan to do?

I would like to implement a prototype shallow syntactic function labeller.

Reasons why Google and Apertium should sponsor it

A description of how and who it will benefit in society

In many languages (especially in ergative ones) it is very useful to know the syntactic function of a word in order to produce an adequate translation. So the shallow syntactic function labeller, as a part of the Apertium system, will help to improve the quality of translation for many language pairs.

Work plan

Post application period

  • Getting familiar with Apertium, reading the documentation, playing around with its tools
  • Setting up Linux and getting used to it
  • Learning more about UD treebanks
  • Learning more about machine learning

Community bonding period

  • Choosing the language pairs the shallow function labeller will work with
  • Choosing the most appropriate Python ML library (maybe it will be TensorFlow, maybe not)

Schedule

  • Week 1:
  • Week 2:
  • Week 3:
  • Week 4:
  • Deliverable #1, June 26 - 30:
  • Week 5:
  • Week 6:
  • Week 7:
  • Week 8:
  • Deliverable #2, July 24 - 28:
  • Week 9:
  • Week 10:
  • Week 11: evaluating the quality of the prototype, final testing
  • Week 12: cleaning up the code, writing documentation
  • Project completed: the prototype shallow syntactic function labeller, which is able to label sentences in supported languages.

Non-Summer-of-Code plans you have for the Summer

I have exams at university until the third week of June, so until then I will be able to work only 20-25 hours per week. But I will try to pass as many exams as possible ahead of schedule, in May, so this may change. After that I will be able to work full time and spend 45-50 hours per week on the task.

Coding challenge

https://github.com/deltamachine/wannabe_hackerman

  • flatten_conllu.py: A script that takes a dependency treebank in UD format and "flattens" it, that is, applies the following transformations:
    • Words with the @conj relation take the label of their head
    • Words with the @parataxis relation take the label of their head
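The flattening step can be sketched roughly as follows. This is a minimal illustration, not the actual flatten_conllu.py: the function name and the list-of-rows input format are simplifying assumptions (real CoNLL-U files also carry comments and multiword-token lines that a full script must handle).

```python
# Sketch: make tokens attached as conj or parataxis inherit the
# dependency label of their head, walking up over chained cases.
# Each row is a list of the 10 CoNLL-U fields:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC

FLATTEN_RELS = ("conj", "parataxis")

def flatten(sentence):
    by_id = {row[0]: row for row in sentence}  # ID -> row
    for row in sentence:
        if row[7] in FLATTEN_RELS:
            head = by_id.get(row[6])
            # the head itself may be conj/parataxis: keep climbing
            while head is not None and head[7] in FLATTEN_RELS:
                head = by_id.get(head[6])
            if head is not None:
                row[7] = head[7]  # take the head's label
    return sentence
```

For example, in "I run and jump", where "jump" is conj on "run" (the root), flattening relabels "jump" as root.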
  • calculate_accuracy_index.py: A script that does the following:
    • Takes a -train.conllu file and builds a table: surface_form - label - frequency
    • Takes a -dev.conllu file and assigns to each token the most frequent label from the table
    • Calculates the accuracy index
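The most-frequent-label baseline behind that script can be sketched like this. Function names and the pairs-of-tuples input are illustrative assumptions; the real script reads .conllu files.

```python
from collections import Counter, defaultdict

def build_table(train_pairs):
    """train_pairs: iterable of (surface_form, label) pairs.
    Returns a dict mapping each form to its most frequent label."""
    counts = defaultdict(Counter)
    for form, label in train_pairs:
        counts[form][label] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}

def accuracy(table, dev_pairs, default="_"):
    """dev_pairs: iterable of (surface_form, gold_label) pairs.
    Assigns each form its most frequent label and scores against gold."""
    dev_pairs = list(dev_pairs)
    correct = sum(1 for form, gold in dev_pairs
                  if table.get(form, default) == gold)
    return correct / len(dev_pairs)
```

This baseline gives a useful lower bound that the ML classifier built later in the project should beat.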
  • label_asf: A script that takes a sentence in Apertium stream format and for each surface form applies the most frequent label from the labelled corpus.
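The labelling of a stream-format sentence can be sketched as below, assuming lexical units of the shape ^surface/lemma<tags>$ and a label appended as an extra tag. The LABELS table, the @-prefixed label names and the function name are illustrative assumptions, not the actual label_asf script.

```python
import re

# Hypothetical frequency table: surface form -> most frequent label.
LABELS = {"dogs": "@subj", "barked": "@root"}

def label_stream(line, labels=LABELS, default="@x"):
    """Append the most frequent function label as an extra tag to each
    lexical unit ^surface/lemma<tags>$ in an Apertium stream line."""
    def repl(match):
        surface, analysis = match.group(1), match.group(2)
        label = labels.get(surface.lower(), default)
        return "^%s/%s<%s>$" % (surface, analysis, label)
    return re.sub(r"\^([^/^$]+)/([^$]+)\$", repl, line)
```

For example, "^Dogs/dog<n><pl>$ ^barked/bark<vblex><past>$" becomes "^Dogs/dog<n><pl><@subj>$ ^barked/bark<vblex><past><@root>$".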