Difference between revisions of "User:Deltamachine/proposal"
== Contact information ==

<p>'''Name:''' Anna Kondratjeva</p>

<p>'''Location:''' Moscow, Russia</p>

<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p>

<p>'''Phone number:''' +79250374221</p>

<p>'''Github:''' http://github.com/deltamachine</p>

<p>'''IRC:''' deltamachine</p>

<p>'''Timezone:''' UTC+3</p>

== Skills and experience ==

<p>'''Education:''' Bachelor's Degree in Fundamental and Computational Linguistics (2015 - expected 2019), National Research University «Higher School of Economics» (NRU HSE)</p>

<p>'''Main university courses:'''</p>

<ul>
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li>
<li>Programming (Python)</li>
<li>Computer Tools for Linguistic Research</li>
<li>Language Diversity and Typology</li>
<li>Introduction to Data Analysis</li>
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li>
</ul>

<p>'''Technical skills:''' Python (experienced, 1.5 years), HTML, CSS, Flask, Django, SQLite (familiar)</p>

<p>'''Projects and experience:''' http://github.com/deltamachine</p>

<p>'''Languages:''' Russian (native), English, German</p>

== Why is it you are interested in machine translation? ==

I am truly interested in machine translation because it combines my two favourite fields of study: linguistics and programming. As a computational linguist, I would like to know how machine translation systems are built, how they work with language material, and how their results can be improved. On the one hand, working with a machine translation system like Apertium will teach me a lot about the structures of different languages; on the other hand, it will let me significantly improve my coding skills, learn more about natural language processing, and create something great and useful.

== Why is it that you are interested in Apertium? ==

There are three main reasons why I want to work with Apertium:

<p>1. Apertium works with a lot of minority languages, which is great because it is unusual for a machine translation system: many systems can translate from English to German pretty well, but very few can translate, for example, from Kazakh to Tatar. Apertium is one of those few, and I believe it does very important work.</p>

<p>2. Apertium does rule-based machine translation, which is unusual too. As a linguist, I am very curious to learn more about this approach, because rule-based translation requires working with language structure and a large amount of language data.</p>

<p>3. The Apertium community is very friendly, helpful and open to new members, which is very attractive.</p>

== Which of the published tasks are you interested in? What do you plan to do? ==

I would like to implement a prototype shallow syntactic function labeller.

== Reasons why Google and Apertium should sponsor it ==

== A description of how and who it will benefit in society ==

== Work plan ==

=== Post application period ===

<ul>
<li>Getting more familiar with Apertium: reading documentation, playing around with its tools</li>
<li>Setting up Linux and getting used to it</li>
<li>Learning more about UD treebanks</li>
<li>Learning more about machine learning</li>
</ul>

=== Community bonding period ===

<ul>
<li>Choosing the language pairs the shallow function labeller will work with.</li>
<li>Choosing the most appropriate Python ML library (maybe TensorFlow, maybe not)</li>
</ul>

=== Work period ===

<ul>
<li>'''1st month:''' preparing the data, processing treebanks, creating datasets for training.</li>
<li>'''2nd month:''' working on a classifier, testing.</li>
<li>'''3rd month:''' integrating the shallow function labeller into Apertium, testing, fixing bugs, writing documentation.</li>
</ul>

=== Schedule ===

<ul>
<li>'''Week 1:'''</li>
<li>'''Week 2:'''</li>
<li>'''Week 3:'''</li>
<li>'''Week 4:'''</li>
<li>'''Deliverable #1, June 26 - 30:'''</li>
<li>'''Week 5:'''</li>
<li>'''Week 6:'''</li>
<li>'''Week 7:'''</li>
<li>'''Week 8:'''</li>
<li>'''Deliverable #2, July 24 - 28:'''</li>
<li>'''Week 9:'''</li>
<li>'''Week 10:'''</li>
<li>'''Week 11:'''</li>
<li>'''Week 12:'''</li>
<li>'''Project completed'''</li>
</ul>

== Non-Summer-of-Code plans you have for the Summer ==

I have university exams until the third week of June, so until then I will be able to work only 20-25 hours per week. However, I will try to pass as many exams as possible ahead of schedule, in May, so this may change.

After that I will be able to work full time and spend 45-50 hours per week on the task.

== Coding challenge ==

<p>https://github.com/deltamachine/wannabe_hackerman</p>

<ul>
<li>''flatten_conllu.py:'' A script that takes a dependency treebank in UD format and "flattens" it, that is, applies the following transformations:</li>
<ul>
<li>Words with the @conj relation take the label of their head</li>
<li>Words with the @parataxis relation take the label of their head</li>
</ul>
<li>''calculate_accuracy_index.py:'' A script that does the following:</li>
<ul>
<li>Takes -train.conllu file and calculates the table: surface_form - label - frequency</li>
<li>Takes -dev.conllu file and for each token assigns the most frequent label from the table</li>
<li>Calculates the accuracy index</li>
</ul>
<li>''label_asf:'' A script that takes a sentence in Apertium stream format and for each surface form applies the most frequent label from the labelled corpus.</li>
</ul>
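
The three steps above can be illustrated with a minimal sketch. This is not the code from the repository: all function names, the <code>row</code> helper and the toy sentences are hypothetical and exist only for illustration. It assumes the standard ten-column CoNLL-U layout (HEAD in column 7, DEPREL in column 8), matches <code>conj</code> and <code>parataxis</code> exactly (no subtypes), follows chains of such relations up to the first differently labelled head, counts unseen surface forms as errors, and reads Apertium stream format with a simplified pattern that ignores escaped characters.

```python
import re
from collections import Counter, defaultdict

FLATTENED = ("conj", "parataxis")  # relations that inherit the head's label


def read_conllu(text):
    """Parse CoNLL-U text into sentences of {id, form, head, deprel} dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip ranges ("1-2") and empty nodes ("1.1")
            continue
        current.append({"id": int(cols[0]), "form": cols[1],
                        "head": int(cols[6]), "deprel": cols[7]})
    if current:
        sentences.append(current)
    return sentences


def flatten(sentence):
    """Give conj/parataxis words the label of their head (following chains)."""
    by_id = {tok["id"]: tok for tok in sentence}
    flat = []
    for tok in sentence:
        cur, seen = tok, set()
        while (cur["deprel"] in FLATTENED and cur["head"] in by_id
               and cur["id"] not in seen):  # 'seen' guards against cycles
            seen.add(cur["id"])
            cur = by_id[cur["head"]]
        flat.append(dict(tok, deprel=cur["deprel"]))
    return flat


def train_table(sentences):
    """Map each surface form to its most frequent label."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for tok in sent:
            counts[tok["form"]][tok["deprel"]] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


def accuracy(table, sentences):
    """Share of tokens whose predicted label equals the gold label."""
    total = correct = 0
    for sent in sentences:
        for tok in sent:
            total += 1
            correct += table.get(tok["form"]) == tok["deprel"]
    return correct / total if total else 0.0


# Simplified lexical-unit pattern: ^surface/analyses$ (escapes not handled).
LU = re.compile(r"\^([^/$]*)/[^$]*\$")


def label_stream(table, stream, unknown="_"):
    """Return (surface form, most frequent label) pairs for a stream."""
    return [(m.group(1), table.get(m.group(1), unknown))
            for m in LU.finditer(stream)]


def row(i, form, head, deprel):
    """Hypothetical helper: build one ten-column CoNLL-U line."""
    return "\t".join([str(i), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


TRAIN = "\n".join([row(1, "cats", 2, "nsubj"), row(2, "sleep", 0, "root"),
                   row(3, "and", 4, "cc"), row(4, "purr", 2, "conj")])
DEV = "\n".join([row(1, "dogs", 2, "nsubj"), row(2, "sleep", 0, "root"),
                 row(3, "and", 4, "cc"), row(4, "purr", 2, "conj")])

flat_train = [flatten(s) for s in read_conllu(TRAIN)]
flat_dev = [flatten(s) for s in read_conllu(DEV)]
table = train_table(flat_train)
score = accuracy(table, flat_dev)
print(score)
print(label_stream(table, "^cats/cat<n><pl>$ ^purr/purr<vblex><pres>$"))
```

In the toy data, "purr" is attached to "sleep" with <code>conj</code>, so after flattening it carries the label <code>root</code>. The form "dogs" does not occur in the training sentence, which is why the baseline gets it wrong.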