User:Deltamachine/proposal

== Contact information ==
<p>'''Name:''' Anna Kondratjeva</p>
<p>'''Location:''' Moscow, Russia</p>
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p>
<p>'''Phone number:''' +79250374221</p>
<p>'''Github:''' http://github.com/deltamachine</p>
<p>'''IRC:''' deltamachine</p>
<p>'''SourceForge:''' deltamachine</p>
<p>'''Timezone:''' UTC+3</p>

== Skills and experience ==
<p>'''Education:''' Bachelor's Degree in Fundamental and Computational Linguistics (2015 - expected 2019), National Research University «Higher School of Economics» (NRU HSE)</p>
<p>'''Main university courses:'''</p>
<ul>
<li>Programming (Python)</li>
<li>Computer Tools for Linguistic Research</li>
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li>
<li>Language Diversity and Typology</li>
<li>Introduction to Data Analysis</li>
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li>
</ul>
<p>'''Technical skills:''' Python (advanced), HTML, CSS, Flask, Django, SQLite (familiar)</p>
<p>'''Projects and experience:''' http://github.com/deltamachine</p>
<p>'''Languages:''' Russian (native), English, German</p>

== Why is it you are interested in machine translation? ==
I am truly interested in machine translation because it combines my two favourite fields of study: linguistics and programming. As a computational linguist, I would like to know how machine translation systems are built, how they work with language material and how the results of their work can be improved. So, on the one hand, I can learn a lot about the structures of different languages while working with a machine translation system like Apertium; on the other hand, I can significantly improve my coding skills, learn more about natural language processing and create something great and useful.

== Why is it that you are interested in Apertium? ==
There are three main reasons why I want to work with Apertium:
<p>1. Apertium works with many minority languages, which is great, because it is quite unusual for a machine translation system. There are many systems that can translate from English to German well enough, but very few that can translate, for example, from Kazakh to Tatar. Apertium is one of those few, and I believe it does a very important job.</p>
<p>2. Apertium does rule-based machine translation, which is also unusual. As a linguist, I am very curious to learn more about this approach, because rule-based translation requires working closely with language structure and a large amount of language data.</p>
<p>3. The Apertium community is very friendly, helpful, responsive and open to new members, which is very attractive.</p>

== Which of the published tasks are you interested in? What do you plan to do? ==
I would like to implement a shallow syntactic function labeller.

'''A brief concept:''' The shallow syntactic function labeller takes a string in Apertium stream format, parses it into a sequence of morphological tags and passes it to a classifier. The classifier is a sequence-to-sequence model trained on datasets created from parsed UD treebanks. The dataset for the encoder contains sequences of morphological tags, and the dataset for the decoder contains sequences of labels; in both cases one sequence corresponds to one sentence. The classifier analyzes the given sequence of morphological tags and outputs a sequence of labels, and the labeller applies these labels to the original string.
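As a sketch of the first step, here is one way a string in Apertium stream format could be parsed into a sequence of morphological tags. The function name and the dot-joined tag convention are illustrative choices of mine, not part of Apertium's tooling:

```python
import re

def tags_from_stream(line):
    """Extract the morphological tag sequence from a line in
    Apertium stream format, e.g. '^vino/vino<n><m><sg>$'.
    Each lexical unit contributes one dot-joined tag string."""
    units = re.findall(r'\^[^$]*\$', line)
    seq = []
    for unit in units:
        # take the first analysis after the surface form
        analyses = unit.strip('^$').split('/')
        first = analyses[1] if len(analyses) > 1 else analyses[0]
        tags = re.findall(r'<([^>]+)>', first)
        seq.append('.'.join(tags) if tags else 'unk')
    return seq

print(tags_from_stream('^la/el<det><def><f><sg>$ ^casa/casa<n><f><sg>$'))
# → ['det.def.f.sg', 'n.f.sg']
```

The resulting sequence is what would be fed to the encoder side of the classifier.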

== Reasons why Google and Apertium should sponsor it ==
Firstly, the shallow syntactic function labeller will help to improve the quality of Apertium's translations. Secondly, there are currently few projects that apply machine learning methods to rule-based machine translation and shallow syntactic function labelling, so my work will help to learn more about this approach.

== A description of how and who it will benefit in society ==
In many languages (especially ergative ones) it is very useful to know the syntactic function of a word in order to produce an adequate translation. So the shallow syntactic function labeller, as a part of the Apertium system, will help to improve the quality of translation for many language pairs.

== Work plan ==

=== Post application period ===
<ul>
<li>Getting familiar with Apertium and its tools, reading documentation</li>
<li>Setting up Linux and getting used to it</li>
<li>Learning more about machine learning, reading more research on sequence-to-sequence models</li>
<li>Learning more about UD treebanks</li>
</ul>

=== Community bonding period ===
<ul>
<li>Choosing the language pairs the shallow function labeller will work with</li>
<li>Choosing the most suitable Python ML library (the task can be done with Tensorflow, but we may need a less complex library with a simpler runtime)</li>
<li>Thinking about how to integrate the classifier into Apertium</li>
<li>Learning more about possible problems</li>
</ul>

=== Work period ===
==== Part 1, weeks 1-4: preparing the data (includes a lot of thinking) ====
<ul>
<li>'''Week 1:''' writing scripts for parsing UD treebanks</li>
<li>'''Week 2:''' writing scripts for parsing UD treebanks, creating datasets</li>
<li>'''Week 3:''' creating datasets (in a few possible variants)</li>
<li>'''Week 4:''' writing a script for parsing a string in Apertium stream format into a sequence of morphological tags</li>
<li>'''Deliverable #1, June 26 - 30'''</li>
</ul>

==== Part 2, weeks 5-8: building the classifier ====
<ul>
<li>'''Week 5:''' building the model</li>
<li>'''Week 6:''' training the classifier, evaluating the quality of the prototype</li>
<li>'''Week 7:''' further training, working on improvements to the model</li>
<li>'''Week 8:''' final testing</li>
<li>'''Deliverable #2, July 24 - 28'''</li>
</ul>

==== Part 3, weeks 9-12: integrating the labeller into Apertium ====
<ul>
<li>'''Week 9:''' writing a script which applies labels to the original string in Apertium stream format, putting all parts of the labeller together</li>
<li>'''Week 10:''' integrating the labeller into Apertium</li>
<li>'''Week 11:''' integrating the labeller into Apertium, testing, fixing bugs</li>
<li>'''Week 12:''' cleaning up the code, writing documentation</li>
<li>'''Project completed:''' a prototype shallow syntactic function labeller which is able to label sentences well enough and works with several languages</li>
</ul>
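The dataset-creation work in weeks 2-3 could, for instance, integer-encode the tag and label sequences before they are fed to the encoder and decoder. A minimal sketch, assuming right-padded fixed-length sequences; all function names and the example tags are illustrative:

```python
def build_vocab(sequences):
    """Map every distinct symbol to an integer id (0 reserved for padding)."""
    vocab = {'<pad>': 0}
    for seq in sequences:
        for sym in seq:
            vocab.setdefault(sym, len(vocab))
    return vocab

def encode(sequences, vocab, max_len):
    """Integer-encode and right-pad each sequence to max_len."""
    return [[vocab[s] for s in seq] + [0] * (max_len - len(seq))
            for seq in sequences]

# toy sentences: one tag string per token, one label per token
tag_seqs = [['det.f.sg', 'n.f.sg'], ['prn.p3.sg', 'vblex.pres']]
tag_vocab = build_vocab(tag_seqs)
X = encode(tag_seqs, tag_vocab, max_len=3)
print(X)  # → [[1, 2, 0], [3, 4, 0]]
```

The label sequences for the decoder would be encoded the same way with their own vocabulary.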

== Non-Summer-of-Code plans you have for the Summer ==
I have exams at university until the third week of June, so until then I will only be able to work 20-25 hours per week. However, I will try to pass as many exams as possible ahead of schedule, in May, so this may change.
After that I will be able to work full time and spend 45-50 hours per week on the task.

== Coding challenge ==
<p>https://github.com/deltamachine/wannabe_hackerman</p>

<ul>
<li>''flatten_conllu.py:'' A script that takes a dependency treebank in UD format and "flattens" it, that is, applies the following transformations:</li>
<ul>
<li>Words with the @conj relation take the label of their head</li>
<li>Words with the @parataxis relation take the label of their head</li>
</ul>

<li>''calculate_accuracy_index.py:'' A script that does the following:</li>
<ul>
<li>Takes a -train.conllu file and calculates a table: surface_form - label - frequency</li>
<li>Takes a -dev.conllu file and assigns each token the most frequent label from the table</li>
<li>Calculates the accuracy index</li>
</ul>

<li>''label_asf:'' A script that takes a sentence in Apertium stream format and for each surface form applies the most frequent label from the labelled corpus.
</li>
</ul>
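The baseline that ''calculate_accuracy_index.py'' implements can be sketched as follows: build a surface_form → most-frequent-label table from the training tokens, apply it to the dev tokens and compute accuracy. The data structures and the fallback label here are illustrative assumptions, not the script's actual code:

```python
from collections import Counter, defaultdict

def most_frequent_labels(train):
    """train: list of (surface_form, label) pairs from a -train.conllu file.
    Returns the most frequent label for each surface form."""
    counts = defaultdict(Counter)
    for form, label in train:
        counts[form][label] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}

def accuracy(dev, table, fallback='@x'):
    """Assign each dev token the most frequent label and score against gold."""
    hits = sum(table.get(form, fallback) == gold for form, gold in dev)
    return hits / len(dev)

train = [('la', '@det'), ('la', '@det'), ('la', '@obj'), ('casa', '@nsubj')]
dev = [('la', '@det'), ('casa', '@nsubj'), ('perro', '@nsubj')]
table = most_frequent_labels(train)
print(round(accuracy(dev, table), 2))  # → 0.67
```

This accuracy index gives the baseline that the sequence-to-sequence classifier has to beat.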

Latest revision as of 09:56, 23 March 2018