User:Deltamachine/proposal2017

Contact information

Name: Anna Kondrateva

Location: Moscow, Russia

E-mail: an-an-kondratjeva@yandex.ru

Phone number: +79250374221

IRC: deltamachine

SourceForge: deltamachine

Timezone: UTC+3

Skills and experience

I am a second-year bachelor's student at the Faculty of Linguistics of the National Research University «Higher School of Economics» (NRU HSE).

Main university courses:

  • Programming (Python)
  • Computer Tools for Linguistic Research
  • Theory of Language (Phonetics, Morphology, Syntax, Semantics)
  • Language Diversity and Typology
  • Introduction to Data Analysis
  • Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)

Technical skills: Python (advanced), HTML, CSS, Flask, Django, SQLite (familiar)

Projects and experience: http://github.com/deltamachine

Languages: Russian (native), English, German

Why is it you are interested in machine translation?

I am deeply interested in machine translation because it combines my two favourite fields of study: linguistics and programming. As a computational linguist, I would like to know how machine translation systems are built, how they work with language material and how we can improve their results. On the one hand, working with a machine translation system like Apertium will teach me a lot about the structures of different languages. On the other hand, I can significantly improve my coding skills, learn more about natural language processing and create something great and useful.

Why is it that you are interested in Apertium?

There are three main reasons why I want to work with Apertium:

1. Apertium works with a lot of minority languages, which is quite unusual for a machine translation system. There are many systems that translate from English to German well enough, but very few that can translate, for example, from Kazakh to Tatar. Apertium is one of them, and I believe its community does a very important job.

2. Apertium does rule-based machine translation, which is unusual too. As a linguist, I am curious to learn more about this approach, because rule-based translation requires close work with language structure and a large amount of language data.

3. The Apertium community is very friendly, helpful, responsive and open to new members, which is very attractive.

Which of the published tasks are you interested in? What do you plan to do?

I would like to implement a shallow syntactic function labeller.

The first idea was to take an annotated corpus (a dependency treebank in UD format) and build a "surface form - label - frequency" table, then take a test corpus, assign each token its most frequent label from the table and calculate the accuracy score. All materials and scripts with descriptions are available in the "Coding challenge" section.

This approach gives acceptable results (for example, the accuracy score was 0.8 for Russian, 0.68 for English, 0.75 for Spanish and Finnish, and 0.76 for Basque), but higher scores are definitely reachable.

So, the next idea is to use machine learning methods to create a better prototype of the shallow syntactic function labeller.
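The most-frequent-label baseline described above can be sketched in a few lines. This is a minimal illustration and not the actual coding challenge script: the function names, the toy data and the `dep` fallback for unseen surface forms are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_table(train_pairs):
    """Count how often each surface form occurs with each label."""
    table = defaultdict(Counter)
    for form, label in train_pairs:
        table[form][label] += 1
    return table


def most_frequent_label(table, form, fallback="dep"):
    """Return the most frequent label of a form.

    The fallback for unseen forms is an assumption of this sketch.
    """
    if form in table:
        return table[form].most_common(1)[0][0]
    return fallback


def accuracy(table, test_pairs):
    """Share of test tokens whose predicted label equals the gold label."""
    correct = sum(
        most_frequent_label(table, form) == gold for form, gold in test_pairs
    )
    return correct / len(test_pairs)


# Toy data, invented for illustration only.
train = [("dog", "nsubj"), ("dog", "nsubj"), ("dog", "obj"), ("barks", "root")]
test = [("dog", "nsubj"), ("dog", "obj"), ("barks", "root"), ("cat", "nsubj")]

table = build_table(train)
print(accuracy(table, test))
```

On the toy data, two of the four test tokens receive the right label: the second "dog" is an object but gets the more frequent `nsubj`, and "cat" never occurs in the training data. Ambiguous and unseen forms are the two main error sources of such a baseline.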

A brief concept: The shallow syntactic function labeller takes a string in Apertium stream format, parses it into a sequence of morphological tags and passes it to a classifier. The classifier is a sequence-to-sequence model trained on prepared datasets made from parsed, syntax-labelled corpora (for instance, UD treebanks).

The encoder dataset contains sequences of morphological tags and the decoder dataset contains sequences of labels; in both cases one sequence corresponds to one sentence. UD tags in the datasets are replaced with suitable tags from the Apertium tagset. Here is an example of this transformation:

UD      Apertium
NOUN    n
AUX     vaux
INTJ    ij
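The tag replacement could be implemented as a simple lookup table. The sketch below is only an illustration: it contains just the three pairs listed above, and the names `UD_TO_APERTIUM` and `convert_tags`, as well as the choice to keep unmapped tags unchanged and report them, are assumptions.

```python
# Only the three pairs from the table above come from the proposal;
# a complete mapping would need an entry for every UD tag.
UD_TO_APERTIUM = {
    "NOUN": "n",
    "AUX": "vaux",
    "INTJ": "ij",
}


def convert_tags(ud_tags, mapping=UD_TO_APERTIUM):
    """Replace UD tags with Apertium tags and collect tags without a mapping."""
    converted, unmapped = [], []
    for tag in ud_tags:
        if tag in mapping:
            converted.append(mapping[tag])
        else:
            # Assumption: unknown tags are kept as they are and reported,
            # so that gaps in the mapping are easy to spot.
            converted.append(tag)
            unmapped.append(tag)
    return converted, unmapped


print(convert_tags(["NOUN", "AUX", "X"]))
```

Returning the list of unmapped tags makes discrepancies between the two tagsets visible early, before any dataset is built.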

A separate model will be created for each language. Moreover, the models will be trained only on sequences of morphological tags and not, for example, on tokens plus tags. This should keep the models small, so they will not slow down the entire pipeline.

The classifier analyzes the given sequence of morphological tags and outputs a sequence of labels, which the labeller then applies to the original string. The result could look like this:
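As an illustration of how the encoder and decoder datasets could be extracted from a UD treebank, here is a minimal sketch. It relies on the standard ten-column CoNLL-U layout; the function name `read_sequences`, the sample sentence and the way the part-of-speech tag and the features are joined into one string are assumptions, and the conversion to the Apertium tagset is left out.

```python
def read_sequences(conllu_text):
    """Return (tag_sequences, label_sequences), one list per sentence."""
    tag_seqs, label_seqs = [], []
    tags, labels = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tags:
                tag_seqs.append(tags)
                label_seqs.append(labels)
                tags, labels = [], []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        upos, feats, deprel = cols[3], cols[5], cols[7]
        # Assumption: one "morphological tag" is UPOS plus the feature string.
        tags.append(upos if feats == "_" else upos + "|" + feats)
        labels.append(deprel)
    if tags:
        tag_seqs.append(tags)
        label_seqs.append(labels)
    return tag_seqs, label_seqs


# Invented sample sentence in CoNLL-U format.
SAMPLE = (
    "# text = The wolf came\n"
    "1\tThe\tthe\tDET\t_\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n"
    "2\twolf\twolf\tNOUN\t_\tNumber=Sing\t3\tnsubj\t_\t_\n"
    "3\tcame\tcome\tVERB\t_\tTense=Past\t0\troot\t_\t_\n"
    "\n"
)

encoder_data, decoder_data = read_sequences(SAMPLE)
print(encoder_data)
print(decoder_data)
```

Each encoder sequence has exactly one decoder label per tag, which is what allows the predicted labels to be written back to the tokens later.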

^vino<n><m><sg>$ => ^vino<n><m><sg><nsubj>$
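The two string-handling steps around the classifier (extracting the tags and writing the labels back) can be sketched as follows. This is a simplified illustration: the regular expressions ignore escaped characters and any text between lexical units, the function names are hypothetical, and the labels are given by hand instead of coming from a trained model.

```python
import re

UNIT = re.compile(r"\^(.*?)\$")   # one lexical unit: ^...$
TAG = re.compile(r"<([^<>]+)>")   # one tag: <...>


def extract_tags(stream):
    """Return one list of morphological tags per lexical unit."""
    return [TAG.findall(unit) for unit in UNIT.findall(stream)]


def apply_labels(stream, labels):
    """Append one label tag to each lexical unit, in order."""
    units = UNIT.findall(stream)
    if len(units) != len(labels):
        raise ValueError("expected exactly one label per lexical unit")
    label_iter = iter(labels)
    return UNIT.sub(
        lambda m: "^" + m.group(1) + "<" + next(label_iter) + ">$", stream
    )


stream = "^vino<n><m><sg>$"
print(extract_tags(stream))
print(apply_labels(stream, ["nsubj"]))
```

Because the label is appended as the last tag of the unit, the rest of the string stays untouched and later pipeline stages that do not use the label can simply ignore it.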

So, at the end of the work there will be:

  • The labeller itself, which parses the string, loads the model for the required language from a file, passes the sequence of tags to the model, gets a sequence of labels as output and applies these labels to the original string
  • Files with trained models, saved in a suitable format (for example, JSON)

The task can be done with TensorFlow, but we may need a library that is less complex and has a simpler runtime. The idea can also be implemented with Keras (which has a seq2seq add-on) using Theano as a backend. These libraries are not as massive as TensorFlow, which is commonly used for building sequence-to-sequence models, so the labeller should be comparably fast. Moreover, a Keras/Theano model can run on regular hardware.

Integration into the Apertium pipeline: as stated on the Google Summer of Code ideas page, the main task is only to create a tool that could be adapted and used for some language pairs after Google Summer of Code.

However, it seems possible to test this approach during the summer. We may adapt the labeller for the North Sámi - Norwegian Bokmål language pair, which already uses syntactic labelling, and then measure how well it works. In the North Sámi - Norwegian Bokmål pipeline, morphological disambiguation and syntactic labelling used to run as one CG module. Now they are split into two parts (mor.rlx.bin and syn.rlx.bin), so we can try to replace the syntactic labelling part with our machine-learned module and then test it.

Reasons why Google and Apertium should sponsor it

Adding the shallow function labeller to the approaches that are already present (HMM part-of-speech tagging, constraint grammar, pattern-based syntactic transfer) should help to handle some existing problems in translating between languages that are not closely related and belong to different types.

A few examples of such problems:

  • When you are working with an ergative language, it may be useful to know whether an absolutive is a subject or an object. Here are two examples from Basque - English:

    Basque: Otsoa<abs><nsubj> etorri da
    Current Apertium translation: The wolf he has come
    English: The wolf has come

    Basque: Ehiztariak<erg><nsubj> otsoa<abs><obj> harrapatu du
    Current Apertium translation: The hunter the wolf he has caught
    English: The hunter has caught the wolf // The wolf was caught by the hunter

    If we knew the syntactic function of the word in the absolutive case, we could change the word order in the translated English sentence and get a better translation.

  • There may be cases like the classical Russian example "Мать любит дочь", which could equally mean "Mother loves daughter" or "Daughter loves mother". Machine translation systems tend to prefer the first variant, but because of the comparably free word order in Russian the meaning actually depends on the syntactic functions of the words.
  • There are also many cases where the translation of a case is ambiguous, and knowing the syntactic function of the word could really help with disambiguation. The Russian dative can be translated with the English dative or with the English nominative, and the choice depends on the syntactic function of the word. For example, "дай мне<dat><iobj> ручку" should be translated as "give me the pen", but "что мне<dat><obl> делать" should be translated as "what should I do".

This means that shallow function labelling is a good way to reach better translation quality for "ergative - nominative", "synthetic - analytic" and "(comparably) free word order - strict word order" language pairs. In my opinion, a shallow syntactic function labeller trained on corpus data is a simpler and more effective way to label sentences than a rule-based approach, because writing a good enough list of rules for determining the syntactic function of a word seems almost impossible even for a single language.

Also I believe that the shallow function labelling stage can help to make the chunking stage of translation easier and more accurate.

A description of how and who it will benefit in society

Firstly, the shallow syntactic function labeller, as a part of the Apertium system, will help to improve the quality of translation for many language pairs.

Secondly, there are currently not many projects that use machine learning methods for shallow syntactic function labelling, so my work will contribute to learning more about this approach.

Work plan

Post application period

  • Getting more familiar with Apertium and its tools, reading documentation
  • Setting up Linux and getting used to it
  • Learning more about machine learning, looking for more research on sequence-to-sequence models
  • Learning more about UD/VISL treebanks and tagsets and the North Sámi syntax-labelled corpus

Community bonding period

  • Choosing the languages the shallow function labeller will work with. Currently I am thinking about Basque, English, Russian/Finnish and maybe Spanish, but this needs to be discussed. I will also create a module for the North Sámi → Norwegian Bokmål language pair, which already uses syntactic labelling, in order to evaluate the quality of the prototype.
  • Choosing the most suitable Python ML library
  • Thinking about how to integrate the classifier into the North Sámi → Norwegian Bokmål pipeline
  • Learning more about possible problems, especially about discrepancies between all the tagsets involved

Work period

    Part 1, weeks 1-4: preparing the data (includes a lot of thinking)

  • Week 1: writing a script for parsing UD-treebanks
  • Week 2: writing a script for parsing North Sámi syntax-labelled corpus
  • Week 3: comparing UD and Apertium tagsets, writing a script for replacing UD tags with suitable Apertium tags, writing scripts for handling other possible discrepancies between all needed tagsets
  • Week 4: creating datasets (in a few possible variants), writing a script for parsing a string in Apertium stream format into a sequence of morphological tags
  • Deliverable #1, June 26 - 30

    Part 2, weeks 5-8: building the classifier

  • Week 5: building the model
  • Week 6: training the classifier, evaluating the quality of the prototype
  • Week 7: further training, working on improvements of the model
  • Week 8: final testing, writing a script, which applies labels to the original string in Apertium stream format
  • Deliverable #2, July 24 - 28

    Part 3, weeks 9-12: testing the labeller on North Sámi → Norwegian Bokmål language pair

  • Week 9: collecting all parts of the labeller together, adding machine-learned module instead of the syntax labelling part of CG module
  • Week 10: adding machine-learned module instead of the syntax labelling part of CG module
  • Week 11: testing, fixing bugs
  • Week 12: cleaning up the code, writing documentation
  • Project completed: the prototype shallow syntactic function labeller, which is able to label sentences well enough and works with several languages.

I am also going to write short notes about the work process on my project page throughout the summer.

Non-Summer-of-Code plans you have for the Summer

I have exams at the university until the third week of June, so until then I will be able to work only 20-25 hours per week. I will try to take as many exams as possible in advance, in May, so this may change. After that I will be able to work full time and spend 45-50 hours per week on the task.

Coding challenge

https://github.com/deltamachine/wannabe_hackerman

  • flatten_conllu.py: A script that takes a dependency treebank in UD format and "flattens" it, that is, applies the following transformations:
    • Words with the @conj relation take the label of their head
    • Words with the @parataxis relation take the label of their head
  • calculate_accuracy_index.py: A script that does the following:
    • Takes -train.conllu file and calculates the table: surface_form - label - frequency
    • Takes -dev.conllu file and for each token assigns the most frequent label from the table
    • Calculates the accuracy index
  • label_asf.py: A script that takes a sentence in Apertium stream format and for each surface form applies the most frequent label from the labelled corpus.
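The flattening step listed above can be illustrated with a short sketch. It is not the code from the repository: the function name, the tuple representation of a sentence and the decision to keep climbing when the head itself carries a flattened relation are assumptions made for the example.

```python
FLATTENED = {"conj", "parataxis"}


def flatten(sentence):
    """Return the flattened labels of one sentence.

    sentence: list of (token_id, head_id, deprel) tuples.
    """
    by_id = {tid: (head, rel) for tid, head, rel in sentence}
    result = []
    for tid, head, rel in sentence:
        seen = {tid}
        # Climb towards the root while the current label is a flattened
        # relation; the "seen" set protects against malformed cycles.
        while rel in FLATTENED and head in by_id and head not in seen:
            seen.add(head)
            head, rel = by_id[head]
        result.append(rel)
    return result


# "Dogs and cats sleep" (invented example): "cats" is a conjunct of "Dogs".
sentence = [
    (1, 4, "nsubj"),  # Dogs
    (2, 3, "cc"),     # and
    (3, 1, "conj"),   # cats
    (4, 0, "root"),   # sleep
]
print(flatten(sentence))
```

In the example, "cats" inherits `nsubj` from "Dogs", so both conjuncts end up with the function they share in the sentence.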