Latest revision as of 11:49, 6 May 2018

Contact information

Name: Evgenii Glazunov

Location: Moscow, Russia

University: NRU HSE, Moscow (National Research University Higher School of Economics), 3rd-year student

E-mail: glaz.dikobraz@gmail.com

IRC: G_D

Timezone: UTC+3

Github: https://github.com/dkbrz

Am I good enough?

Education: Bachelor's Degree in Fundamental and Computational Linguistics (2015-2019) at NRU HSE

Courses:

  • Programming (Python, R, Flask, HTML, XML, Machine Learning)
  • Morphology, Syntax, Semantics, Typology/Language Diversity
  • Mathematics (Discrete Mathematics, Linear Algebra and Calculus, Probability Theory, Mathematical Statistics, Computability and Complexity, Logic, Graphs and Topology, Theory of Algorithms)
  • Latin, Latin in modern Linguistics, Ancient Literature

Languages: Russian (native), English (academic), French (A2-B1), Latin (a bit), German (A1)

Personal qualities: responsibility, punctuality, diligence, passion for programming, perseverance, resistance to stress

Why am I interested in machine translation? Why am I interested in Apertium?

The speed at which information circulates no longer allows us to spend time on human translation. I am truly interested in formal methods and models because, as I see it, they represent the way any language is constructed. Despite some exceptions, language is in general very logical, and the main problem is finding a proper systematic description. Apertium is a powerful platform for building impressive rule-based engines. I think rule-based translation is very promising if we provide enough data and effective analysis.

Which of the published tasks am I interested in? What do I plan to do?

I would like to work on Bilingual dictionary enrichment via graph completion (http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Bilingual_dictionary_enrichment_via_graph_completion).

The main idea is to take a graph representation of dictionaries and create tools that work on translation via the edges between words in this graph. Graphs are hard to work with because the computational complexity is high, but there are tools and libraries created specifically for these purposes that are effective. The developer's task is to apply these instruments to this specific type of dictionary information.

I have worked with NetworkX, as it is fully available on my current Windows machine, but I plan to move to graph-tool, which is much more efficient on large graphs.

List of main ideas:

  • Use classes to create the most appropriate type of information
  • Work with subgraphs (connectivity components) to reduce the complexity of calculations
  • Filtration algorithms to achieve the previous aim
  • Vectorization to increase the efficiency of all functions
  • Developing different metrics to assess translation quality
  • Evaluation of these metrics

Word object. The basic elements are the lemma, the language, and POS information. The representation and string format can be modified according to developer needs; here it looks like 'EN_first_adj', so the output of functions is easy to check:

    class Word:
        def __init__(self, lemma, lang, pos):
            self.lemma = lemma
            self.lang = lang
            self.pos = pos
        def __str__(self):
            return str(self.lang) + '_' + str(self.lemma) + '_' + str(self.pos)
        __repr__ = __str__
        def __eq__(self, other):
            return self.lemma == other.lemma and self.lang == other.lang and self.pos == other.pos
        def __hash__(self):
            return hash(str(self))

Filtration. Filtration is necessary to filter sets of words by their parameters (in most cases, POS and language).
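As a minimal, hypothetical sketch (the Word class is repeated here so the snippet runs on its own; filter_words and the sample data are illustrative, not part of any existing module), filtration can be a simple predicate over word attributes:

```python
class Word:
    """Minimal copy of the Word class described above."""
    def __init__(self, lemma, lang, pos):
        self.lemma, self.lang, self.pos = lemma, lang, pos
    def __repr__(self):
        return f"{self.lang}_{self.lemma}_{self.pos}"

def filter_words(words, lang=None, pos=None):
    """Keep only the words that match the given language and/or POS."""
    return [w for w in words
            if (lang is None or w.lang == lang)
            and (pos is None or w.pos == pos)]

words = [Word("first", "EN", "adj"),
         Word("premier", "FR", "adj"),
         Word("run", "EN", "vb")]
print(filter_words(words, lang="EN", pos="adj"))  # -> [EN_first_adj]
```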

Subgraphs. The full graph consists of many connectivity components, so during a search we only need to consider the relevant part of it. This really increases efficiency.
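A sketch of the idea with a plain adjacency dict (NetworkX provides the same operation as nx.connected_components; the toy graph below is invented for illustration):

```python
from collections import deque

def connected_component(adj, start):
    """BFS over an adjacency dict: returns the set of nodes reachable
    from `start`, i.e. its connectivity component."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# Toy bidix graph with two unrelated translation clusters
adj = {
    "EN_dog_n": ["FR_chien_n"],
    "FR_chien_n": ["EN_dog_n", "RU_собака_n"],
    "RU_собака_n": ["FR_chien_n"],
    "EN_run_vb": ["FR_courir_vb"],
    "FR_courir_vb": ["EN_run_vb"],
}
# Searching inside one component lets us ignore the rest of the graph:
print(connected_component(adj, "EN_dog_n"))
```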

Directed graphs

  • take LR-only and RL-only entries into account
  • avoid some cycles
  • use directed in-edges for the target language in the translation subgraph to mark it as a final state of a finite-state machine: once a translation is found (a simple path from the source word to a target-language word), we never leave that node

The last point is very important: it turned out that there is an endless-loop problem. It could be solved by subgraphing, but that is inefficient compared to the finite-state solution, for various reasons: n can be large, and logically the finite-state view seems more natural.
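A hedged sketch of this finite-state idea (node names encode the language as a prefix, matching the Word string format above; the graph and function are illustrative only): target-language nodes are treated as final states, so the search records them but never expands them, which is what prevents the loop.

```python
from collections import deque

def translations(adj, source, target_lang):
    """BFS that treats target-language nodes as final states:
    they are collected but never expanded, so the search cannot
    loop back out through a translation it has already found."""
    found = []
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node.split("_")[0] == target_lang:  # final state reached
            found.append(node)
            continue
        for neighbour in adj.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return found

adj = {
    "FR_chien_n": ["EN_dog_n"],
    "EN_dog_n": ["FR_chien_n", "RU_собака_n", "RU_пёс_n"],
    "RU_собака_n": ["FR_chien_n", "EN_dog_n"],  # never expanded
}
print(translations(adj, "FR_chien_n", "RU"))  # -> ['RU_собака_n', 'RU_пёс_n']
```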

Vectorization. Vectorizing functions and avoiding explicit loops really improves efficiency.

Metrics. This is possibly the most important part, as we need to evaluate the variants. The list of possible translations can be long, as can the paths that lead to these final nodes. So, to choose the best one, we need to find a formula (or, better, a set of formulae) and then pick the best variant. I have the following algorithm in mind:

  • take the full graph with one word-pair removed
  • run translation for this pair and collect the variants chosen by each formula
  • measure accuracy against the existing translations
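The steps above can be sketched in a toy, self-contained form (the two-step path search stands in for the real translation procedure, and the graph is invented): remove a known pair, then check whether the graph recovers it through a pivot language.

```python
def two_step_candidates(edges, src, target_lang):
    """Candidate translations of `src` reachable via one pivot node."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return [c
            for mid in adj.get(src, [])
            for c in adj.get(mid, [])
            if c.split("_")[0] == target_lang and c != src]

edges = [("EN_dog_n", "FR_chien_n"),
         ("FR_chien_n", "RU_собака_n"),
         ("EN_dog_n", "RU_собака_n")]

# Hold out the direct EN-RU pair and see whether the graph recovers it:
held_out = ("EN_dog_n", "RU_собака_n")
remaining = [e for e in edges if e != held_out]
print(two_step_candidates(remaining, "EN_dog_n", "RU"))  # -> ['RU_собака_n']
```

Running this over many held-out pairs would yield the accuracy figures needed to compare candidate formulae.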


After running this on different language pairs, we get plenty of data from which to choose one formula or a combination of them.

The result of this work will be a tool that checks dictionaries, finds new word-pairs to include in the bidix, and generates the corresponding dictionary insertions.
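As a rough sketch of that last step (simplified: a single POS tag per side, no LR/RL direction restrictions, and the function name is made up), an accepted candidate pair could be serialized as an Apertium bidix entry:

```python
def bidix_entry(left, right, pos):
    """Format a candidate word-pair as a bidix <e> element (simplified)."""
    return (f'<e><p><l>{left}<s n="{pos}"/></l>'
            f'<r>{right}<s n="{pos}"/></r></p></e>')

print(bidix_entry("canis", "собака", "n"))
# -> <e><p><l>canis<s n="n"/></l><r>собака<s n="n"/></r></p></e>
```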

See some examples of my ideas in my Python notebook: https://github.com/dkbrz/GSoC_2018/blob/master/Proposal/Coding%20challenge.ipynb

There is also a graph of released language pairs that shows possible ways of translating via other languages (figure: language_graph.png).

Proposal

Why should Google and Apertium sponsor it? How and whom will it benefit in society?

I think there is a lot of math in language, and the graph representation of dictionaries is an exciting idea because it adds a kind of cross-validation and an internal source of information. This information helps to fill some of the lacunae that appear while creating a dictionary, and it will improve the quality of translation as we manage to expand the bidix.

Graph representation is very promising because it models metalanguage knowledge. Knowing several languages, I know that it can be hard to recall some rare word, and it is sometimes easier to translate from French to English and only then to Russian, because I have forgotten the word-pair between Russian and French. This graph representation works just like my memory: we cannot recall what the L1 word is in L2. Hmm, we know L1-L3 and L3-L2. Oh, that's the link we need. Now we know the L1-L2 word-pair. So, as we work on natural language processing, let's use natural instruments and systems as well.

The main benefit of this project is reducing human labor by automating part of dictionary development.

  • Finding lacunae in an existing dictionary (which words are missing).
  • Dictionary enrichment based on an algorithm that offers variants and evaluates them.
  • A potential basis for creating new pairs.

Coding Challenge

Notebook with the current state of my coding challenge: https://github.com/dkbrz/GSoC_2018/blob/master/Coding%20challenge.ipynb

Week by week work plan

Post application period

1. Refreshing and obtaining more specific knowledge of graph theory (during my current course and from additional sources)

2. Thinking about statistical approaches that may be relevant to this particular task

3. Theoretical research on general algorithmic optimisation

Community bonding period

1. Discussing my considerations and ideas with mentors

2. Including relevant particularities and details

3. Correcting the work plan according to new ideas

First phase

Week 1: Collecting data, preprocessing

Week 2: Experiments on small datasets with existing evaluation pairs (comparing an existing bidix with one created artificially via the graph)

Week 3: Error analysis and improvement ideas

Week 4: Improving code, preliminary running on medium data, first phase results, correcting plans

Second phase

Week 5: Optimization work based on medium-data experience

Week 6: Evaluating and improving metrics (experiments), estimating optimization

Week 7: Running on big data

Week 8: Finding errors and possible optimizations, preliminary results

Third phase

Week 9: Stable version on existing pairs, preprocessing of in-work pairs, experiments

Week 10: Final version of the model, performing the actual dictionary enrichment

Week 11: Evaluating results, estimating how much better the dictionaries became

Week 12: Documentation, cleaning up the code

Final evaluation

Non-Summer-of-Code plans you have for the Summer

GSoC is the only project I have this summer. I have some exams at the end of June.