Difference between revisions of "User:GD/proposal"
Revision as of 09:47, 25 March 2018
Contents
- 1 Contact information
- 2 Am I good enough?
- 3 Why is it I am interested in machine translation? Why is it that I am interested in Apertium?
- 4 Which of the published tasks am I interested in? What do I plan to do?
- 5 Proposal
Contact information
Name: Evgenii Glazunov
Location: Moscow, Russia
University: NRU HSE, Moscow (National Research University Higher School of Economics), 3rd-year student
E-mail: glaz.dikobraz@gmail.com
IRC: G_D
Timezone: UTC+3
Github: https://github.com/dkbrz
Am I good enough?
Education: Bachelor's Degree in Fundamental and Computational Linguistics (2015-2019) at NRU HSE
Courses:
- Programming (Python, R, Flask, HTML, XML, Machine Learning)
- Morphology, Syntax, Semantics, Typology/Language Diversity
- Mathematics (Discrete Mathematics, Linear Algebra and Calculus, Probability Theory, Mathematical Statistics, Computability and Complexity, Logic, Graphs and Topology, Theory of Algorithms)
- Latin, Latin in modern Linguistics, Ancient Literature
Languages: Russian (native), English (academic), French (A2-B1), Latin (a bit), German (A1)
Personal qualities: responsibility, punctuality, diligence, passion for programming, perseverance, resistance to stress
Why is it I am interested in machine translation? Why is it that I am interested in Apertium?
The speed at which information circulates leaves no time for human translation. I am truly interested in formal methods and models because they represent the way any language is constructed (as I see it). Despite some exceptions, language is in general very logical, and the main problem is finding a proper systematic description. Apertium is a powerful platform that allows one to build impressive rule-based engines. I think rule-based translation is very promising if we provide enough data and effective analysis.
Which of the published tasks am I interested in? What do I plan to do?
I want to work on graph dictionaries.
Proposal
Why should Google and Apertium sponsor it? How and whom will it benefit in society?
I think there is a lot of math in language, and the graph representation of dictionaries is an exciting idea because it adds a kind of cross-validation and a source of information internal to the system. This information helps to fill the lacunae that appear while creating a dictionary, which will improve the quality of translation as we expand the bidix.
Graph representation is very promising because it reflects a philosophical model of metalinguistic knowledge. Knowing several languages, I find that it can be hard to recall a rare word, and it is sometimes easier to translate from French to English and only then to Russian, because I have forgotten the direct word pair between Russian and French. The graph works just like my memory: we cannot recall what a word from L1 is in L2, but we do know the L1-L3 and L3-L2 pairs. That is the link we need, and now we know the L1-L2 word pair. So, as we work on natural language processing, let's use natural instruments and systems as well.
The main benefit of this project is reducing human labor by automating part of dictionary development:
- Finding lacunae in an existing dictionary (which words are missing).
- Enriching dictionaries with an algorithm that proposes candidate translations and evaluates them.
- Providing a potential base for creating new language pairs.
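The L1-L3, L3-L2 idea described above can be illustrated with a minimal Python sketch. It is only an illustration under assumptions, not the project's implementation: the function names `build_graph` and `pivot_candidates`, the `(language, word)` node format, the toy word pairs and the "count the pivots" score are all hypothetical choices made for this example.

```python
from collections import defaultdict


def build_graph(pairs):
    """Build an undirected graph; every node is a (language, word) tuple."""
    graph = defaultdict(set)
    for left, right in pairs:
        graph[left].add(right)
        graph[right].add(left)
    return dict(graph)


def pivot_candidates(graph, source, target_lang):
    """Propose target-language words reachable from `source` through one
    intermediate (pivot) word, scored by the number of supporting pivots.
    Words already linked to `source` directly are skipped."""
    neighbours = graph.get(source, set())
    direct = {node for node in neighbours if node[0] == target_lang}
    scores = defaultdict(int)
    for pivot in neighbours:
        if pivot[0] == target_lang:
            continue  # a direct translation, not a pivot
        for node in graph.get(pivot, set()):
            if node[0] == target_lang and node not in direct:
                scores[node] += 1
    # Best-supported candidates first, ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy data (assumed): there is no direct French-Russian pair for "chat".
pairs = [
    (("fra", "chat"), ("eng", "cat")),
    (("eng", "cat"), ("rus", "кошка")),
    (("fra", "chat"), ("spa", "gato")),
    (("spa", "gato"), ("rus", "кошка")),
    (("spa", "gato"), ("rus", "кот")),
]
graph = build_graph(pairs)
candidates = pivot_candidates(graph, ("fra", "chat"), "rus")
```

Here "кошка" is supported by two pivots (English and Spanish) and "кот" by one, so the first is ranked higher. A real scoring function would probably also need to account for part of speech and polysemy, which this sketch ignores.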
Coding Challenge
ipynb with current state of my coding challenge
Week by week work plan
Post application period
1. Refreshing and obtaining more specific knowledge of graph theory (during my current course and from extra sources)
2. Thinking about statistical approach that can be relevant for this particular task
3. Theoretical research on general algorithmic optimisation
Community bonding period
1. Discussing my considerations and ideas with mentors
2. Including relevant particularities and details
3. Correcting the work plan according to new ideas
First phase
Week 1: Collecting data, preprocessing
Week 2: Experiments on small datasets with existing evaluation pairs (compare existing bidix with artificially created via graph)
Week 3: Error analysis and improvement ideas
Week 4: Improving code, preliminary running on medium data, first phase results, correcting plans
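The comparison planned for Week 2 (an existing bidix against one created artificially via the graph) could be measured roughly as in the sketch below. Precision and recall are an assumption here, since the plan does not fix a metric; the function name `precision_recall` and the toy French-Russian pairs are hypothetical as well.

```python
def precision_recall(predicted, gold):
    """Compare graph-generated word pairs with the pairs of an existing bidix."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data (assumed): pairs from a held-out bidix vs. pairs proposed by the graph.
gold_pairs = {("chat", "кошка"), ("chien", "собака"), ("maison", "дом")}
predicted_pairs = {
    ("chat", "кошка"),
    ("chien", "собака"),
    ("chat", "кот"),
    ("livre", "книга"),
}
precision, recall = precision_recall(predicted_pairs, gold_pairs)
```

With this toy data two of the four proposed pairs are in the held-out bidix, and two of its three pairs are recovered. Note that a pair such as ("chat", "кот") may be a valid translation that is simply missing from the bidix, so low precision on its own does not prove the graph is wrong; this is part of what the error analysis in Week 3 would have to check.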
Second phase
Week 5: Optimization work based on medium-data experience
Week 6: Evaluating and improving metrics (experiments), estimate optimization
Week 7: Running on big data
Week 8: Finding errors and possible optimization, pre-results
Third phase
Week 9: Stable version on existing pairs, preprocessing of in-work pairs, experiments
Week 10: Final version of model, do the actual dictionary enrichment
Week 11: Evaluate results, estimate how much better dictionaries became
Week 12: Documentation, cleaning up the code
Final evaluation
Non-Summer-of-Code plans you have for the Summer
GSoC is the only project I have this summer. I have some exams at the end of June.