User:GD/proposal
Revision as of 09:05, 26 March 2017
Contents
- 1 Contact information
- 2 Am I good enough?
- 3 Why is it I am interested in machine translation? Why is it that I am interested in Apertium?
- 4 Which of the published tasks am I interested in? What do I plan to do?
- 5 Proposal
- 6 Why Google and Apertium should sponsor it? How and who it will benefit in society?
- 7 Coding Challenge
- 8 Week by week work plan
- 9 Non-Summer-of-Code plans you have for the Summer
Contact information
Name: Irina Glazunova
Location: Moscow, Russia
University: NRU HSE, Moscow (National Research University Higher School of Economics)
E-mail: glaz.dikobraz@gmail.com
IRC: G_D
Timezone: UTC+3
Am I good enough?
Education: Bachelor's Degree in Fundamental and Computational Linguistics (2015-2019) at NRU HSE
Courses:
- Programming (Python, Flask, HTML)
- Morphology, Syntax, Semantics, Typology/Language Diversity
- Mathematics (Discrete Mathematics, Linear Algebra and Calculus, Probability Theory, Mathematical Statistics, Computability and Complexity)
- Latin, Latin in modern Linguistics, Ancient Literature
Languages: Russian (native), English (Academic), French, Latin
Personal qualities: responsibility, punctuality, being hard-working, passion for Latin and programming, perseverance, resistance to stress
Why is it I am interested in machine translation? Why is it that I am interested in Apertium?
Information now circulates too fast for human translation to keep up. I am truly interested in formal methods and models because, as I see it, they reflect the way any language is constructed. Despite some exceptions, language in general is very logical, and the main problem is finding a proper systematic description. Apertium is a powerful platform for building impressive rule-based engines. Languages like Latin are well-ordered, particularly in their morphology, which makes rule-based translation very promising.
Which of the published tasks am I interested in? What do I plan to do?
I would like to add a Latin-Russian language pair. I plan to do my best to achieve good results; more details are given in the Proposal section.
Proposal
Latin-Russian language pair
Why Google and Apertium should sponsor it? How and who it will benefit in society?
Latin is a language of great importance, and the study of Latin has a centuries-old history in Russia. Moreover, Russian is spoken in many countries, so a much larger audience will benefit from this project. Many Russian universities teach Latin (in faculties of Linguistics, Philology, History, Law and Medicine). Consequently, there is a need for translation, not to mention the great heritage of ancient writers, poets and philosophers such as Cicero and Catullus. Today only a couple of platforms offer a Latin-Russian pair, and they still have a lot of work to do, so creating this pair is very promising. It is all the more promising because the two languages have a lot in common (morphological system, syntactic role marking).
Coding Challenge
I'm working on it now
This section contains information that clarifies the briefer points of the work plan below.
Current progress:
1. VirtualBox
2. Apertium VirtualBox
3. Core tools
4. Latin and Russian packages
5. Created a folder for a Lat-Rus project
6. Comments on dictionary work:
The existing dictionaries are good, especially the Russian one, but the Latin one is far from complete. There is much work to do.
- The noun declension system isn't complete.
- There is no declension pattern for comparative and superlative adjectives. There should be a distinction between adjectives that have these forms and those that do not.
- At the moment I do not agree with the "long vowels" in noun paradigms. I think that if they are given in the input, they can be processed with acx similarity rules. It is also a bit strange, because there are no long vowels in the verb paradigm.
- There is only one type of verb paradigm: the first one, for verbs with -are infinitives such as amo|amare ('to love'). But there are many other verbs of types II, III (a, b) and IV.
- Furthermore, there are some other types (verba deponentia, semideponentia etc.) with non-standard paradigms.
- There are almost no irregular verbs, although they are the most frequent ones ('can/be able', 'to carry', 'to want', 'to prefer', 'not to want', 'to happen'); only 'to be' and 'to go' are present.
7. Added some words from the Mary and James story that fit the existing paradigms
8. Extended the acx file to process long-vowel forms and to account for U|V and J|I unity in non-standardized texts
9. Compiled the automorf and autogen files and tried them with echo on pre-existing and newly added words. It works.
10. If we process the Mary and James story with the Latin automorf, we see that lots of common words are not in the dictionary (this confirms my first impression from looking through the dictionary). The Russian automorf is good and can process this sample story.
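The coverage check in item 10 can be sketched in Python. This is only a minimal sketch, assuming the analyser output is in the Apertium stream format, where each unit looks like ^surface/analysis$ and an unknown word is marked with an asterisk. The function name coverage and the sample string (including its tags) are made up for illustration and are not real output of the Latin automorf.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (known, total, unknown_forms) for a string of analyser output."""
    known = 0
    total = 0
    unknown_forms = []
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An unknown word is echoed back with a leading asterisk.
        if analyses.startswith("*"):
            unknown_forms.append(surface)
        else:
            known += 1
    return known, total, unknown_forms


# Hypothetical sample, written by hand for this sketch.
sample = (
    "^Maria/Maria<np><ant><f><sg><nom>$ ^et/et<cnjcoo>$ "
    "^Iacobus/*Iacobus$ ^ambulant/*ambulant$"
)
known, total, unknown_forms = coverage(sample)
print(f"{known}/{total} tokens analysed; unknown: {unknown_forms}")
```

The list of unknown forms is probably the more useful result here, since it shows which common words should be added to the dictionary first.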
Now I'm working on bidix and transfer rules.
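The U|V and J|I unity and the long-vowel forms from item 8 can also be illustrated outside the acx file. The Python sketch below is not the acx mechanism itself. It is a hypothetical preprocessing helper (normalise_latin is an assumed name) showing which spellings are meant to be treated as the same word.

```python
import unicodedata


def normalise_latin(word):
    """Fold spelling variants of a Latin word into one form (illustrative only)."""
    # Decompose so that a macron becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop all combining marks; this removes macrons, but also any other diacritics.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold J into I and V into U, in both cases.
    table = str.maketrans({"j": "i", "J": "I", "v": "u", "V": "U"})
    return stripped.translate(table)


# Variant spellings that should count as the same word.
print(normalise_latin("puellā") == normalise_latin("puella"))
print(normalise_latin("juvenis") == normalise_latin("iuuenis"))
```

Folding everything to I and U is just one possible choice. The opposite direction would work equally well as long as the dictionary follows the same convention.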
Week by week work plan
Week 0: until 05/29 : Preparation
- Get familiar with the Apertium system in detail (wiki sources, installation, creating files etc.)
- Get a corpus of texts for future testing and a frequency list, using both Wikipedia and classical Latin texts by Caesar, Cicero, Vergilius and others
- Plan every step and write everything down as formally as possible (in natural language)
- Discuss details with a mentor
- U\V and I\J problem
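The frequency list mentioned for Week 0 could be built roughly as follows. This is a minimal sketch, assuming the corpus is already available as plain text. The function name frequency_list and the sample sentence are chosen for illustration only, and a real list would be built from the full corpus.

```python
import re
from collections import Counter


def frequency_list(text, top=None):
    """Return (word, count) pairs, most frequent first."""
    # Lowercase the text and keep runs of letters only (no digits or punctuation).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens).most_common(top)


# Tiny hand-written sample standing in for a corpus.
sample_text = "Gallia est omnis divisa in partes tres. Gallia est."
for word, count in frequency_list(sample_text, top=3):
    print(word, count)
```

Such a list counts surface forms, not lemmas, so it is mainly useful for deciding which words to add to the dictionary first.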
First phase
Week 1: 05\30 – 06\05 : Dictionary: nouns & adjectives (they have the same declension patterns)
- Add nouns to dictionary (monodix and bidix)
- Describe morphology
- Add prepositions (they are closely related to nouns)
Week 2: 06\07 – 06\12 : Dictionary: verbs
- Add verbs to dictionary (monodix and bidix)
- Plan how to convert basic tenses
Week 3: 06\13 – 06\19 : Transfer rules
- Start writing transfer rules
- Write basic transfer rules related to morphological transfers
- Similar cases (case systems of these languages have a lot in common)
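The idea that the case systems have a lot in common can be sketched as a default mapping from Latin to Russian case tags. This is only an illustration in Python, not an Apertium transfer rule (real rules are written in the transfer rule files). The tag names, the mapping itself (in particular ablative to instrumental and vocative to nominative) and the name transfer_case are assumptions made for this sketch. Prepositional phrases and other contexts would need their own rules.

```python
import re

# Assumed default mapping of Latin case tags to Russian case tags.
CASE_MAP = {
    "nom": "nom",
    "gen": "gen",
    "dat": "dat",
    "acc": "acc",
    "abl": "ins",  # assumption: a bare ablative defaults to the instrumental
    "voc": "nom",  # assumption: the vocative defaults to the nominative
}

# Matches any of the case tags above, e.g. <abl>.
CASE_TAG = re.compile(r"<(" + "|".join(CASE_MAP) + r")>")


def transfer_case(analysis):
    """Replace the Latin case tag in an analysis string with its Russian default."""
    return CASE_TAG.sub(lambda m: "<" + CASE_MAP[m.group(1)] + ">", analysis)


# Hypothetical analyses, written by hand for this sketch.
print(transfer_case("gladius<n><m><sg><abl>"))
print(transfer_case("puella<n><f><sg><gen>"))
```

Four of the six cases map to themselves, which is why the basic morphological transfer is expected to be fairly regular.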
Week 4: 06\20 – 06\26 :
- Extend dictionary
- Add words from other classes to the dictionary (especially closed classes)
- Finish all work scheduled for this period
- Prepare for the first evaluation
- Prepare a detailed theoretical basis for the next phase
Comment: the first phase is meant to be mostly technical and to consist of general, routine work.
Results: dictionary data, basic rules, morphological system, first testing
Second phase
Week 5: 06\27 – 07\03 : Syntactic rules (word order)
Solve general word order problems
Week 6: 07\04 – 07\10 : Structures
Add basic structures such as accusativus cum infinitivo, ablativus absolutus etc.
Week 7: 07\11 – 07\17 : Extend dictionary
Add more words from open classes (monodix and bidix)
Week 8: 07\18 – 07\24 : Context based disambiguation
Comment: the second phase is meant to be the main part, which involves working on translation algorithms.
Results: extended dictionary data, syntactic rules, a beta version of the system ready to be used, beta testing
Third phase
Week 9: 07\25 – 07\31 : Syntactic rules 2
- Extend the number of syntactic rules
- Testing
Week 10: 08\01 – 08\07 : Testing
- Fixing issues that appear
- Extending data or rules (depending on previous results)
Week 11: 08\08 – 08\14 : Vacation
I will be able to do some work: I will have a laptop, but I may have some trouble with internet access.
Week 12: 08\15 – 08\21 : Final work on details
Put everything in order
Comment: improving the system as much as possible
Results: all rules written, final version of the system, testing, bugs fixed
Final evaluation
Non-Summer-of-Code plans you have for the Summer
GSoC is the only project I have this summer. I have a couple of exams in Week 4, so I planned a task that would be feasible at that time. I also planned a vacation in Week 11 and scheduled more work for July and the beginning of August, when I will be able to work more.