Difference between revisions of "User:Sokureo"
Revision as of 08:05, 27 March 2018
Contents
- 1 Contact information
- 2 Why is it that you are interested in Apertium?
- 3 What do you plan to do?
- 4 Reasons why Google and Apertium should sponsor it
- 5 A description of how and who it will benefit in society
- 6 Work plan
- 7 Skills and qualifications
- 8 Coding Challenge
- 9 Non-Summer-of-Code plans for the Summer
Contact information
Name: Elena Sokur
E-mail address: pelmenium.sokurium@gmail.com
IRC: sokureo
Location: Moscow, Russia
Timezone: UTC+3
GitHub: https://github.com/Sokureo
Why is it that you are interested in Apertium?
Apertium is a free/open-source platform that adapts contemporary technologies to minority languages. As a linguist, I consider it my duty to help keep minority languages alive, and making them more accessible and better supported technically is a good way to do that. As a computational linguist, I would also like to apply my knowledge of linguistic theory to machine translation.
What do you plan to do?
I plan to develop the Udmurt and Komi-Zyrian language pair (udm-kpv), which is currently in the incubator.
Reasons why Google and Apertium should sponsor it
Komi-Zyrian and Udmurt are very closely related languages that share many features of grammar and syntax. This makes the project realistic, yet no one has built a translator for this pair so far (not only in Apertium). Currently the udm-kpv language pair is in the incubator: there are very few words in the bilingual dictionary and no transfer rules. Monolingual dictionaries exist, but they do not have 100% coverage. I am going to fill in the dictionaries, write transfer rules and make the translator usable in production.
A description of how and who it will benefit in society
As a result, we will have a free, open-source, high-quality Udmurt <--> Komi-Zyrian translator which can be used for reading texts or for communication.
Work plan
Materials I have
- Udm-Rus and Komi-Rus dictionaries
- Udm-Komi parallel texts
- Udmurt and Komi grammars
Resources
- Udmurt corpus: http://web-corpora.net/UdmurtCorpus/search/?interface_language=ru
- Komi corpus: http://komicorpora.ru/
Current state
- apertium-udm: coverage ~77%
- apertium-kpv: coverage ~90%
- Project repository: https://github.com/apertium/apertium-udm-kpv
Post application period
- Getting better acquainted with Apertium and machine translation
- Continue working on the story translation
- Improving knowledge of bash
Community bonding period
- Looking for other possible resources and literature
Work period
Week | Dates | Actions
1 | 14.05 - 20.05 | filling in Udmurt dictionary
2 | 21.05 - 27.05 | working on disambiguation
3 | 28.05 - 03.06 | filling in Komi dictionary
4 | 04.06 - 10.06 | working on disambiguation
midterm evaluation
5 | 11.06 - 17.06 | aligning parallel texts
6-7 | 18.06 - 25.06 | writing scripts for getting translations
8 | 26.06 - 08.07 | filling in bilingual dictionary
midterm evaluation
9-10 | 09.07 - 22.07 | writing transfer rules
11 | 23.07 - 29.07 | testing the translator
12 | 30.07 - 05.08 | cleaning up, writing documentation, releasing
Project completed
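The "writing scripts for getting translations" step is not specified further in the plan. One possible approach, given the Udm-Rus and Komi-Rus dictionaries listed under materials, is to pivot through Russian: pair an Udmurt lemma with a Komi lemma whenever they share a Russian translation, then review the candidates by hand. The sketch below is only an illustration under that assumption. The function names `pivot_candidates` and `bidix_entry`, the in-memory dictionary layout, the placeholder words, and the choice of Udmurt as the left side are all hypothetical; only the `<e><p><l>…</l><r>…</r></p></e>` entry shape is the standard Apertium bilingual dictionary format.

```python
from xml.sax.saxutils import escape


def pivot_candidates(udm_rus, kpv_rus):
    """Pair Udmurt and Komi lemmas that share at least one Russian translation.

    Both arguments map a lemma to a set of Russian translations
    (hypothetical layout; real dictionaries would need parsing first).
    """
    # Invert the Komi dictionary: Russian word -> set of Komi lemmas.
    rus_kpv = {}
    for kpv, translations in kpv_rus.items():
        for rus in translations:
            rus_kpv.setdefault(rus, set()).add(kpv)

    pairs = set()
    for udm, translations in udm_rus.items():
        for rus in translations:
            for kpv in rus_kpv.get(rus, ()):
                pairs.add((udm, kpv))
    return sorted(pairs)


def bidix_entry(left, right, pos):
    """Format one candidate pair as a bilingual dictionary entry.

    Simplification: multiword lemmas (which need <b/> for spaces)
    are not handled here.
    """
    return (
        f'<e><p><l>{escape(left)}<s n="{pos}"/></l>'
        f'<r>{escape(right)}<s n="{pos}"/></r></p></e>'
    )


if __name__ == "__main__":
    # Placeholder data, not real dictionary entries.
    udm_rus = {"udm_word": {"rus_a", "rus_b"}}
    kpv_rus = {"kpv_word": {"rus_b"}}
    for udm, kpv in pivot_candidates(udm_rus, kpv_rus):
        print(bidix_entry(udm, kpv, "n"))
```

Pivoting over-generates when the Russian word is polysemous, so the output is best treated as a list of candidates to check against the parallel texts rather than as finished entries.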
Skills and qualifications
I am a 3rd-year student in the Bachelor's programme "Fundamental and Computational Linguistics" at the National Research University Higher School of Economics (NRU HSE), Russia.
Main university courses:
- Programming (Python, R)
- Computer Tools for Linguistic Research
- Theory of Language (Phonetics, Morphology, Syntax, Semantics, Discourse)
- Machine Learning
- Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)
- Theory of Algorithms
Technical skills:
- Programming languages: Python, R
- Web-design: HTML, CSS, Bootstrap
- Frameworks: Flask, Django
- Databases: SQLite, MySQL
Field work experience:
2 years of expeditions in:
- Beserman dialect of Udmurt
- Hill Mari
Languages: Russian (native), English (advanced), German (intermediate), Udmurt (reading), Hill Mari and Komi-Zyrian (basic knowledge of grammar).
Coding Challenge
1. Installed Apertium tools.
2. Installed the kpv-udm language pair following these instructions: http://wiki.apertium.org/wiki/Udmurt_and_Komi. Morphological analyzers for Komi-Zyrian and Udmurt exist.
3. Found kpv-udm parallel texts (only one is aligned).
4. Added some words from the aligned text to the Bilingual dictionary and translated one sentence.
5. Estimated the coverage of Udmurt and Komi wikis: 77% for Udmurt and 90% for Komi.
6. Counted non-disambiguated words: 30% in Udmurt and 62% in Komi.
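The figures in steps 5 and 6 can be computed from morphological analyser output. The sketch below is not the script actually used for the numbers above; it assumes the text has already been run through an analyser and is in the Apertium stream format, where each token looks like `^surface/analysis1/analysis2$` and an unknown word has a single analysis starting with `*`. The name `analysis_stats` is hypothetical, and "ambiguity rate" is taken here to mean the share of all tokens with more than one analysis, which may differ from how the percentages above were counted.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/...$
# Simplification: escaped characters such as \^ or \/ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def analysis_stats(analyser_output):
    """Return (coverage, ambiguity_rate) for a string of analyser output."""
    total = known = ambiguous = 0
    for match in LEXICAL_UNIT.finditer(analyser_output):
        surface, *analyses = match.group(1).split("/")
        total += 1
        # Unknown words come back as a single analysis starting with '*'.
        if analyses and not analyses[0].startswith("*"):
            known += 1
            if len(analyses) > 1:
                ambiguous += 1
    if total == 0:
        return 0.0, 0.0
    return known / total, ambiguous / total


if __name__ == "__main__":
    # Made-up tokens and tags, for illustration only.
    sample = "^a/a<n>/a<v>$ ^b/*b$ ^c/c<cnjcoo>$ ^d/d<n>$"
    coverage, ambiguity = analysis_stats(sample)
    print(f"coverage: {coverage:.0%}, ambiguous: {ambiguity:.0%}")
```

Tracking both numbers over time gives a simple way to see whether the dictionary and disambiguation weeks in the work plan are having an effect.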
Non-Summer-of-Code plans for the Summer
From 17th June to 2nd July I will be on the Khanty expedition, but there is an Internet connection there, so I will be able to keep working and stay in touch.