Difference between revisions of "User:Sokureo"

Latest revision as of 16:09, 27 March 2018

1 Contact information
2 Why is it that you are interested in Apertium and machine translation?
3 What do you plan to do?
4 Reasons why Google and Apertium should sponsor it
5 A description of how and who it will benefit in society
6 Work plan
7 Skills and qualifications
8 Coding Challenge
9 Non-Summer-of-Code plans for the Summer

Contact information

Name: Elena Sokur

E-mail address: pelmenium.sokurium@gmail.com

IRC: sokureo

Location: Moscow, Russia

Timezone: UTC+3

GitHub: https://github.com/Sokureo

Why is it that you are interested in Apertium and machine translation?

Apertium is a free/open-source platform that brings contemporary technology to minority languages. As a linguist, I consider it my duty to help keep minority languages alive, and making them more accessible and more widely used through technology is a good way to do that. As a computational linguist, I would like to apply my knowledge of linguistic theory to machine translation.
My first project was part-of-speech tagging for the Hittite language. I used machine learning for it, and the results were not very useful: machine learning works well when tons of data are available, which is not the case for dead or minority languages. That is why I think the rule-based machine translation that Apertium uses is a more reliable approach here, and why I am confident it can succeed.
I have also been studying the Beserman dialect of Udmurt in the village of Shamardan for the last 2 years, which is where my interest in and affection for this language come from. The Beserman dialect differs from literary Udmurt in vocabulary and morphological markers, but I can understand most Udmurt texts when I practice reading.

What do you plan to do?

I plan to develop the Udmurt and Komi-Zyrian language pair (udm-kpv), which is currently in the incubator.

Reasons why Google and Apertium should sponsor it

Komi-Zyrian and Udmurt are closely related languages that share many features of grammar and syntax. This makes the project realistic, yet no one has built a translator for this pair so far (in Apertium or elsewhere). Currently the udm-kpv language pair is in the incubator: there are very few words in the bilingual dictionary and no transfer rules. Monolingual dictionaries exist, but they do not have full coverage. I am going to fill in the dictionaries, write transfer rules and make the translator usable in production.

A description of how and who it will benefit in society

As a result, we will have a free, open-source, good-quality Udmurt <--> Komi-Zyrian translator that can be used for reading texts or for communication.

Work plan

Materials I have

Udm-Rus and Komi-Rus dictionaries
Udm-Komi parallel texts (Fenno-Ugrica collection)
Udmurt and Komi grammars
Udmurt and Komi texts scraped from the Russian social network vk.com and other websites

Resources

Udmurt corpus: http://web-corpora.net/UdmurtCorpus/search/?interface_language=ru
Komi corpus: http://komicorpora.ru/
Fenno-Ugrica collection: https://fennougrica.kansalliskirjasto.fi/
Komi online library: http://komikyv.org/koi
scraped web-collections: http://web-corpora.net/wsgi3/minorlangs/download

Current state

apertium-udm: coverage ~77%
apertium-kpv: coverage ~90%
Project repository: https://github.com/apertium/apertium-udm-kpv

Post application period

Getting more familiar with Apertium and machine translation
Continue working on the story translation
Improving knowledge of bash

Community bonding period

Looking for other possible resources and literature

Possible problems

Parallel texts are a good source for a bilingual dictionary, but they are not aligned yet. How to solve: use an existing alignment program (it is not ideal, so some of the work will have to be done manually).
There are no direct translators or dictionaries between Udmurt and Komi, so another possible source for a bilingual dictionary is the Udm-Rus and Komi-Rus dictionaries, but they are stored in PDF format and have to be converted to text first. How to solve: use existing PDF readers (the result needs to be checked).
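The idea of using the Udm-Rus and Komi-Rus dictionaries as a source could be sketched as pivoting through Russian: an Udmurt and a Komi headword become a candidate pair when they share a Russian gloss. This is only a minimal sketch under stated assumptions. The function name `pivot_pairs`, the dictionary layout (headword mapped to a list of glosses) and the three toy entries are illustrative, not taken from the actual dictionaries, and every candidate would still need manual checking.

```python
from collections import defaultdict


def pivot_pairs(udm_rus, kpv_rus):
    """Pair Udmurt and Komi headwords that share at least one Russian gloss.

    Both arguments map a headword to a list of Russian glosses
    (a hypothetical layout chosen for this sketch).
    """
    # Invert the Komi dictionary: Russian gloss -> set of Komi headwords.
    rus_to_kpv = defaultdict(set)
    for kpv_word, glosses in kpv_rus.items():
        for gloss in glosses:
            rus_to_kpv[gloss].add(kpv_word)

    # Collect candidate pairs together with the glosses that link them.
    candidates = defaultdict(set)
    for udm_word, glosses in udm_rus.items():
        for gloss in glosses:
            for kpv_word in rus_to_kpv.get(gloss, ()):
                candidates[(udm_word, kpv_word)].add(gloss)
    return dict(candidates)


# Toy entries for illustration only; real entries come from the dictionaries.
udm_rus = {"ву": ["вода"], "ки": ["рука"], "син": ["глаз"]}
kpv_rus = {"ва": ["вода"], "ки": ["рука"]}

pairs = pivot_pairs(udm_rus, kpv_rus)
for (udm_word, kpv_word), glosses in sorted(pairs.items()):
    print(udm_word, kpv_word, sorted(glosses))
```

Keeping the linking glosses next to each pair makes the manual check easier: a pair supported by several shared glosses is more likely to be a true equivalent than one linked by a single polysemous Russian word.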

Work period

Week	Dates	Actions
1	14.05 - 20.05	filling in Udmurt dictionary
2	21.05 - 27.05	working on disambiguation
3	28.05 - 03.06	filling in Komi dictionary
4	04.06 - 10.06	working on disambiguation
midterm evaluation
5	11.06 - 17.06	aligning parallel texts
6-7	18.06 - 25.06	writing scripts for getting translations
8	26.06 - 08.07	filling in bilingual dictionary
midterm evaluation
9-10	09.07 - 22.07	writing transfer rules
11	23.07 - 29.07	testing the translator
12	30.07 - 05.08	cleaning up, writing documentation, releasing
Project completed
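For the "filling in bilingual dictionary" weeks, checked word pairs have to be turned into entries of the Apertium bilingual dictionary, which is an XML file built from `<e><p><l>…</l><r>…</r></p></e>` elements. The following is a small sketch of such a script. The helper name `bidix_entry`, the choice of `n` as the noun tag and the sample pair are assumptions for illustration, and the tag set actually used in apertium-udm-kpv may differ.

```python
import xml.etree.ElementTree as ET


def bidix_entry(left, right, tag):
    """Build one bilingual dictionary entry as an XML string.

    `left` and `right` are the lemmas of the two languages and `tag` is
    the part-of-speech symbol attached to both sides (assumed identical).
    """
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma in (("l", left), ("r", right)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        # <s n="..."/> carries the grammatical symbol for this side.
        ET.SubElement(node, "s", {"n": tag})
    return ET.tostring(entry, encoding="unicode")


# Illustrative pair only; real pairs come from the checked candidate list.
entry_xml = bidix_entry("ву", "ва", "n")
print(entry_xml)
```

Building the entry with an XML library rather than string formatting keeps special characters in lemmas correctly escaped.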

Skills and qualifications

I am a 3rd-year student in the Bachelor's programme "Fundamental and Computational Linguistics" at the National Research University Higher School of Economics (NRU HSE), Russia.

Main university courses:

Programming (Python, R)
Computer Tools for Linguistic Research
Theory of Language (Phonetics, Morphology, Syntax, Semantics, Discourse)
Machine Learning
Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)
Theory of Algorithms

Technical skills:

Programming languages: Python, R
Web-design: HTML, CSS, Bootstrap
Frameworks: Flask, Django
Databases: SQLite, MySQL

Field work experience:

2 years of expeditions to study:

Beserman dialect of Udmurt
Hill Mari

Languages: Russian (native), English (advanced), German (intermediate), Udmurt (reading), Hill Mari and Komi-Zyrian (basic knowledge of grammar).

Coding Challenge

Repo: https://github.com/apertium/apertium-udm-kpv

1. Installed Apertium tools.

2. Installed the kpv-udm language pair following these instructions: http://wiki.apertium.org/wiki/Udmurt_and_Komi. Morphological analyzers for Komi-Zyrian and Udmurt already exist.

3. Found kpv-udm parallel texts (only one is aligned).

4. Added some words from the aligned text to the bilingual dictionary and translated one sentence.

5. Estimated the coverage of Udmurt and Komi wikis: 77% for Udmurt and 90% for Komi.

6. Counted non-disambiguated words: 30% in Udmurt and 62% in Komi.
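The coverage and ambiguity figures in steps 5 and 6 can be computed from analyser output in the Apertium stream format, where each token appears as `^surface/analysis1/analysis2$` and an unknown word comes back as `^word/*word$`. The sketch below shows one possible way to count them; it is not necessarily the script used for the numbers above. It assumes that ambiguity is measured over known tokens only, ignores escaped characters in the stream, and uses a hand-written sample with placeholder tags.

```python
import re

# One lexical unit in the stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def analyser_stats(analysed_text):
    """Return (coverage, ambiguity) for analyser output in the stream format.

    coverage  = known tokens / all tokens
    ambiguity = known tokens with more than one analysis / known tokens
    """
    total = known = ambiguous = 0
    for unit in LEXICAL_UNIT.findall(analysed_text):
        surface, *analyses = unit.split("/")
        total += 1
        # Unknown words have a single reading that starts with '*'.
        if not analyses or analyses[0].startswith("*"):
            continue
        known += 1
        if len(analyses) > 1:
            ambiguous += 1
    coverage = known / total if total else 0.0
    ambiguity = ambiguous / known if known else 0.0
    return coverage, ambiguity


# Hand-written sample; the words and tags are placeholders, not real output.
sample = "^aaa/aaa<n><nom>$ ^bbb/bbb<n><nom>/bbb<v><imp>$ ^ccc/*ccc$ ^ddd/ddd<adv>$"
coverage, ambiguity = analyser_stats(sample)
print(f"coverage: {coverage:.0%}, ambiguous: {ambiguity:.0%}")
```

Running the same function on the analysed Udmurt and Komi wiki dumps before and after each dictionary update would show whether the work is moving the numbers in the right direction.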

Non-Summer-of-Code plans for the Summer

From 17th June to 2nd July I will be on the Khanty expedition, but there is an Internet connection there, so I will be able to stay in touch.

