Difference between revisions of "User:Zu-ann"

 
(21 intermediate revisions by the same user not shown)
==Contact information==


{|
|'''Name:''' Anna Zueva
|-
|'''E-mail address:''' anna.zueva.v@gmail.com
|-
|'''IRC:''' zu_ann
|-
|'''Location:''' Moscow, Russia
|-
|'''Timezone:''' UTC+3
|-
|'''GitHub:''' https://github.com/zu-ann
|-
|'''Forked repositories for coding challenge:'''
|-
| https://github.com/zu-ann/apertium-tat
|-
| https://github.com/zu-ann/apertium-bak
|-
| https://github.com/zu-ann/apertium-tat-bak
|}


== Why is it that you are interested in machine translation and in Apertium? ==


Apertium is a free/open-source machine translation platform that gives people access to a large amount of information in other languages through fast and understandable translation into a language they know. Rule-based machine translation, which Apertium uses, relies on linguistic descriptions of languages (grammars and vocabularies), in contrast to other kinds of machine translation. It is therefore a practical use of the linguistic data we have, and I find that fascinating.
Furthermore, I believe that everyone should have access to a fast and user-friendly translator that can translate to and from their native language. Unlike many other translation platforms, Apertium works with minority languages and provides such a translator for them, so that speakers of these languages, as well as other people interested in minority languages, can get machine translation for them. I would be happy to have the opportunity to contribute to it.



== Which of the published tasks are you interested in? What do you plan to do? ==


I am planning to develop the existing Tatar and Bashkir language pair (Bashkir -> Tatar), which is currently in the nursery.



== Reasons why Google and Apertium should sponsor it ==


Tatar and Bashkir are closely related languages, but Apertium currently does not offer translation between them, even though it specialises in closely related languages. Moreover, the only other machine translation platform that supports text translation between Tatar and Bashkir is Yandex.Translate. It is not free/open-source (so, for example, not everyone can contribute to it) and it is not rule-based. Because Yandex.Translate is statistical, its output sometimes contains words that were not in the input or that do not exist at all. In my opinion, the rule-based machine translation for this pair that I am going to develop can do better in terms of word error rate (WER). Implementing this pair will also increase the number of Apertium users, as there are about 6.5 million Tatar speakers and more than 1.2 million Bashkir speakers.



== A description of how and who it will benefit in society ==


There are 6.5 million Tatar speakers and more than 1.2 million Bashkir speakers, who will get the opportunity to translate automatically from Bashkir to Tatar using a stable rule-based machine translator.
Furthermore, Bashkir has the status of an endangered language, so releasing the language pair will help to support and promote it. Native speakers of Tatar will be able to translate the information they need from Bashkir and read it in their native language.


== Work plan ==

'''Available resources'''
* Dictionaries:
** GLOSBE Tatar-Bashkir Dictionary https://glosbe.com/tt/ba
** Russian-Tatar and Tatar-Russian online dictionaries http://www.tatar.com.ru/, http://suzlek.tatarstan.ru/, http://suzlek.antat.ru/
** Russian-Bashkir and Bashkir-Russian online dictionaries http://huzlek.bashqort.com/
** and printed dictionaries
* Resources for Bashkir corpus:
** news portals, for example: http://www.bashinform.ru/
* Resources for Tatar corpus:
** news portals, for example: http://tatarstan.ru/
* Resources for parallel corpus:
** online newspapers both in Tatar and Bashkir, for example: https://ru.wikipedia.org/wiki/Районные_газеты_Башкортостана, http://rbsmi.ru/belizv_t/news/ in Tatar and http://rbsmi.ru/belizv_b/news/ in Bashkir
** Bible translations http://ibt.org.ru/ru/

'''Current state'''
* apertium-tat: coverage ~91%, stems 25,789 (http://wiki.apertium.org/wiki/Apertium-tat)
* apertium-bak: coverage ~66%, stems 2,524 (http://wiki.apertium.org/wiki/Apertium-bak)
* apertium-tat-bak.tat-bak.dix: about 2500-2800 words

=== Work plan ===
* Week 0: collecting Tatar and Bashkir corpora, scraping a parallel corpus, making a frequency dictionary
* Week 1 (14.05 - 20.05): adding basic numerals and postpositions
* Week 2 (21.05 - 27.05): adding conjunctions
* Week 3 (28.05 - 03.06): adding adverbs
* Week 4 (04.06 - 10.06): adding pronouns and determiners
* '''Deliverable #1'''
* Week 5 (11.06 - 17.06): adding adjectives and adverbs
* Week 6 (18.06 - 24.06): Midterm evaluation.
* Week 7 (25.06 - 01.07): adding nouns
* Week 8 (02.07 - 08.07):
* '''Deliverable #2'''
* Week 9 (09.07 - 15.07): adding verbs
* Week 10 (16.07 - 22.07):
* Week 11 (23.07 - 29.07):
* Week 12 (30.07 - 05.08): Final evaluation. Tidying up, releasing.
* '''Project completed'''

{|class="wikitable"
| '''Week''' || '''Dates''' || '''Actions'''
|-
|0|| || collecting Tatar and Bashkir corpora, scraping a parallel corpus, making a frequency dictionary
|-
|1|| 14.05 - 20.05 || adding numerals, postpositions and conjunctions
|-
|2|| 21.05 - 27.05 || adding pronouns and determiners
|-
|3 || 28.05 - 03.06 || adding adjectives and adverbs
|-
|4 || 04.06 - 10.06 || adding adjectives and adverbs
|-
| colspan="3" |'''Deliverable #1'''
|-
|5 || 11.06 - 17.06 || adding nouns
|-
|6 || 18.06 - 24.06 || adding nouns + Midterm evaluation
|-
|7 || 25.06 - 01.07 || adding nouns
|-
|8 || 02.07 - 08.07 || working on disambiguation
|-
| colspan="3" | '''Deliverable #2'''
|-
|9 || 09.07 - 15.07 || adding verbs
|-
|10 || 16.07 - 22.07 || adding verbs
|-
|11 || 23.07 - 29.07 || working on disambiguation
|-
|12 || 30.07 - 05.08 || Final evaluation. Tidying up, releasing.
|-
| colspan="3" | '''Project completed!'''
|}


== List your skills and qualifications ==
I am a third-year bachelor's student at the Linguistics Faculty of the National Research University Higher School of Economics (NRU HSE), Russia.


'''Technical skills:'''
* Python (including BeautifulSoup, psycopg (library for PostgreSQL), Flask, Django, pyTelegramBotAPI, familiar with machine learning using numpy, pandas and sklearn)
* HTML and CSS (also using Bootstrap 4)
* XML, JSON
* R


'''Languages:''' Russian (native), English (advanced), Spanish and German (intermediate), French (pre-intermediate), basic knowledge of grammar in Tatar and Bashkir.

== Non-Summer-of-Code plans you have for the Summer ==



Latest revision as of 16:01, 27 March 2018

Contact information

Name: Anna Zueva
E-mail address: anna.zueva.v@gmail.com
IRC: zu_ann
Location: Moscow, Russia
Timezone: UTC+3
GitHub: https://github.com/zu-ann
Forked repositories for coding challenge:
https://github.com/zu-ann/apertium-tat
https://github.com/zu-ann/apertium-bak
https://github.com/zu-ann/apertium-tat-bak

Why is it that you are interested in machine translation and in Apertium?

Apertium is a free/open-source machine translation platform that gives people access to a large amount of information in other languages through fast and understandable translation into a language they know. Rule-based machine translation, which Apertium uses, relies on linguistic descriptions of languages (grammars and vocabularies), in contrast to other kinds of machine translation. It is therefore a practical use of the linguistic data we have, and I find that fascinating.

Furthermore, I believe that everyone should have access to a fast and user-friendly translator that can translate to and from their native language. Unlike many other translation platforms, Apertium works with minority languages and provides such a translator for them, so that speakers of these languages, as well as other people interested in minority languages, can get machine translation for them. I would be happy to have the opportunity to contribute to it.

Which of the published tasks are you interested in? What do you plan to do?

I am planning to develop the existing Tatar and Bashkir language pair (Bashkir -> Tatar), which is currently in the nursery.

Reasons why Google and Apertium should sponsor it

Tatar and Bashkir are closely related languages, but Apertium currently does not offer translation between them, even though it specialises in closely related languages. Moreover, the only other machine translation platform that supports text translation between Tatar and Bashkir is Yandex.Translate. It is not free/open-source (so, for example, not everyone can contribute to it) and it is not rule-based. Because Yandex.Translate is statistical, its output sometimes contains words that were not in the input or that do not exist at all. In my opinion, the rule-based machine translation for this pair that I am going to develop can do better in terms of word error rate (WER). Implementing this pair will also increase the number of Apertium users, as there are about 6.5 million Tatar speakers and more than 1.2 million Bashkir speakers.
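
A comparison in terms of WER could be measured roughly as follows. This is a minimal sketch, not Apertium's own evaluation tooling: it assumes whitespace tokenisation and the usual word-level edit distance (substitutions, insertions and deletions) divided by the length of the reference translation. The function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder sentences: one substitution and one deletion over four words.
    print(word_error_rate("the cat sat down", "the dog sat"))  # 0.5
```

A lower value means the machine output is closer to the post-edited reference, so two translators can be compared on the same test sentences.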

A description of how and who it will benefit in society

There are 6.5 million Tatar speakers and more than 1.2 million Bashkir speakers, who will get the opportunity to translate automatically from Bashkir to Tatar using a stable rule-based machine translator. Furthermore, Bashkir has the status of an endangered language, so releasing the language pair will help to support and promote it. Native speakers of Tatar will be able to translate the information they need from Bashkir and read it in their native language.

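Work plan

Available resources

  • Dictionaries:
      • GLOSBE Tatar-Bashkir Dictionary https://glosbe.com/tt/ba
      • Russian-Tatar and Tatar-Russian online dictionaries http://www.tatar.com.ru/, http://suzlek.tatarstan.ru/, http://suzlek.antat.ru/
      • Russian-Bashkir and Bashkir-Russian online dictionaries http://huzlek.bashqort.com/
      • and printed dictionaries
  • Resources for Bashkir corpus:
      • news portals, for example: http://www.bashinform.ru/
  • Resources for Tatar corpus:
      • news portals, for example: http://tatarstan.ru/
  • Resources for parallel corpus:
      • online newspapers both in Tatar and Bashkir, for example: https://ru.wikipedia.org/wiki/Районные_газеты_Башкортостана, http://rbsmi.ru/belizv_t/news/ in Tatar and http://rbsmi.ru/belizv_b/news/ in Bashkir
      • Bible translations http://ibt.org.ru/ru/

Current state

  • apertium-tat: coverage ~91%, stems 25,789 (http://wiki.apertium.org/wiki/Apertium-tat)
  • apertium-bak: coverage ~66%, stems 2,524 (http://wiki.apertium.org/wiki/Apertium-bak)
  • apertium-tat-bak.tat-bak.dix: about 2500-2800 words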
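
Coverage figures like those quoted for apertium-tat (~91%) and apertium-bak (~66%) are usually understood as naive coverage: the share of corpus tokens for which the analyser returns at least one analysis. The sketch below only illustrates that ratio under this assumption. The predicate `is_known` is hypothetical and stands in for a lookup in the real morphological analyser; here a plain Python set plays that role.

```python
from typing import Callable, Iterable


def naive_coverage(tokens: Iterable[str], is_known: Callable[[str], bool]) -> float:
    """Fraction of tokens that the (stand-in) analyser recognises."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    known = sum(1 for token in token_list if is_known(token))
    return known / len(token_list)


if __name__ == "__main__":
    # Hypothetical toy lexicon and corpus; a real run would query the analyser.
    lexicon = {"a", "b", "c"}
    corpus = ["a", "b", "c", "d"]
    print(naive_coverage(corpus, lexicon.__contains__))  # 0.75
```

Tracking this number week by week is one simple way to see whether newly added stems actually appear in the corpora.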

{|class="wikitable"
| '''Week''' || '''Dates''' || '''Actions'''
|-
|0 || || collecting Tatar and Bashkir corpora, scraping a parallel corpus, making a frequency dictionary
|-
|1 || 14.05 - 20.05 || adding numerals, postpositions and conjunctions
|-
|2 || 21.05 - 27.05 || adding pronouns and determiners
|-
|3 || 28.05 - 03.06 || adding adjectives and adverbs
|-
|4 || 04.06 - 10.06 || adding adjectives and adverbs
|-
| colspan="3" | '''Deliverable #1'''
|-
|5 || 11.06 - 17.06 || adding nouns
|-
|6 || 18.06 - 24.06 || adding nouns + Midterm evaluation
|-
|7 || 25.06 - 01.07 || adding nouns
|-
|8 || 02.07 - 08.07 || working on disambiguation
|-
| colspan="3" | '''Deliverable #2'''
|-
|9 || 09.07 - 15.07 || adding verbs
|-
|10 || 16.07 - 22.07 || adding verbs
|-
|11 || 23.07 - 29.07 || working on disambiguation
|-
|12 || 30.07 - 05.08 || Final evaluation. Tidying up, releasing.
|-
| colspan="3" | '''Project completed!'''
|}
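
The frequency dictionary planned for week 0 could be built along these lines. This is a minimal sketch under stated assumptions: the corpus is available as plain-text lines, tokens are runs of Unicode word characters (optionally joined by hyphens), and everything is lowercased. The names `TOKEN_RE` and `frequency_dictionary` are hypothetical, and the sample lines are placeholders rather than real corpus text.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# \w is Unicode-aware for str patterns in Python 3, so Cyrillic letters,
# including the additional Tatar and Bashkir ones, are matched as well.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")


def frequency_dictionary(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (word, count) pairs sorted from most to least frequent."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts.most_common()


if __name__ == "__main__":
    sample = ["a b a", "B c"]  # placeholder corpus lines
    for word, count in frequency_dictionary(sample):
        print(word, count)
```

Sorting by frequency makes it easy to add the most common missing stems first, which gives the largest gain in coverage per added entry.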

List your skills and qualifications

I am a third-year bachelor's student at the Linguistics Faculty of the National Research University Higher School of Economics (NRU HSE), Russia.

Technical skills:

  • Python (including BeautifulSoup, psycopg (library for PostgreSQL), Flask, Django, pyTelegramBotAPI, familiar with machine learning using numpy, pandas and sklearn)
  • HTML and CSS (also using Bootstrap 4)
  • XML, JSON
  • R

Languages: Russian (native), English (advanced), Spanish and German (intermediate), French (pre-intermediate), basic knowledge of grammar in Tatar and Bashkir.

Non-Summer-of-Code plans you have for the Summer

At the end of May I will have to present my coursework at the university, and in the third week of June I will have to take exams, so I will be able to work fewer hours during those two weeks. Between these weeks and for the rest of the summer I have no non-GSoC plans, so I will be able to work full time and catch up on everything.