Difference between revisions of "User:Nstsj/Proposal"
Latest revision as of 19:50, 21 May 2019
General Info
Name: Anastasia Khorosheva
E-mail address: an.khorosheva@gmail.com
Other information that may be useful to contact you:
@nstsj (Telegram), nstsj (IRC), nstsj (GitHub: https://github.com/nstsj)
Location: Moscow, Russia
Timezone: UTC+3
Why is it that you are interested in Apertium?
I'm interested because Apertium creates free linguistic resources that can be used for various educational and research purposes. Apertium mostly deals with low-resource languages, and this topic is among my academic interests. The main problem with such languages is that they lack written (and annotated) data, which prevents us from applying most ML methods (for example, neural networks). Solving this problem requires an entirely different, more rule-based approach; for example, finite-state transducers (FSTs) yield good results. I'm interested in developing research in this area. To sum up, people at Apertium are doing a lot of good work making low-resource-language data available, and I'd like to contribute to that.
Which of the published tasks are you interested in? What do you plan to do?
I'm interested in adopting an unreleased language pair. The plan is to take the French<->Corsican pair and write a language tool (e.g. a machine translator) for it.
A proposal
a title: Adopt an unreleased language pair
reasons why Google and Apertium should sponsor it:
Although Corsican is considered a low-resource language, it is vitally important for the region: according to an official survey run by the Collectivité territoriale de Corse in 2013, Corsican has between 86,800 and 130,200 speakers in Corsica, out of a total population of 309,693. Currently, Corsican is taught as a voluntary subject at school but is required at the University of Corsica; it is also available through adult education and is used in various aspects of life. A language tool (i.e. a translator) that can facilitate this process would be a good and socially useful thing to have.
how and who it will benefit in society:
First of all, it will benefit those who want to learn and study Corsican (besides language learning, a translator would facilitate further linguistic research). Moreover, having a new pair released will improve the overall language coverage of Apertium.
A timeline (to be developed)
| Weeks (dates) | B-trimmed Cov. Goal | B-trimmed Cov. Reached | Testvoc | Evaluation | WER | Goals | Achieved results |
|---|---|---|---|---|---|---|---|
| Community bonding (7 - 27 May) | | | | | | Studying the Apertium documentation and learning the core ideas; exploring .dix and other bilingual dictionary formats; making the work plan more detailed (in discussion with mentors) | |
| Stage I. Morphological Analyzer and Bilingual Dictionary | | | | | | | |
| Week 1 (27 May - 2 June) | | | | | | | |
| Week 2 (3 June - 9 June) | Increasing by 10% from current coverage for cos-fra translation | | | | | NOUNS | |
| Week 3 (10 June - 16 June) | | | | | | | |
| Week 4 (17 June - 23 June) | | | | | | (rest of Nouns) + ADJECTIVES | |
| Week 5 (24 June - 30 June) | | | | | | PRONOUNS | |
| Interim Evaluation | | | | | | | |
| Week 6 (1 July - 7 July) | | | | | | VERBS | |
| Week 7 (8 July - 14 July) | | | | | | VERBS | |
| Week 8 (15 July - 21 July) | | | | | | VERBALS | |
| Week 9 (22 July - 28 July) | | | | | | ADVERBS | |
| Week 10 (29 July - 4 August) | | | | | | | |
| Week 11 (5 August - 11 August) | 93% - 95% for cos-fra translation | | | | | | |
| Stage II. Transfer rules & Evaluation | | | | | | | |
| Week 12 (12 August - 18 August) | | | | | | Lexical transfer rules | |
| Week 13 (18 August - 25 August) | | | | | | Final preparations for evaluation | |
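The coverage goals and the WER column in the timeline could be tracked with a small script. The following is only a minimal sketch, not the evaluation procedure agreed with mentors: it assumes the analyser output is in the Apertium stream format, where an unknown word is echoed back with a leading asterisk (`^word/*word$`), it ignores escaped characters, and it computes a plain "naive" coverage rather than the B-trimmed figure named in the table. The function names `naive_coverage` and `word_error_rate`, the sample sentence and its tags are all made up for illustration.

```python
import re

# One lexical unit in Apertium stream format, e.g. ^surface/lemma<tag>$
# (simplification: escaped characters such as \$ are not handled).
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Share of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/", 1)
        # Unknown words look like "word/*word": the analysis starts with "*".
        if len(parts) == 2 and not parts[1].startswith("*"):
            known += 1
    return known / len(units)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative analyser output: three units, one of them unknown.
    sample = ("^u/u<det><def><m><sg>$ ^ghjacaru/*ghjacaru$ "
              "^dorme/dorme<vblex><pri><p3><sg>$")
    print(f"coverage: {naive_coverage(sample):.2f}")  # coverage: 0.67
    wer = word_error_rate("le chien dort dans la maison",
                          "le chien dort à la maison")
    print(f"WER: {wer:.2f}")  # WER: 0.17
```

Running the script weekly on the same corpus and the same reference translations would make the "Cov. Reached" and "WER" cells directly comparable from one week to the next.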
Skills and qualifications
current field of study: Computational Linguistics, NLP, ML for cross-morphological methods
major: Linguistics
current degree: I'm currently in the first year of a Master's program in Computational Linguistics at NRU Higher School of Economics, Moscow, Russia.
current projects: I'm involved in several university-based projects, such as creating a tool for cross-lingual morphological analysis (especially for low-resource languages) and building a graph-based ontology for scientific papers.
scientific interests: Linguistics (morphosyntax, semantics), Computational Linguistics, NLP (MT, WSD, IR, ontologies), ML methods in linguistics
languages: Russian (Native), English (Advanced), French, Spanish (Intermediate), Polish, Italian, German (can read)
programming skills: Python, Bash, R, Cypher.
My computer runs macOS, so I'm familiar with the basic Linux/Unix tools. I have also started working on the coding task; it is available on GitHub at https://github.com/nstsj/apertium-cos-fra
List any non-Summer-of-Code plans
employment: no work is planned during the summer
summer session: I'll have a university exam session at the end of June. During this time I won't be able to concentrate solely on the GSoC project, but this will only take one week, and I'll catch up during the following week.
In general, I'll be available for 30+ hours a week to work on my project.
other summer plans: Nothing specific is planned for the summer period since I want to spend my free time on the project.