User:Nstsj/Proposal
General Info
Name: Anastasia Khorosheva
E-mail address: an.khorosheva@gmail.com
Other information that may be useful to contact you:
@nstsj (Telegram), nstsj (IRC), nstsj (GitHub)[1]
Location: Moscow, Russia
Timezone: UTC+3
Why is it that you are interested in Apertium?
I'm interested in Apertium because it creates free linguistic resources that can be used for educational and research purposes. Apertium mostly deals with low-resource languages, and this topic is among my academic interests. The main problem with such languages is the lack of written (and annotated) data, which prevents us from applying most ML methods (for example, neural networks). Solving this problem requires a different, more rule-based approach; finite-state transducers (FSTs), for example, yield good results. I'm interested in pursuing research in this area. To sum up, people at Apertium are doing a lot of good work making low-resource language data available, and I'd like to contribute to that.
Which of the published tasks are you interested in? What do you plan to do?
I'm interested in adopting an unreleased language pair. The plan is to take the French<->Corsican pair and write a language tool (e.g. a machine translator) for it.
A proposal
a title: Adopt an unreleased language pair
reasons why Google and Apertium should sponsor it:
Although Corsican is considered a low-resource language, it is vitally important for the region: according to an official survey run by the Collectivité territoriale de Corse in 2013, Corsican has between 86,800 and 130,200 speakers in Corsica, out of a total population of 309,693 inhabitants. Currently, the language is maintained by being taught as a voluntary subject at school; it is required at the University of Corsica, is available through adult education, and is used in various aspects of life. A language tool (i.e. a translator) that facilitates this process would be socially useful.
how and who it will benefit in society:
First of all, it will benefit those who want to learn and study Corsican (besides language learning, a translator would facilitate further linguistic research). Moreover, releasing a new pair will improve the overall language coverage of Apertium.
A timeline (yet to be developed)
Weeks (dates) | B-trimmed coverage goal | B-trimmed coverage reached | Testvoc | Evaluation | WER | Goals | Achieved results | |
---|---|---|---|---|---|---|---|---|
Community bonding (7 - 27 May) |
Studying the Apertium documentation and learning the core ideas; exploring .dix and other bilingual dictionary formats; refining the work plan in discussion with mentors |
|||||||
Stage I. Morphological Analyzer and Bilingual Dictionary | ||||||||
Week 1
(27 May - 2 June) |
||||||||
Week 2
(3 June - 9 June) |
Increase coverage for cos-fra translation by 10% over the current level |
NOUNS | ||||||
Week 3
(10 June - 16 June) |
||||||||
Week 4
(17 June - 23 June) |
(rest of Nouns) + ADJECTIVES | |||||||
Week 5
(24 June - 30 June) |
PRONOUNS | |||||||
Interim Evaluation | ||||||||
Week 6
(1 July - 7 July) |
VERBS | |||||||
Week 7
(8 July - 14 July) |
VERBS | |||||||
Week 8
(15 July - 21 July) |
VERBALS | |||||||
Week 9
(22 July - 28 July) |
ADVERBS | |||||||
Week 10
(29 July - 4 August) |
||||||||
Week 11
(5 August - 11 August) |
93% - 95% for cos-fra translation |
|||||||
Stage II. Transfer rules & Evaluation | ||||||||
Week 12
(12 August - 18 August) |
Lexical transfer rules | |||||||
Week 13
(19 August - 25 August) |
Final preparations for evaluation |
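The two metrics tracked in the timeline, coverage and WER, can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not part of the Apertium tooling: coverage is taken as the share of lexical units in the analyser's output that receive at least one analysis (unknown words are marked with `*` in the Apertium stream format), and WER as the word-level edit distance divided by the reference length. The function names `naive_coverage` and `word_error_rate` and the sample strings are hypothetical and made up for this example.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analyses$
LEXICAL_UNIT = re.compile(r"\^([^/$]+)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Share of lexical units with at least one analysis.

    Assumption: unknown words appear as ^word/*word$ in the analyser output.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return known / len(units)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the reference prefix seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up analyser output: three lexical units, one of them unknown.
sample_analysis = (
    "^u/u<det><def><m><sg>$ "
    "^ghjattu/*ghjattu$ "
    "^dorme/dorme<vblex><pri><p3><sg>$"
)
coverage = naive_coverage(sample_analysis)

# Made-up reference translation and system output.
wer = word_error_rate("le chat dort sur le lit", "le chat dort dans lit")
```

For the B-trimmed coverage column, the same calculation would presumably be run on the output of the analyser trimmed to the bilingual dictionary, over a fixed corpus, so that weekly figures are comparable.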
Skills and qualifications
current field of study: Computational Linguistics, NLP, ML for cross-morphological methods
major: Linguistics
current degree: I'm currently in the first year of a Master's program in Computational Linguistics at NRU Higher School of Economics, Russia (Moscow).
current projects: I'm involved in several university-based projects, such as creating a tool for cross-lingual morphological analysis (especially for low-resource languages) and building a graph-based ontology for scientific papers.
scientific interests: Linguistics (morphosyntax, semantics), Computational Linguistics, NLP (MT, WSD, IR, ontologies), ML methods in linguistics
languages: Russian (Native), English (Advanced), French, Spanish (Intermediate), Polish, Italian, German (can read)
programming skills: Python, Bash, R, Cypher.
I use macOS, so I'm familiar with the basic Unix/Linux command-line tools. I have also started working on the coding task; my progress is available on GitHub[2].
List any non-Summer-of-Code plans
employment: no work is planned during the summer
summer session: I'll have a university exam session at the end of June. During this time I won't be able to concentrate solely on the GSoC project, but this will only take one week, and I'll catch up during the following week.
In general, I'll be available for 30+ hours a week to work on my project.
other summer plans: Nothing specific is planned for the summer period since I want to spend my free time on the project.