Difference between revisions of "User:Nstsj/Proposal"

From Apertium

Revision as of 03:17, 6 April 2019

General Info

Name: Anastasia Khorosheva

E-mail address: an.khorosheva@gmail.com

Other information that may be useful to contact you (e.g. IRC): @nstsj (Telegram), nstsj (IRC), nstsj (GitHub)

Timezone: UTC+3


Why is it that you are interested in Apertium?

I'm interested because Apertium creates free linguistic resources that can be used for various educational and research purposes. Apertium mostly deals with low-resource languages, and this topic is among my academic interests. The main problem with such languages is that they lack written (and annotated) data, which prevents us from applying most ML methods (for example, neural networks). Solving this problem requires an entirely different, more rule-based approach. For example, using finite-state transducers (FSTs) yields good results. I'm interested in developing research in this area. To sum up, people at Apertium are doing a lot of good work making low-resource-language data available, and I'd like to contribute to that.


Which of the published tasks are you interested in? What do you plan to do?

I'm interested in adopting an unreleased language pair. The plan is to take the French<->Corsican pair and build a language tool (e.g. a machine translator) for it.


A proposal

a title: Adopt an unreleased language pair

reasons why Google and Apertium should sponsor it:

Although Corsican is considered a low-resource language, it is vitally important for the region: according to an official survey run by the Collectivité territoriale de Corse in 2013, Corsican has between 86,800 and 130,200 speakers in Corsica, out of a total population of 309,693 inhabitants. Currently, Corsican is maintained by being taught as a voluntary subject at school; it is required at the University of Corsica, is available through adult education, and is used in various aspects of life. A language tool (i.e. a translator) that facilitates this process is a good and socially useful idea.

how and who it will benefit in society:

First of all, it will benefit those who want to learn and study Corsican (besides language learning, a translator would facilitate further linguistic research). Moreover, having a new pair released will improve the overall language coverage of Apertium.

A timeline (yet to be developed)

Columns: Weeks (dates), B-trimmed Cov. Goal, B-trimmed Cov. Reached, Testvoc, Evaluation WER, Goals, Achieved results

Community Bonding

Week 1 (27 May - 2 June): Studying Apertium documentation and learning the core ideas; exploring .dix and other bilingual dictionary formats; making the work plan more detailed (discussing with mentors)

Week 2 (3 June - 9 June): NOUNS

Week 3 (10 June - 16 June): Stage I. Morphological Analyzer and Bilingual Dictionary

Week 4 (17 June - 23 June): (rest of Nouns) + ADJECTIVES

Week 5 (24 June - 30 June): PRONOUNS

Week 6 (1 July - 7 July): VERBS

Week 7 (8 July - 14 July): VERBS

Week 8 (15 July - 21 July): VERBALS

Week 9 (22 July - 28 July): ADVERBS

Week 10 (29 July - 4 August)

Week 11 (5 August - 11 August): Stage II. Transfer rules & Evaluation

Week 12 (12 August - 18 August): Lexical transfer rules

Week 13 (19 August - 25 August): Final preparations for evaluation
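
For reference, the coverage and WER figures named in the timeline columns could be computed roughly as follows. This is only a minimal sketch and not Apertium's own evaluation tooling: the function names `word_error_rate` and `naive_coverage`, the whitespace tokenization, and the `known_forms` set are all assumptions made for illustration. WER is taken here as the word-level edit distance between a reference translation and the system output, divided by the reference length; coverage is taken as the share of tokens the analyser recognises.

```python
from typing import Iterable, Set


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def naive_coverage(tokens: Iterable[str], known_forms: Set[str]) -> float:
    """Share of tokens found in a (hypothetical) set of known surface forms."""
    token_list = list(tokens)
    if not token_list:
        raise ValueError("need at least one token")
    known = sum(1 for token in token_list if token.lower() in known_forms)
    return known / len(token_list)


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c"))
    print(naive_coverage(["a", "b", "c", "d"], {"a", "b", "c"}))
```

In practice the set of known forms would come from the morphological analyser rather than a hand-written set, and the reference translations from a post-edited test corpus.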


Skills and qualifications

current field of study: Computational Linguistics, NLP, ML for cross-morphological methods

major: Linguistics

current degree: I'm currently in the first year of a Master's program in Computational Linguistics at NRU Higher School of Economics, Moscow, Russia.

current projects: I'm involved in several university-based projects, such as building a tool for cross-lingual morphological analysis (especially for low-resource languages) and creating a graph-based ontology for scientific papers.

scientific interests: Linguistics (morphosyntax, semantics), Computational Linguistics, NLP (MT, WSD, IR, ontologies), ML methods in linguistics

languages: Russian (Native), English (Advanced), French, Spanish (Intermediate), Polish, Italian, German (can read)

programming skills: Python, Bash, R, Cypher.

My computer runs macOS, so I'm familiar with the basic Linux/Unix tools. I have also started working on the coding task; here is the link to my GitHub

List any non-Summer-of-Code plans

employment: I'm working part-time, but this won't affect my schedule as it's already a set routine that takes me no more than 13 hours a week.

internships: I've applied for the LxMLS 2019 Summer School, which will take place from July 11th to July 18th in Lisbon, but I haven't received any confirmation yet. If I'm accepted, I plan to work on the project for 15 hours during that week and then switch back to my usual schedule.

summer session: I'll have summer session university exams (??-?? June). During this time I won't be able to concentrate solely on the GSoC project, but this will only take one week. I'll catch up during the following week.

In general, I'll be available for 30+ hours a week to work on my project.

other summer plans: Nothing specific is planned for this summer, since I want to spend my free time improving my coding skills.