User:Nstsj/Proposal

From Apertium
Revision as of 19:50, 21 May 2019 by Nstsj (talk | contribs) (→‎A timeline (yet to be developed))

General Info

Name: Anastasia Khorosheva

E-mail address: an.khorosheva@gmail.com

Other information that may be useful to contact you:

@nstsj (Telegram), nstsj (IRC), nstsj (GitHub)[1]

Location: Moscow, Russia

Timezone: UTC+3

Why is it that you are interested in Apertium?

I'm interested because Apertium creates free linguistic resources that can be used for various educational and research purposes. Apertium mostly deals with low-resource languages, and this topic is among my academic interests. The main problem with such languages is that they lack written (and annotated) data, which prevents us from applying most ML methods (for example, neural networks). Solving this problem requires an entirely different, more rule-based approach; for example, finite-state transducers (FSTs) yield good results. I'm interested in developing research in this area. To sum up, people at Apertium are doing a lot of good work making low-resource language data available, and I'd like to contribute to that.


Which of the published tasks are you interested in? What do you plan to do?

I'm interested in adopting an unreleased language pair. The plan is to take the French<->Corsican pair and write a language tool (e.g. a machine translator) for it.


A proposal

a title: Adopt an unreleased language pair

reasons why Google and Apertium should sponsor it:

Although Corsican is considered a low-resource language, it is vitally important for the region: according to an official survey run by the Collectivité territoriale de Corse in 2013, Corsican has between 86,800 and 130,200 speakers in Corsica, out of a total population of 309,693 inhabitants. Currently, Corsican is maintained by being taught as a voluntary subject at school; it is required at the University of Corsica, is available through adult education, and is further used in various aspects of life. Having a language tool (i.e. a translator) that is able to facilitate that process is a good and socially useful idea.

how and who it will benefit in society:

First of all, it will benefit those who want to learn and study Corsican (besides language learning, a translator would facilitate further linguistic research). Moreover, having a new pair released will improve the overall language coverage of Apertium.


A timeline (to be developed)

Columns: Weeks (dates); B-trimmed Cov. Goal; B-trimmed Cov. Reached; Testvoc; Evaluation WER; Goals; Achieved results

Community bonding (7 - 27 May)
  Goals: studying the Apertium documentation and learning all the core ideas; exploring .dix and other bilingual dictionary formats; making the work plan more detailed (discussing it with mentors)

Stage I. Morphological Analyzer and Bilingual Dictionary

Week 1 (27 May - 2 June)
Week 2 (3 June - 9 June)
  B-trimmed Cov. Goal: increase by 10% from the current coverage for cos-fra translation
  Goals: NOUNS
Week 3 (10 June - 16 June)
Week 4 (17 June - 23 June)
  Goals: (rest of Nouns) + ADJECTIVES
Week 5 (24 June - 30 June)
  Goals: PRONOUNS

Interim Evaluation

Week 6 (1 July - 7 July)
  Goals: VERBS
Week 7 (8 July - 14 July)
  Goals: VERBS
Week 8 (15 July - 21 July)
  Goals: VERBALS
Week 9 (22 July - 28 July)
  Goals: ADVERBS
Week 10 (29 July - 4 August)
Week 11 (5 August - 11 August)
  B-trimmed Cov. Goal: 93% - 95% for cos-fra translation

Stage II. Transfer rules & Evaluation

Week 12 (12 August - 18 August)
  Goals: Lexical transfer rules
Week 13 (19 August - 25 August)
  Goals: Final preparations for evaluation
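
The coverage goals and the WER column in the timeline could be tracked with a small script along the following lines. This is only a minimal sketch under stated assumptions, not the project's actual evaluation tooling: it assumes the analyser output is in the Apertium stream format, where each lexical unit looks like ^surface/analysis$ and an unknown word is analysed as *surface; it treats coverage as the share of lexical units that receive at least one analysis; and it computes WER as the word-level edit distance divided by the reference length. The function names coverage and word_error_rate and the sample strings are hypothetical, chosen for illustration.

```python
import re

# One lexical unit in Apertium stream format, e.g. ^casa/casa<n><f><sg>$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def coverage(analysed_text: str) -> float:
    """Share of lexical units that received at least one analysis.

    Assumption: an unknown word appears as ^word/*word$.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for unit in units if "/*" not in unit)
    return known / len(units)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sample: three known units and one unknown unit.
    sample = "^a/a<det>$ ^casa/casa<n>$ ^xyz/*xyz$ ^b/b<pr>$"
    print(coverage(sample))                     # 0.75
    print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Run weekly on the same held-out corpus, the two numbers could fill the "B-trimmed Cov. Reached" and "Evaluation WER" cells, provided the corpus and the reference translations stay fixed across weeks.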

Skills and qualifications

current field of study: Computational Linguistics, NLP, ML for cross-morphological methods

major: Linguistics

current degree: I'm currently in the first year of the Master's program in Computational Linguistics at NRU Higher School of Economics (Moscow, Russia).

current projects: I'm involved in several university-based projects, such as creating a tool for cross-lingual morphological analysis (especially for low-resource languages) and building a graph-based ontology for scientific papers.

scientific interests: Linguistics (morphosyntax, semantics), Computational Linguistics, NLP (MT, WSD, IR, ontologies), ML methods in linguistics

languages: Russian (native), English (advanced), French, Spanish (intermediate), Polish, Italian, German (can read)

programming skills: Python, Bash, R, Cypher.

My computer runs macOS, so I'm familiar with the basic Unix/Linux tools. I have also started working on the coding task; here is the link[2] on GitHub.

List any non-Summer-of-Code plans

employment: no work is planned during the summer

summer session: I'll have a university exam session at the end of June. During this time I won't be able to concentrate solely on the GSoC project, but this will only take one week, and I'll catch up during the following week.

In general, I'll be available for 30+ hours a week to work on my project.

other summer plans: Nothing specific is planned for the summer period since I want to spend my free time on the project.