Difference between revisions of "User:Nstsj/Proposal"

From Apertium
 
(15 intermediate revisions by 2 users not shown)
Line 6: Line 6:
 
E-mail address: an.khorosheva@gmail.com
 
E-mail address: an.khorosheva@gmail.com
   
Other information that may be useful to contact you (e.g. IRC):
+
Other information that may be useful to contact you:
  +
 
@nstsj (Telegram), nstsj (IRC), nstsj(Github)[https://github.com/nstsj]
 
@nstsj (Telegram), nstsj (IRC), nstsj(Github)[https://github.com/nstsj]
   
  +
Location: Moscow, Russia
Timezone: UTC+3
 
   
 
Timezone: UTC+3
   
 
== '''Why is it that you are interested in Apertium?''' ==
 
== '''Why is it that you are interested in Apertium?''' ==
Line 33: Line 35:
 
Although Cosrican is considered low-resource language, it’s vitally important for the region: according to an official survey run by the Collectivité territoriale de Corse in 2013, in Corsica, the Corsican language has a number of speakers between 86,800 and 130,200, out of a total population of 309,693 inhabitants.
 
Although Cosrican is considered low-resource language, it’s vitally important for the region: according to an official survey run by the Collectivité territoriale de Corse in 2013, in Corsica, the Corsican language has a number of speakers between 86,800 and 130,200, out of a total population of 309,693 inhabitants.
 
Currently, Corsican is maintained by being taught as a voluntary subject at school, but is required at the University of Corsica and is further used in various aspects of life. It is available through adult education. Having a language tool (i.e. a translator ) that is able to facilitate that process is a good and socially useful idea.
 
Currently, Corsican is maintained by being taught as a voluntary subject at school, but is required at the University of Corsica and is further used in various aspects of life. It is available through adult education. Having a language tool (i.e. a translator ) that is able to facilitate that process is a good and socially useful idea.
 
   
 
'''how and who it will benefit in society:'''
 
'''how and who it will benefit in society:'''
Line 39: Line 40:
 
First of all among are those who want to learn and study Corsican (besides language learning, a translator would facilitate further linguistic research), Moreover having a new pair released will improve overall language coverage of Apertium.
 
First of all among are those who want to learn and study Corsican (besides language learning, a translator would facilitate further linguistic research), Moreover having a new pair released will improve overall language coverage of Apertium.
   
  +
== '''A timeline (yet to be developed)''' ==
+
== '''A timeline (to be developed)''' ==
{| class="wikitable" border="1"
+
{| class="wikitable" style="text-align: center" border="1"
 
|-
 
|-
 
! Weeks(dates)
 
! Weeks(dates)
Line 51: Line 53:
 
!Achieved results
 
!Achieved results
 
|-
 
|-
| '''Community Bonding'''
+
|colspan="6"|'''Community bonding (7 - 27 May)'''
 
|
 
Studying Apertium Documentation, learning all the core ideas;
  +
 
Exploring .dix and other formats of a bilingual dictionary;
  +
 
Getting work plan more detailed (discussing with mentors)
 
|
 
|-
 
|colspan="8"|'''Stage I. Morphological Analyzer and Bilingual Dictionary'''
 
|
 
|
 
|-
 
|-
Line 62: Line 73:
 
|
 
|
 
|
 
|
 
|
|Studying Apertium Documentation, learning all the core ideas;
 
 
Exploring .dix and other formats of a bilingual dictionary;
 
 
Getting work plan more detailed (discussing with mentors)
 
 
|-
 
|-
 
| Week 2
 
| Week 2
 
(3 June- 9 June)
 
(3 June- 9 June)
 
|
 
|
  +
Increasing by
  +
10% from
  +
current
  +
coverage
  +
for cos-fra
  +
translation
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
| NOUNS
 
|
 
|
| NOUNS
 
 
|-
 
|-
 
| Week 3
 
| Week 3
Line 86: Line 99:
 
|
 
|
 
|
 
|
|
 
|-
 
| '''Stage I. Morphological Analyzer and Bilingual Dictionary'''
 
 
|
 
|
 
|-
 
|-
Line 95: Line 105:
 
|
 
|
 
|
 
|
|
 
 
|
 
|
 
|
 
|
 
|
 
|
 
| (rest of Nouns) + ADJECTIVES
 
| (rest of Nouns) + ADJECTIVES
 
|
 
|-
 
|-
 
| Week 5
 
| Week 5
Line 105: Line 115:
 
|
 
|
 
|
 
|
|
 
 
|
 
|
 
|
 
|
 
|
 
|
 
| PRONOUNS
 
| PRONOUNS
 
|
 
|-
  +
|colspan="8"|'''Interim Evaluation'''
 
|
 
|-
 
|-
 
| Week 6
 
| Week 6
Line 115: Line 128:
 
|
 
|
 
|
 
|
|
 
 
|
 
|
 
|
 
|
 
|
 
|
 
| VERBS
 
| VERBS
 
|
 
|-
 
|-
 
| Week 7
 
| Week 7
Line 125: Line 138:
 
|
 
|
 
|
 
|
|
 
 
|
 
|
 
|
 
|
 
|
 
|
 
| VERBS
 
| VERBS
 
|
 
|-
 
|-
 
| Week 8
 
| Week 8
Line 135: Line 148:
 
|
 
|
 
|
 
|
|
 
 
|
 
|
 
|
 
|
 
|
 
|
 
| VERBALS
 
| VERBALS
  +
|
 
|-
 
|-
 
| Week 9
 
| Week 9
Line 148: Line 161:
 
|
 
|
 
|
 
|
 
| ADVERBS
 
|
 
|
| ADVERBS
 
 
|-
 
|-
 
| Week 10
 
| Week 10
Line 164: Line 177:
 
(5 August - 11 August)
 
(5 August - 11 August)
 
|
 
|
  +
93% - 95% for
|
 
  +
cos-fra
  +
translation
  +
|
 
|
 
|
 
|
 
|
Line 171: Line 187:
 
|
 
|
 
|-
 
|-
| '''Stage II. Transfer rules & Evaluation'''
+
|colspan="8"|'''Stage II. Transfer rules & Evaluation'''
 
|
 
|
 
|-
 
|-
Line 178: Line 194:
 
|
 
|
 
|
 
|
|
 
 
|
 
|
 
|
 
|
 
|
 
|
 
| Lexical transfer rules
 
| Lexical transfer rules
  +
|
 
|-
 
|-
 
| Week 13
 
| Week 13
Line 188: Line 204:
 
|
 
|
 
|
 
|
|
 
 
|
 
|
 
|
 
|
 
|
 
|
 
| Final preparations for evaluation
 
| Final preparations for evaluation
  +
|
 
|}
 
|}
   
 
 
== '''Skills and qualifications''' ==
 
== '''Skills and qualifications''' ==
 
 
Line 214: Line 229:
 
'''programming skills:''' Python, Bash, R, Cypher.
 
'''programming skills:''' Python, Bash, R, Cypher.
   
My computer has MacOS so I'm familiar with basic Linux/Unix inventory
+
My computer has MacOS so I'm familiar with basic Linux/Unix inventory.
I have also started working on the coding task, here is the link to my GitHub
+
I have also started working on the coding task, here is the link[https://github.com/nstsj/apertium-cos-fra] on GitHub
   
 
== '''List any non-Summer-of-Code plans''' ==
 
== '''List any non-Summer-of-Code plans''' ==
 
 
  +
'''employment:''' no work is planned during the summer
'''employment:''' I'm working part-time, so this won't affect my schedule as it's already a set routine taking me no more than 13hrs a week.
 
   
'''internships:''' I've applied for LxMLS 2019 Summer School that will take place from July 11th to July 18th in Lisbon but I haven't received any confirmation yet. If I'm accepted, I plan to be able to work on a project for 15hrs during this week but then I'll switch to my usual schedule.
+
'''summer session:''' I'll have university exam session in end of June. During this time I won’t be able to concentrate solely on the GSOC-project, but this will only take one week. I'll catch up on it during the following week.
 
'''summer session:''' I'll have summer session university exams (??-??June). During this time I won’t be able to concentrate solely on the GSOC-project, but this will only take one week. I'll catch up on it during the following week.
 
   
 
In general, I'll be available for 30+ hours a week to work on my project.
 
In general, I'll be available for 30+ hours a week to work on my project.
   
'''other summer plans:''' Nothing specific is planned for this summer since I want to spend my free time on improving my coding skills.
+
'''other summer plans:''' Nothing specific is planned for the summer period since I want to spend my free time on the project.
  +
  +
[[Category:GSoC 2019 student proposals]]

Latest revision as of 19:50, 21 May 2019

General Info

Name: Anastasia Khorosheva

E-mail address: an.khorosheva@gmail.com

Other information that may be useful to contact you:

@nstsj (Telegram), nstsj (IRC), nstsj (GitHub: https://github.com/nstsj)

Location: Moscow, Russia

Timezone: UTC+3

Why is it that you are interested in Apertium?

I'm interested because Apertium creates free linguistic resources that can be used for various educational and research purposes. Apertium mostly deals with low-resource languages, and this topic is among my academic interests. The main problem with such languages is that they lack written (and annotated) data, which prevents us from applying most ML methods (for example, neural networks). Solving this problem requires an entirely different, more rule-based approach; for example, finite-state transducers (FSTs) yield good results. I'm interested in developing research in this area. To sum up, people at Apertium are doing a lot of good work making low-resource-language data available, and I'd like to contribute to that.


Which of the published tasks are you interested in? What do you plan to do?

I'm interested in adopting an unreleased language pair. The plan is to take the French<->Corsican pair and write a language tool (e.g. a machine translator) for it.


A proposal

a title: Adopt an unreleased language pair

reasons why Google and Apertium should sponsor it:

Although Corsican is considered a low-resource language, it is vitally important for the region: according to an official survey run by the Collectivité territoriale de Corse in 2013, Corsican has between 86,800 and 130,200 speakers in Corsica, out of a total population of 309,693 inhabitants. Currently, Corsican is maintained by being taught as a voluntary subject at school; it is required at the University of Corsica, is available through adult education, and is used in various aspects of life. A language tool (i.e. a translator) that can facilitate that process is a good and socially useful idea.

how and who it will benefit in society:

First of all, it will benefit those who want to learn and study Corsican (besides language learning, a translator would facilitate further linguistic research). Moreover, having a new pair released will improve the overall language coverage of Apertium.


A timeline (to be developed)

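{| class="wikitable" style="text-align: center" border="1"
|-
! Weeks (dates) !! B-trimmed Cov. Goal !! B-trimmed Cov. Reached !! Testvoc !! Evaluation !! WER !! Goals !! Achieved results
|-
| colspan="6" | '''Community bonding (7 - 27 May)''' || Studying Apertium documentation, learning all the core ideas; exploring .dix and other formats of a bilingual dictionary; making the work plan more detailed (discussing with mentors) ||
|-
| colspan="8" | '''Stage I. Morphological Analyzer and Bilingual Dictionary'''
|-
| Week 1 (27 May - 2 June) || || || || || || ||
|-
| Week 2 (3 June - 9 June) || Increasing by 10% from current coverage for cos-fra translation || || || || || NOUNS ||
|-
| Week 3 (10 June - 16 June) || || || || || || ||
|-
| Week 4 (17 June - 23 June) || || || || || || (rest of Nouns) + ADJECTIVES ||
|-
| Week 5 (24 June - 30 June) || || || || || || PRONOUNS ||
|-
| colspan="8" | '''Interim Evaluation'''
|-
| Week 6 (1 July - 7 July) || || || || || || VERBS ||
|-
| Week 7 (8 July - 14 July) || || || || || || VERBS ||
|-
| Week 8 (15 July - 21 July) || || || || || || VERBALS ||
|-
| Week 9 (22 July - 28 July) || || || || || || ADVERBS ||
|-
| Week 10 (29 July - 4 August) || || || || || || ||
|-
| Week 11 (5 August - 11 August) || 93% - 95% for cos-fra translation || || || || || ||
|-
| colspan="8" | '''Stage II. Transfer rules & Evaluation'''
|-
| Week 12 (12 August - 18 August) || || || || || || Lexical transfer rules ||
|-
| Week 13 (19 August - 25 August) || || || || || || Final preparations for evaluation ||
|}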
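
The coverage goals in the timeline could be tracked with a small script. The following is only a minimal sketch of a naive coverage measure, assuming the input is analyser output in the Apertium stream format (for example, produced by lt-proc), where each lexical unit looks like ^surface/analysis$ and unknown words are marked with an asterisk (^word/*word$). The function name, the sample sentence and its tags are illustrative assumptions and are not taken from the actual cos-fra data.

```python
import re

# Matches one lexical unit in the Apertium stream format,
# e.g. ^casa/casa<n><f><sg>$ (escaped characters are not handled here).
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the share of lexical units that received a known analysis.

    A unit counts as unknown when its first analysis starts with '*',
    which is how the analyser marks words missing from the dictionary.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        analyses = unit.split("/")[1:]  # drop the surface form
        if analyses and not analyses[0].startswith("*"):
            known += 1
    return known / len(units)


# Hypothetical analyser output: three known words and one unknown word.
sample = (
    "^a/a<det><def><f><sg>$ ^casa/casa<n><f><sg>$ "
    "^xyzzy/*xyzzy$ ^hè/esse<vbser><pri><p3><sg>$"
)
print(f"{naive_coverage(sample):.2%}")
```

Running the same function on a larger analysed corpus before and after each week's dictionary work would give the numbers for the coverage columns of the timeline.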

Skills and qualifications

current field of study: Computational Linguistics, NLP, ML for cross-morphological methods

major: Linguistics

current degree: I'm currently in the first year of a Master's program in Computational Linguistics at NRU Higher School of Economics, Moscow, Russia.

current projects: I'm involved in several university-based projects, such as creating a tool for cross-lingual morphological analysis (especially for low-resource languages) and creation of a graph-based ontology for scientific papers.

scientific interests: Linguistics (morphosyntax, semantics), Computational Linguistics, NLP (MT, WSD, IR, ontologies), ML methods in linguistics

languages: Russian (native), English (advanced), French, Spanish (intermediate), Polish, Italian, German (can read)

programming skills: Python, Bash, R, Cypher.

My computer runs macOS, so I'm familiar with basic Linux/Unix tools. I have also started working on the coding task; here is the link on GitHub: https://github.com/nstsj/apertium-cos-fra

List any non-Summer-of-Code plans

employment: no work is planned during the summer

summer session: I'll have a university exam session at the end of June. During this time I won't be able to concentrate solely on the GSoC project, but this will only take one week. I'll catch up on it during the following week.

In general, I'll be available for 30+ hours a week to work on my project.

other summer plans: Nothing specific is planned for the summer period since I want to spend my free time on the project.