Difference between revisions of "User:Uliana/gsoc-propuesta"

From Apertium

Revision as of 16:54, 23 March 2016

Contacts

Uliana Sentsova

E-mail: uliana.sentsova@gmail.com

Number: +7 (916) 774-95-30

Skype: ulyanasidorova

IRC channel: uliana at #apertium

Education and achievements

Lomonosov Moscow State University

Qualification: Bachelor in Linguistics (Romance and Germanic languages)

GPA: 10.0 / 10.0


National Research University „Higher School of Economics“

Qualification: Major in Natural Language Processing

Current GPA: 8.5 / 10.0


2015: Awardee of the graduates’ competition „Natural Language Processing” (a competition for students held by the National Research University Higher School of Economics)

2014: Scholarship of the Academic Council of MSU for scientific activities (a special award for the top 10% of students, based on academic excellence and scientific activity)

2013: Enhanced State Academic Scholarship for scientific activities (awarded on the basis of academic excellence and scientific achievements)

Relevant Experience

Building Open Source Information Extraction System for Russian Language Project

Organisation: National Research University „Higher School of Economics”

Project roles: project manager, software developer (Python)

Description: Creating a hybrid information extraction system that combines a rule-based approach with machine learning. The system extracts named entities (persons, locations and organizations) and will become part of the NLP technology stack developed by the National Research University „Higher School of Economics”. The system currently has 93% precision (evaluated by the Dialogue Evaluation Conference on 37 000 annotated texts).


My interest in Machine Translation

Machine translation is far from being a solved problem. Despite the emergence of many statistical approaches to machine translation, they still fail to cover many aspects of language structure. First of all, they do not cover languages with small speaker communities, because too little data has been collected. Besides that, they do not fully take into account the structural differences between the two languages.

...

My interest in Apertium projects

I am interested in working on an unreleased language pair for Sicilian-Spanish translation.

My coding challenge

The general goals of my coding challenge were:

- to introduce myself to the community;

- to understand how the Apertium developer team works;

- to familiarize myself with the architecture of the platform;

- to lay the foundations of the project I could accomplish during the summer.

Regarding the prospective project, I have accomplished the following tasks:

- I created a monolingual package for the Sicilian language and a bilingual package for the Sicilian-Spanish language pair;

- I expanded the dictionary with basic paradigms and the most frequent words of the Sicilian language (frequent verbs, nouns and adjectives, pronouns, prepositions and some adverbs);

- I added the respective translations to the Sicilian-Spanish dictionary;

- I prepared some important resources on the structure of the Sicilian language. These include a list of Sicilian words parsed from the Sicilian Wiktionary, grammar books on Standard Sicilian, Italian-Sicilian online dictionaries, and research articles on the differences between Sicilian and other Romance languages. I also got in touch with contributors to the Sicilian Wikipedia and, thanks to spectre, reached a computational linguist studying in Munich who has explored HFST for Sicilian verbs.

While working on the coding challenge, I committed all the changes I made in the respective packages to SVN.


Here are the links to the work I've done under the mentors' supervision:

Sicilian monolingual package: here.

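Sicilian-Spanish bilingual package: https://svn.code.sf.net/p/apertium/svn/incubator/apertium-scn-spa/

These two scientific articles from Wikipedia can now be translated from Sicilian to Spanish: https://svn.code.sf.net/p/apertium/svn/incubator/apertium-scn-spa/texts/tokamak-1.spa.txt and https://svn.code.sf.net/p/apertium/svn/incubator/apertium-scn-spa/texts/tokamak-2.spa.txt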

Proposal and work plan

Pre-work period

Main goal: review and improve the technical and linguistic skills required for the project.

Tasks:

- extend my knowledge of Standard Sicilian language;

- get in contact with native speakers of Sicilian;

- write a Python script to parse the Sicilian Wiktionary;

- prepare all linguistic resources for the Sicilian language;

- try to write transfer rules and lexical selection rules;

- add more basic paradigms for frequent words.
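The Wiktionary parsing task above could start from something like the following minimal sketch. It assumes the input is a standard MediaWiki XML export in which each `<page>` has a `<title>` and an `<ns>` child, and that main-namespace entries carry `ns` equal to `0`. The function name `iter_headwords` and the sample pages are illustrative only and are not taken from the actual project.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace prefix, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_headwords(xml_stream):
    """Yield titles of main-namespace pages from a MediaWiki XML export.

    Assumption: each <page> element has <title> and <ns> children,
    and dictionary entries live in namespace 0.
    """
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        namespace = None
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                namespace = child.text
        if title and namespace == "0":
            yield title
        elem.clear()  # release the page's children to keep memory use low


# Tiny made-up sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>casa</title><ns>0</ns></page>
  <page><title>Project:Example</title><ns>4</ns></page>
  <page><title>suli</title><ns>0</ns></page>
</mediawiki>"""

headwords = list(iter_headwords(io.BytesIO(SAMPLE)))
print(headwords)
```

For a real dump, the `io.BytesIO(SAMPLE)` argument would be replaced by an open file object; extracting part-of-speech information from the page text would need additional, wiki-specific parsing that is not shown here.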


First month


Week 1

Main goal: working with open classes of words (nouns): expanding dictionaries, writing transfer rules.

Tasks:

- prepare a list of Sicilian nouns and respective translations to Spanish;

- prepare the paradigms of nouns in Sicilian;

- add the list to the dictionaries;

- add different spelling forms for Sicilian nouns;

- write transfer rules and lexical selection rules for nouns.
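As a rough illustration of the dictionary-expansion step, the sketch below turns a prepared noun list into entries in the usual Apertium dictionary XML shape (`<e lm=...>` with `<i>` and `<par>` for the monolingual dictionary, `<e><p><l>...</l><r>...</r></p></e>` for the bilingual one). The paradigm name `cas/a__n`, the helper names and the sample word are hypothetical, and a real bilingual entry may need further tags such as gender.

```python
from xml.sax.saxutils import escape, quoteattr


def monodix_entry(lemma, stem, paradigm):
    """Build a monolingual dictionary entry: stem plus a paradigm reference."""
    return "<e lm={}><i>{}</i><par n={}/></e>".format(
        quoteattr(lemma), escape(stem), quoteattr(paradigm)
    )


def bidix_entry(left, right, tag="n"):
    """Build a bilingual dictionary entry pairing two lemmas with one tag."""
    symbol = "<s n={}/>".format(quoteattr(tag))
    return "<e><p><l>{}{}</l><r>{}{}</r></p></e>".format(
        escape(left), symbol, escape(right), symbol
    )


# Hypothetical input rows: (Sicilian lemma, stem, paradigm name, Spanish lemma).
NOUNS = [("casa", "cas", "cas/a__n", "casa")]

for lemma, stem, paradigm, spanish in NOUNS:
    print(monodix_entry(lemma, stem, paradigm))
    print(bidix_entry(lemma, spanish))
```

Generating entries from a reviewed list rather than typing them by hand keeps the monolingual and bilingual dictionaries in step, which should make later testvoc runs easier to clean up.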


Week 2

Main goal: Working with open classes of words (adjectives and adverbs): expanding dictionaries.

Tasks:

- prepare a list of Sicilian adjectives and adverbs and respective translations to Spanish;

- prepare the paradigms of adjectives in Sicilian;

- add the list to the dictionaries;

- add different spelling forms for Sicilian adjectives and adverbs;

- write transfer rules and lexical selection rules for adjectives and adverbs.


Week 3

Main goal: Working with open classes of words (verbs): expanding dictionaries.

Tasks:

- create all paradigms for verb conjugation in Sicilian;

- create different spelling forms for verbs;

- create a list of verbs in Sicilian and their translation pairs in Spanish;

- add the verbs to the monolingual dictionary and the translation pairs to the bilingual dictionary.


Week 4

Main goal: Continue working with open classes of words (verbs): expanding dictionaries.

Tasks:

- create all missing paradigms for verbs in Sicilian;

- create a list of verbs in Sicilian and their translation pairs in Spanish;

- add the verbs to the monolingual dictionary and the translation pairs to the bilingual dictionary.


Deliverable #1. Testvoc for nouns and adjectives must be clean.
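A testvoc run is usually considered clean when no translated form carries an error mark: in Apertium output, `@` flags a lemma missing from the bilingual dictionary and `#` flags a form the target generator could not produce. The sketch below scans testvoc-style lines for these marks. The line layout (stages separated by `-->`, final surface form last) and the sample lines are assumptions for illustration, not output from the actual pair.

```python
import re

# An error mark is assumed to sit at the start of a token in the final output.
ERROR_MARK = re.compile(r"(?:^|\s)([@#])\S")


def failing_lines(lines):
    """Return (mark, line) pairs for lines whose final stage has '@' or '#'."""
    failures = []
    for line in lines:
        # Only the last pipeline stage (the generated surface form) is checked.
        output = line.rsplit("-->", 1)[-1]
        match = ERROR_MARK.search(output)
        if match:
            failures.append((match.group(1), line.strip()))
    return failures


# Made-up sample lines in the assumed layout.
SAMPLE = [
    "^casa<n><f><sg>$ --> ^casa<n><f><sg>$ --> casa",
    "^suli<n><m><sg>$ --> ^@suli<n><m><sg>$ --> @suli",
    "^libbru<n><m><pl>$ --> ^libro<n><m><pl>$ --> #libro",
]

for mark, line in failing_lines(SAMPLE):
    print(mark, line)
```

An empty result for a given word class would correspond to the "clean" state required by the deliverable.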



Second month


Week 5

Main goal: working with open classes of words (verbs): writing transfer rules and lexical selection rules.

Tasks:

- study the differences between the usage of verbs and tenses in Sicilian and Spanish, based on the corpus;

- write lexical selection rules for verbs;

- write transfer rules for tenses.


Week 6

Main goal: working with closed classes of words (prepositions, pronouns, numerals): expanding dictionaries.

Tasks:

- create a list of closed class words in Sicilian and their translation pairs in Spanish;

- create all paradigms for closed classes of words in Sicilian;

- add the words to the monolingual dictionary and the translation pairs to the bilingual dictionary.


Week 7

Main goal: working with closed classes of words (prepositions, pronouns, numerals): writing transfer rules.

Tasks:

- study the differences between the usage of the added words in Sicilian and Spanish, based on the corpus;

- write transfer rules.


Week 8

Main goal: checking the previous work: reviewing dictionaries, fixing bugs.

Tasks:

- check the correctness of the paradigms in the Sicilian dictionary;

- check the correctness of translation pairs in the bilingual dictionary;

- check the correctness of transfer rules for every word class;

- check the correctness of lexical selection rules for every word class;

- correct mistakes.


Deliverable #2. Testvoc for verbs must be clean. Testvoc for closed classes must be clean.



Third month

Week 9

Main goal: working with abbreviations, idioms, etc.: expanding dictionaries.

Tasks:

- prepare a list of frequent abbreviations and their translations to Spanish;

- prepare a list of frequent idioms and their translations to Spanish;

- add the words from the list to both dictionaries.


Week 10

Main goal: continue working with idioms: expanding dictionaries, writing rules.

Tasks:

- expand a list of frequent idioms and their translations to Spanish;

- study the differences in translating idioms in Spanish and Sicilian;

- add new idioms and translation pairs to the dictionaries;

- write lexical selection rules and transfer rules for idioms.


Week 11

Main goal: preparing a parallel corpus for final evaluation.

Tasks:

- collect texts for the corpus (~ 2000 words);

- translate the texts into Spanish.
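Once such a parallel corpus exists, a preliminary evaluation could compare the system output with the reference translation using word error rate (WER): the word-level edit distance divided by the reference length. The sketch below is a plain Levenshtein implementation with whitespace tokenization; it is an assumption about how the evaluation might be done, not the project's actual evaluation tool, and the sample sentences are invented.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token lists (row-by-row DP)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, 1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, 1):
            current.append(
                min(
                    previous[j] + 1,  # deletion of a reference token
                    current[j - 1] + 1,  # insertion of a hypothesis token
                    previous[j - 1] + (ref_token != hyp_token),  # substitution
                )
            )
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """WER = edit distance / number of reference words."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Invented example: one reference word is missing from the system output.
wer = word_error_rate("el sol es una estrella", "el sol es estrella")
print(round(wer, 2))
```

With a corpus of about 2000 words, a single WER figure is only a coarse signal, so it would be worth reading the differences alongside the number.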


Week 12

Main goal: prepare for the final evaluation.

Tasks:

- make a preliminary evaluation;

- check the correctness of the paradigms in the Sicilian dictionary;

- check the correctness of translation pairs in the bilingual dictionary;

- check the correctness of transfer rules for every word class;

- check the correctness of lexical selection rules for every word class;

- correct mistakes.


Deliverable #3: testvoc must be clean. Coverage of more than 80% of the Sicilian Wikipedia corpus.
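The coverage target can be checked with a naive count over the morphological analyser's output, where each token appears as `^surface/analysis$` and an unknown word has an analysis beginning with `*`. The sketch below assumes that stream format and ignores escaped characters and superblanks; the sample analyses and tags are made up for illustration.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2...$
LEXICAL_UNIT = re.compile(r"\^([^/^$]+)/([^$]*)\$")


def coverage(analysed_text):
    """Share of lexical units whose analysis is not marked unknown ('*')."""
    known = 0
    total = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if not analyses.startswith("*"):
            known += 1
    return known / total if total else 0.0


# Made-up analyser output: three known words and one unknown word.
SAMPLE = (
    "^la/la<det><def><f><sg>$ ^casa/casa<n><f><sg>$ "
    "^tokamak/*tokamak$ ^suli/suli<n><m><sg>$"
)

print(coverage(SAMPLE))
```

Run over the analysed Sicilian Wikipedia corpus, a result above 0.8 would correspond to the 80% target stated in the deliverable.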



Week 13

Main goal: Reserved for final improvements, documentation, sharing of results.



Project completed.