Difference between revisions of "User:Uliana/gsoc-propuesta"

From Apertium

Revision as of 18:01, 17 March 2016

Contacts

Uliana Sentsova

E-mail: uliana.sentsova@gmail.com

Phone: +7 (916) 774-95-30

Skype: ulyanasidorova

IRC channel: uliana at #apertium

Education

Lomonosov Moscow State University

Qualification: Bachelor's degree in Linguistics (Romance and Germanic languages)

GPA: 10.0 / 10.0


National Research University „Higher School of Economics“

Qualification: Major in Natural Language Processing

Current GPA: 8.5 / 10.0


2015: Awardee of the graduates’ competition „Natural Language Processing” (a competition for students held by National Research University Higher School of Economics)

2014: Scholarship of the Academic Council of MSU for scientific activities (a special award for the top 10% of students, based on academic excellence and scientific activity)

2013: Enhanced State Academic Scholarship for scientific activities (awarded on the basis of academic excellence and scientific achievements)

Projects

„Building Open Source Information Extraction System for Russian Language”

Organisation: National Research University „Higher School of Economics”

Project roles: project manager, software developer (Python)

Description: Creating a hybrid information extraction system that combines a rule-based approach with machine learning. The system extracts named entities (persons, locations and organizations) and will become part of the NLP technology stack developed by National Research University „Higher School of Economics”.

As project coordinator, I am responsible for setting goals, allocating tasks and setting deadlines, cooperating with other research groups (for example, the coreference resolution project), and maintaining the project’s documentation.

As software developer, I am responsible for developing the rule-based module, which is currently running in test mode. I built an ontology of named entities, their synonyms and abbreviations that takes into account the rich morphology of Russian. I developed a module that indexes and tokenizes the input text, analyzes the features of each token, and extracts information about named entities and their attributives using high-precision rules. The module achieves 93% precision (evaluated by the Dialogue Evaluation Conference on 37 000 annotated texts).
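
To give a rough idea of what such a pipeline involves, here is a minimal Python sketch of the tokenize, analyze, extract steps. It is not the project's code: the gazetteer, the trigger words, and the names `Token`, `tokenize` and `extract_entities` are all hypothetical, and the sketch uses English examples and ignores morphology entirely, which a real system for Russian could not do.

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    index: int            # position of the token in the token sequence
    start: int            # character offset of the token in the text
    text: str             # surface form
    is_capitalized: bool  # simple token feature used by the rules


# Hypothetical toy gazetteer: lowercased surface form -> entity type.
GAZETTEER = {"moscow": "LOC", "msu": "ORG"}

# Hypothetical trigger words that may precede a person name.
PERSON_TRIGGERS = {"professor", "dr", "mr", "ms"}


def tokenize(text: str) -> List[Token]:
    """Split text into word tokens, recording index, offset and capitalization."""
    return [
        Token(i, m.start(), m.group(), m.group()[0].isupper())
        for i, m in enumerate(re.finditer(r"\w+", text))
    ]


def extract_entities(text: str) -> List[Tuple[str, str]]:
    """Apply two simple rules and return (surface form, entity type) pairs."""
    tokens = tokenize(text)
    entities = []
    for tok in tokens:
        lower = tok.text.lower()
        if lower in GAZETTEER:
            # Rule 1: exact gazetteer match.
            entities.append((tok.text, GAZETTEER[lower]))
        elif (
            tok.is_capitalized
            and tok.index > 0
            and tokens[tok.index - 1].text.lower() in PERSON_TRIGGERS
        ):
            # Rule 2: capitalized token directly after a person trigger.
            entities.append((tok.text, "PER"))
    return entities


print(extract_entities("Professor Ivanov works at MSU in Moscow."))
```

Rules of this kind only fire on narrow, explicit patterns, which is why rule-based modules tend to favour precision over recall.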


My interest in Machine Translation

My interest in Apertium projects

I am interested in working on an unreleased language pair, Sicilian-Spanish. As my coding challenge, I created a new language package, scn-spa, and added basic vocabulary to the Sicilian dictionary and translations to the Sicilian-Spanish dictionary. I have also started researching the structure of the Sicilian language: I have got in touch with contributors to the Sicilian Wikipedia and, thanks to spectei, I have also reached a computational linguist who studies in Munich and is a native speaker of Sicilian.

Proposal and work plan