User:RomanZegarski/GSoC2011 proposal

From Apertium
Revision as of 11:04, 7 April 2011 by Jimregan (talk | contribs) (restore from history)

Why is it you are interested in machine translation?

Working on something as essential to communication as language is very interesting to me. Finding similarities between languages and creating rules that make it possible to translate from one language to another is an intriguing process. I am also fascinated by the possibilities of sharing knowledge, and machine translation is for me a kind of bridge that lets information pass regardless of the language in which it was created and the language known by the person retrieving it. Even if a translation isn't perfect, it gives access to knowledge which would otherwise be really hard to retrieve.


Why is it that you are interested in the Apertium project?

It is important to me that Apertium makes it possible to translate less popular languages. There are still not enough resources for them, and it is good to know that someone is taking care of them. Moreover, it is an open-source project, which allows people to contribute easily and share their knowledge and interests with others. I am also impressed by the community. It is very dynamic, and I feel I can always count on a fast response.


Which of the published tasks are you interested in?

I am interested in the project “Dictionary induction from wikis”.


What do you plan to do?

The idea is to generate new dictionaries from data obtained from DBPedia and OmegaWiki. To achieve this I will use and, where possible, improve the existing OmegaWiki data retriever, and extend the DBPedia extraction framework so that it can retrieve more data from Wiktionary. With this data source in place, I would like to create a dixtools module that retrieves the data and creates dictionaries for Apertium.
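As a rough illustration of the last step, the sketch below turns a list of translation pairs into a minimal Apertium bilingual dictionary. It is written in Python only to keep it short; the planned module would be part of apertium-dixtools. The function names `make_entry` and `build_bidix`, the sample words, and the assumption that every pair already carries a part-of-speech symbol are hypothetical. Multiword entries, which need `<b/>` between words, are not handled.

```python
import xml.etree.ElementTree as ET


def make_entry(left, right, pos):
    """Build one bilingual entry:
    <e><p><l>left<s n="pos"/></l><r>right<s n="pos"/></r></p></e>
    """
    e = ET.Element("e")
    p = ET.SubElement(e, "p")
    for tag, word in (("l", left), ("r", right)):
        side = ET.SubElement(p, tag)
        side.text = word
        ET.SubElement(side, "s", {"n": pos})
    return e


def build_bidix(pairs):
    """Return a bilingual dictionary as an XML string.

    `pairs` is an iterable of (left_word, right_word, pos_symbol) tuples.
    Duplicates are dropped and entries are sorted for stable output.
    """
    pairs = sorted(set(pairs))
    root = ET.Element("dictionary")

    # Declare every symbol that is used in the entries.
    sdefs = ET.SubElement(root, "sdefs")
    for symbol in sorted({pos for _, _, pos in pairs}):
        ET.SubElement(sdefs, "sdef", {"n": symbol})

    section = ET.SubElement(root, "section", {"id": "main", "type": "standard"})
    for left, right, pos in pairs:
        section.append(make_entry(left, right, pos))

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical English-Polish sample data.
    print(build_bidix([("dog", "pies", "n"), ("cat", "kot", "n")]))
```

In a real dictionary the part-of-speech symbols and any gender or number tags would have to come from the extracted data, which is why the quality of the Wiktionary templates matters.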

Why should Google and Apertium sponsor it?

A new source of data will give the Apertium project the possibility to continually improve existing dictionaries and will make it easier to create new ones. The new linguistic data would be published as Linked Data, so it would be accessible to a wider audience. I will also be able to compare the data gathered by OmegaWiki with the data harvested from Wiktionary using DBPedia.



Work plan

Community Bonding Period

get more familiar with the Apertium community
retrieve more information about DBPedia, DBPedia mappings and the DBPedia ontology
get to know the GOLD ontology
get to know the Scala language
read documentation related to the project

Week 1 - 3

improve the DBPedia extraction framework
write Scala code that can handle more languages
create basic templates for English Wiktionary

Week 4

expand the templates for en.wiktionary
create templates for pl.wiktionary

Deliverable #1 ← improved DBPedia extraction framework
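To give an idea of the kind of data the Wiktionary templates would need to capture, here is a minimal sketch that pulls translations out of a fragment of wikitext. It is a Python illustration, not the DBPedia extraction framework, which is written in Scala and has its own parser. It assumes that translations are marked up with the `t` and `t+` templates; the function name `extract_translations` and the sample wikitext are hypothetical.

```python
import re

# Matches the start of a translation template such as {{t|pl|pies|m}} or
# {{t+|de|Hund|m}} and captures the language code and the translated word.
TRANSLATION = re.compile(r"\{\{t[+-]?\|(?P<lang>[a-z-]+)\|(?P<word>[^|{}]+)")


def extract_translations(wikitext, target_lang):
    """Return the translations into `target_lang` found in `wikitext`,
    in the order in which they appear."""
    return [
        match.group("word").strip()
        for match in TRANSLATION.finditer(wikitext)
        if match.group("lang") == target_lang
    ]


if __name__ == "__main__":
    sample = (
        "* German: {{t+|de|Hund|m}}\n"
        "* Polish: {{t+|pl|pies|m}}, {{t|pl|psina|f}}\n"
    )
    print(extract_translations(sample, "pl"))
```

Each language edition of Wiktionary uses its own templates, which is why separate template definitions are planned for en.wiktionary and pl.wiktionary.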

Week 5 - 7

create module for dixtools retrieving data from DBPedia

Week 8

create dictionaries in Apertium format

Deliverable #2 ← apertium-dixtools module creating dictionaries from data extracted from DBPedia

Week 9

improve the existing OmegaWiki data retriever implemented in apertium-dixtools
retrieve dictionary data from OmegaWiki using dixtools

Week 10 - 11

find out whether some data from OmegaWiki and DBPedia are complementary
merge the complementary data retrieved from OmegaWiki and DBPedia
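A simple way to picture this comparison is as set operations on translation pairs. The sketch below is a Python illustration under the assumption that both sources have already been reduced to (source word, target word) pairs; the names `compare_sources` and `merge_sources` and the sample data are hypothetical.

```python
def compare_sources(omegawiki, dbpedia):
    """Split translation pairs into those found in both sources and those
    found in only one of them."""
    omegawiki, dbpedia = set(omegawiki), set(dbpedia)
    return {
        "both": omegawiki & dbpedia,
        "only_omegawiki": omegawiki - dbpedia,
        "only_dbpedia": dbpedia - omegawiki,
    }


def merge_sources(omegawiki, dbpedia):
    """Return the union of both sources as a sorted list without duplicates."""
    return sorted(set(omegawiki) | set(dbpedia))


if __name__ == "__main__":
    # Hypothetical English-Polish sample data.
    from_omegawiki = [("dog", "pies"), ("cat", "kot")]
    from_dbpedia = [("dog", "pies"), ("horse", "koń")]
    print(compare_sources(from_omegawiki, from_dbpedia))
    print(merge_sources(from_omegawiki, from_dbpedia))
```

Pairs that appear in only one source are the complementary data. Pairs that appear in both could be treated as more reliable, although that would have to be checked on real data.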

Week 12

final amendments
create documentation for the project

Project completed ← dictionaries created, new features in dixtools, improved DBPedia extraction framework


Skills and qualifications

I am a final-year student at the Gdańsk University of Technology in Poland (Informatics, specialization: Distributed Applications and Internet Services). I have spent some time on topics related to computational linguistics. In the past year I worked on a student project whose goal was to build a virtual student assistant (more precisely, a chatter-bot that generates its knowledge base from the university Moodle server; it still needs some work, but most of the application's functionality works fine). Currently I am working on the development and implementation of a word sense disambiguation algorithm using WordNet. As for my experience: I have done some part-time work in C++ and C# on commercial projects, and I am experienced in Java from university (both projects mentioned above are written in Java).


Summer plans

During the summer I could spend 30 hours or more on developing the project. For the last few months I have been dividing my time between my studies and work, so if I participate in the Apertium project, it won't be a problem for me to spend the required time on coding. This semester I won't have any exams during the Summer of Code period, and I plan to stay in Gdańsk for the summer, so I would be available the whole time.