User:RomanZegarski/GSoC2011 proposal


Apertium Summer of Code 2011 application:
Dictionary induction from wikis


Name

Roman Zegarski

Contact

Email: Roman.Zegarski@gmail.com

Skype: roman.zegarski

IRC: RomanZegarski (irc.freenode.net)

Phone number: +48 692827146


Why is it you are interested in machine translation?

Working on such an essential part of communication as language is very interesting to me. Finding similarities between languages and creating rules that make it possible to translate from one language to another is an intriguing process.

I am also fascinated by the possibilities of sharing knowledge, and for me machine translation is a kind of bridge that allows information to be passed on regardless of the language in which it was created and the language known by the person retrieving it. Even if a translation isn't perfect, it gives access to knowledge that would otherwise be really hard to retrieve.

Why is it that you are interested in the Apertium project?

It is important to me that Apertium makes it possible to translate less popular languages. There are still not enough resources for them, and it is good to know that someone is taking care of them. Moreover, it is an open-source project, which allows people to contribute easily and share their knowledge and interests with others. I am also impressed by the community: it is very dynamic, and I feel I can always count on a fast response.

Which of the published tasks are you interested in?

I am interested in the following project: “Dictionary induction from wikis”.

What do you plan to do?

I aim to improve the DBPedia framework so that it is able to extract data from Wiktionaries. There are existing mappings for the German language, but they are far from perfect. I want to create an extensible way of retrieving data from Wiktionary pages (adding a new language should be limited to adding appropriate templates). I also want to create templates for the English and Polish languages (both for testing purposes and, in the second part of the project, for creating Apertium dictionaries). The second part of the project will be focused on improving dixtools. I would spend most of that time creating a module which uses data retrieved from DBPedia to create dictionaries in Apertium format.
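The template-driven idea described above could be sketched roughly as follows. This is only an illustration in Python, not the Scala code of the extraction framework. The names `TRANSLATION_PATTERNS` and `extract_translations` are hypothetical, and both patterns are simplified assumptions about how translation lines look in wikitext (the English one is modelled on the `{{t|lang|word}}` template family), not real mappings.

```python
import re

# Hypothetical per-wiki configuration: supporting a new Wiktionary edition
# should only require adding an entry here, not changing the extraction code.
# Every pattern must expose the named groups "lang" and "word".
TRANSLATION_PATTERNS = {
    # Simplified form of en.wiktionary translation templates, e.g. {{t|pl|kot|m}}
    "en": re.compile(r"\{\{t[+-]?\|(?P<lang>[a-z-]+)\|(?P<word>[^|}]+)"),
    # Assumed, simplified pl.wiktionary-style line, e.g. "* angielski: (1.1) [[cat]]"
    "pl": re.compile(
        r"^\*\s*(?P<lang>[^:\n]+):.*?\[\[(?P<word>[^\]|]+)", re.MULTILINE
    ),
}


def extract_translations(wiki, wikitext):
    """Return (language, word) pairs found in wikitext for the given wiki edition."""
    pattern = TRANSLATION_PATTERNS[wiki]
    return [
        (m.group("lang").strip(), m.group("word").strip())
        for m in pattern.finditer(wikitext)
    ]
```

For example, `extract_translations("en", "* Polish: {{t+|pl|kot|m}}")` yields one pair for the Polish translation. The extraction function never mentions a specific language; all language-specific knowledge sits in the pattern table.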

Also, as an implementation of OmegaWiki data retrieval already exists in dixtools, I want to gather data from OmegaWiki, compare it with the data obtained from DBPedia, and then create dictionary files covering most words.

Why should Google and Apertium sponsor it?

A new source of data will give the Apertium project the possibility to constantly improve its dictionaries and will make it easier to create new ones. As Wiktionary is still gaining content, it can be a useful source of information.

New linguistic data would be published as Linked Data, so it would be accessible to a wider audience.


How and whom will it benefit in society?

Users of Apertium will get more accurate translations.

Developers of Apertium will get a new source of constantly improving data which can make the development of new dictionaries easier. Moreover, new languages can be added easily by simply creating new templates.

A wide range of DBPedia users will gain access to more linguistic data that can be used for their own purposes.

Work plan

Community Bonding Period

  • get more familiar with Apertium and its community
  • retrieve more information about DBPedia
  • become more familiar with Wiktionary templates
  • learn more about the ontologies used
  • read documentation related to the project


Week 1 - 3

  • improving the DBPedia extraction framework (coding in Scala)
  • creating basic templates for en.wiktionary
  • creating basic templates for pl.wiktionary

Week 4

  • expansion of templates for en.wiktionary and pl.wiktionary


Deliverable #1:

  • improved DBPedia extraction framework able to retrieve data from Wiktionary
  • templates for the English and Polish languages

Week 5 - 6

  • create a module for dixtools that retrieves data from DBPedia
  • start working on retrieving data from RDF

Week 7

  • finish work on retrieving data from RDF
  • create English and Polish monodix as 'proof of concept'
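As a rough picture of what a 'proof of concept' monodix could look like, the sketch below writes (lemma, part of speech) pairs into the Apertium monolingual dictionary XML layout. It is written in Python for brevity, whereas dixtools itself is Java. The function name `build_monodix` and the `DEFAULT_PARADIGMS` table are hypothetical, the paradigm names are placeholders, and the sketch assumes single-word lemmas whose stem equals the lemma.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a part-of-speech tag to a paradigm name.
# Choosing the right paradigm for each lemma is the hard part in practice;
# these names are placeholders only.
DEFAULT_PARADIGMS = {"n": "house__n", "adj": "red__adj"}


def build_monodix(entries, alphabet):
    """Build monodix XML from (lemma, pos) pairs, e.g. taken from extracted data."""
    entries = sorted(set(entries))  # deduplicate and fix the output order
    root = ET.Element("dictionary")
    ET.SubElement(root, "alphabet").text = alphabet
    sdefs = ET.SubElement(root, "sdefs")
    for pos in sorted({pos for _, pos in entries}):
        ET.SubElement(sdefs, "sdef", n=pos)
    ET.SubElement(root, "pardefs")  # paradigm definitions are omitted here
    section = ET.SubElement(root, "section", id="main", type="standard")
    for lemma, pos in entries:
        e = ET.SubElement(section, "e", lm=lemma)
        ET.SubElement(e, "i").text = lemma  # assumes the stem is the whole lemma
        ET.SubElement(e, "par", n=DEFAULT_PARADIGMS[pos])
    return ET.tostring(root, encoding="unicode")
```

A call such as `build_monodix([("cat", "n"), ("dog", "n")], "abcdefghijklmnopqrstuvwxyz")` returns a string containing one `<e>` element per lemma inside the main section. A usable dictionary would still need the matching paradigm definitions.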

Week 8

  • create bilingual dictionary
  • final improvements in dixtools module


Deliverable #2:

  • completed Apertium-dixtools module creating dictionaries from data extracted from DBPedia
  • Polish and English dictionaries created from DBPedia data


Week 9

  • update existing OmegaWiki data retriever implemented in apertium-dixtools (if necessary)
  • retrieve dictionary data from OmegaWiki using dixtools OmegaWiki data retriever

Week 10 - 11

  • find out whether data from OmegaWiki and DBPedia are complementary
  • merge complementary data retrieved from OmegaWiki and DBPedia
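One possible way to think about the comparison and merge steps is sketched below in Python. It assumes, purely for illustration, that each source has already been reduced to a mapping from a source word to a set of candidate translations. The function names `merge_sources` and `complementary` are hypothetical and do not correspond to anything in dixtools.

```python
def merge_sources(omegawiki, dbpedia):
    """Union the translation pairs of both sources, recording where each came from.

    Each argument maps a source word to a set of candidate translations.
    The result maps (word, translation) to the set of source names providing it.
    """
    merged = {}
    for name, source in (("omegawiki", omegawiki), ("dbpedia", dbpedia)):
        for word, translations in source.items():
            for translation in translations:
                merged.setdefault((word, translation), set()).add(name)
    return merged


def complementary(merged):
    """Return the pairs found in exactly one source, mapped to that source's name."""
    return {
        pair: next(iter(sources))
        for pair, sources in merged.items()
        if len(sources) == 1
    }
```

Pairs confirmed by both sources could be treated as more reliable, while the pairs returned by `complementary` show what each source adds to the other.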

Week 12

  • final amendments
  • create documentation for the project

During the whole project period, code will be tested.

Project completed:

  • dictionaries created
  • new features in dixtools
  • improved DBPedia extraction framework

Skills and qualifications

I am a final-year student at the Gdansk University of Technology in Poland (Information Technology, specialization: Distributed Applications and Internet Services). During my studies I have spent a lot of time improving my programming skills. I feel most confident in Java (JSE and JEE), C# and C/C++. I gained knowledge about intelligent information services (general information about ontologies and the Semantic Web) from courses in the previous semester. I have also spent some time on topics related to computational linguistics. During the past year I worked on a student project whose goal was to build a virtual student assistant (more precisely, a chatter-bot that generates its knowledge base from our university Moodle server). It still needs some work, but most of the application's functionality is working fine. Currently I am working on the development and implementation of a word sense disambiguation algorithm using WordNet. As for my experience: I have done some part-time work in C++ and C# on commercial projects, and I gained experience in Java at my university (both projects mentioned earlier are written in Java).

Summer plans

During the summer I could spend 30 hours a week or more on project development. I have spent the last few months dividing my time between my studies and work, so if I were chosen for the Apertium project it wouldn't be a problem for me to spend the required time on the project. This semester I won't have any exams during the Summer of Code period, and I plan to stay in Gdansk during the summer, so I would be available the whole time.