User:RomanZegarski/GSoC2011 proposal

Latest revision as of 19:30, 7 April 2011

Apertium Summer of Code 2011 application:
Dictionary induction from wikis


Name

Roman Zegarski

Contact

Email: Roman.Zegarski@gmail.com

Skype: roman.zegarski

IRC: RomanZegarski (irc.freenode.net)

Phone number: +48 692827146


Why is it you are interested in machine translation?

Working on such an essential part of communication as language is very interesting to me. Finding similarities between languages and creating rules that make it possible to translate from one language to another is an intriguing process.

I am also fascinated by the possibilities of sharing knowledge. For me, machine translation is a kind of bridge that lets information pass regardless of the language in which it was created and the language known by the person retrieving it. Even if a translation isn't perfect, it gives access to knowledge that would otherwise be really hard to retrieve.

Why is it that you are interested in the Apertium project?

It is important to me that Apertium makes it possible to translate less popular languages. There are still not enough resources for them, and it is good to know that someone is taking care of them. Moreover, it is an open-source project, which allows people to contribute easily and to share their knowledge and interests with others. I am also impressed by the community: it is very dynamic, and I feel I can always count on a fast response.

Which of the published tasks are you interested in?

I am interested in the following project: “Dictionary induction from wikis”.

What do you plan to do?

I aim to improve the DBPedia extraction framework so that it can extract data from Wiktionaries. Mappings already exist for the German language, but they are far from perfect. I want to create an extensible way of retrieving data from Wiktionary pages, so that adding a new language is limited to adding the appropriate templates. I also want to create templates for the English and Polish languages, both for testing purposes and, in the second part of the project, for creating Apertium dictionaries.

The second part of the project will focus on improving dixtools. I would spend most of that time creating a module which uses the data retrieved from DBPedia to create dictionaries in the Apertium format.
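The idea that adding a language should only mean adding templates can be illustrated with a minimal sketch. It is written in plain Python rather than in the Scala-based DBPedia extraction framework, and everything in it is an assumption made for illustration: the `EXTRACTION_TEMPLATES` table, the `extract_translations` function, and the simplified wikitext patterns (`{{t|lang|word}}`-style templates for en.wiktionary and `* language: [[word]]` lines for pl.wiktionary).

```python
import re

# Hypothetical per-wiki extraction templates: supporting a new Wiktionary
# means adding one entry here, while the extraction code stays unchanged.
# Both patterns are simplified assumptions about the wikitext layout.
EXTRACTION_TEMPLATES = {
    # en.wiktionary style: * Polish: {{t+|pl|kot|m}}
    "en": re.compile(r"\{\{t\+?\|(?P<lang>[a-z-]+)\|(?P<word>[^|}]+)"),
    # pl.wiktionary style: * angielski: (1.1) [[cat]]
    "pl": re.compile(
        r"^\*\s*(?P<lang>[^:\n]+):[^\n]*?\[\[(?P<word>[^\]|\n]+)",
        re.MULTILINE,
    ),
}


def extract_translations(wikitext, wiki_lang):
    """Return (language, word) pairs found in one page's translation section."""
    pattern = EXTRACTION_TEMPLATES[wiki_lang]
    return [
        (match.group("lang").strip(), match.group("word").strip())
        for match in pattern.finditer(wikitext)
    ]


en_sample = "* Polish: {{t+|pl|kot|m}}\n* German: {{t|de|Katze|f}}"
pl_sample = "* angielski: (1.1) [[cat]]\n* niemiecki: (1.1) [[Katze]] {{f}}"

en_pairs = extract_translations(en_sample, "en")
pl_pairs = extract_translations(pl_sample, "pl")
```

Keeping the language-specific knowledge in a data table is the design choice the sketch is meant to show; the pl pattern only picks up the first linked word on each line, so real Wiktionary pages would need richer templates.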

Also, since OmegaWiki data retrieval is already implemented in dixtools, I want to gather data from OmegaWiki, compare it with the data obtained from DBPedia, and then create dictionary files covering most words.

Why Google and Apertium should sponsor it?

A new source of data will allow the Apertium project to improve its dictionaries continuously and will make it easier to create new ones. Since Wiktionary is still gaining content, it can be a useful source of information.

The new linguistic data would be published as Linked Data, so it would be accessible to a wider audience.


How and who it will benefit in society?

Users of Apertium will get more accurate translations.

Developers of Apertium will get a new source of constantly improving data, which can make the development of new dictionaries easier. Moreover, new languages can be added simply by creating new templates.

A wide range of DBPedia users will gain access to more linguistic data that can be used for their own purposes.

Work plan

Community Bonding Period

  • get more familiar with Apertium and its community
  • gather more information about DBPedia
  • become more familiar with Wiktionary templates
  • learn more about the ontologies used
  • read documentation related to the project


Week 1 - 3

  • improve the DBPedia extraction framework (coding in Scala)
  • create basic templates for en.wiktionary
  • create basic templates for pl.wiktionary

Week 4

  • expand the templates for en.wiktionary and pl.wiktionary


Deliverable #1:

  • improved DBPedia extraction framework able to retrieve data from Wiktionary.
  • templates for the English and Polish languages.

Week 5 - 6

  • create a dixtools module that retrieves data from DBPedia
  • start working on retrieving data from RDF
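Retrieving data from RDF could start from something like the following sketch, which reads a small subset of the N-Triples serialization with a regular expression. It is only an illustration: the `parse_ntriples` function, the `example.org` URIs, and the `partOfSpeech` and `translation` predicates are hypothetical, and a real module would more likely rely on a full RDF library and on the vocabulary actually emitted by the extraction framework.

```python
import re

# Matches "<subject> <predicate> <object-uri> ." and
# "<subject> <predicate> "literal"@lang ." lines only.
TRIPLE = re.compile(
    r'^<(?P<s>[^>]+)>\s+<(?P<p>[^>]+)>\s+'
    r'(?:<(?P<o_uri>[^>]+)>'
    r'|"(?P<o_lit>(?:[^"\\]|\\.)*)"(?:@(?P<lang>[A-Za-z-]+))?)'
    r'\s*\.\s*$'
)


def parse_ntriples(text):
    """Return (subject, predicate, object, language) tuples."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            continue  # unsupported form, e.g. blank nodes or typed literals
        obj = match.group("o_uri")
        if obj is None:
            obj = match.group("o_lit")
        triples.append((match.group("s"), match.group("p"), obj, match.group("lang")))
    return triples


# Hypothetical data: the URIs and predicates are made up for this example.
sample = """
<http://example.org/word/cat> <http://example.org/vocab#partOfSpeech> "noun" .
<http://example.org/word/cat> <http://example.org/vocab#translation> "kot"@pl .
"""

triples = parse_ntriples(sample)
```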

Week 7

  • finish work on retrieving data from RDF
  • create English and Polish monodix files as a proof of concept
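As a rough picture of what a proof-of-concept monodix could look like, the sketch below builds a tiny monolingual dictionary in the Apertium XML format from a list of lemmas. The `build_monodix` function and the single regular noun paradigm `house__n` are illustrative assumptions; real dictionaries need many paradigms, and Polish in particular needs far richer morphology than one suffix rule.

```python
import xml.etree.ElementTree as ET


def build_monodix(lemmas, alphabet, paradigm="house__n"):
    """Build a minimal monodix where every lemma uses one noun paradigm."""
    root = ET.Element("dictionary")
    ET.SubElement(root, "alphabet").text = alphabet

    # Symbol definitions used by the paradigm below.
    sdefs = ET.SubElement(root, "sdefs")
    for symbol in ("n", "sg", "pl"):
        ET.SubElement(sdefs, "sdef", n=symbol)

    # One paradigm: empty suffix gives the singular, "s" gives the plural.
    pardefs = ET.SubElement(root, "pardefs")
    pardef = ET.SubElement(pardefs, "pardef", n=paradigm)
    for suffix, number in (("", "sg"), ("s", "pl")):
        pair = ET.SubElement(ET.SubElement(pardef, "e"), "p")
        ET.SubElement(pair, "l").text = suffix
        right = ET.SubElement(pair, "r")
        ET.SubElement(right, "s", n="n")
        ET.SubElement(right, "s", n=number)

    # One entry per lemma, each pointing at the shared paradigm.
    section = ET.SubElement(root, "section", id="main", type="standard")
    for lemma in lemmas:
        entry = ET.SubElement(section, "e", lm=lemma)
        ET.SubElement(entry, "i").text = lemma
        ET.SubElement(entry, "par", n=paradigm)

    return ET.tostring(root, encoding="unicode")


monodix_xml = build_monodix(["house", "cat"], "abcdefghijklmnopqrstuvwxyz")
```

In practice the part of speech and paradigm for each lemma would have to come from the extracted Wiktionary data rather than from a fixed default.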

Week 8

  • create a bilingual dictionary
  • final improvements to the dixtools module
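A bilingual dictionary pairs a left-side lemma with a right-side lemma, each followed by its grammatical symbols. The sketch below produces such entries for an Apertium-style bidix section; the `build_bidix_section` function and the sample word pairs are hypothetical, and it assumes that each translation pair shares a single part-of-speech tag.

```python
import xml.etree.ElementTree as ET


def build_bidix_section(pairs):
    """Build a bidix <section> from (left, right, part_of_speech) tuples."""
    section = ET.Element("section", id="main", type="standard")
    for left, right, pos in pairs:
        pair = ET.SubElement(ET.SubElement(section, "e"), "p")
        for tag, lemma in (("l", left), ("r", right)):
            side = ET.SubElement(pair, tag)
            side.text = lemma                # the lemma itself
            ET.SubElement(side, "s", n=pos)  # followed by its symbol
    return section


# Hypothetical English-Polish pairs, as they might come from extracted data.
bidix_section = build_bidix_section([("cat", "kot", "n"), ("house", "dom", "n")])
bidix_xml = ET.tostring(bidix_section, encoding="unicode")
```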


Deliverable #2:

  • completed Apertium-dixtools module creating dictionaries from data extracted from DBPedia
  • Polish and English dictionaries created from DBPedia data


Week 9

  • update the existing OmegaWiki data retriever implemented in apertium-dixtools (if necessary)
  • retrieve dictionary data from OmegaWiki using the dixtools OmegaWiki data retriever

Week 10 - 11

  • find out whether data from OmegaWiki and DBPedia are complementary
  • merge the complementary data retrieved from OmegaWiki and DBPedia
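One simple way to check whether the two sources are complementary is to compare which lemmas each of them covers and then take the union of their translations. The sketch below assumes, purely for illustration, that both sources have already been reduced to a mapping from lemma to a set of translations; the `compare_and_merge` function and the sample data are hypothetical.

```python
def compare_and_merge(omegawiki, dbpedia):
    """Merge two lemma -> translations mappings and report their coverage."""
    merged = {}
    report = {"only_omegawiki": set(), "only_dbpedia": set(), "both": set()}
    for lemma in sorted(omegawiki.keys() | dbpedia.keys()):
        left = omegawiki.get(lemma, set())
        right = dbpedia.get(lemma, set())
        merged[lemma] = left | right  # union keeps every known translation
        if left and right:
            report["both"].add(lemma)
        elif left:
            report["only_omegawiki"].add(lemma)
        else:
            report["only_dbpedia"].add(lemma)
    return merged, report


# Hypothetical English -> Polish data from the two sources.
omegawiki_data = {"cat": {"kot"}, "house": {"dom"}}
dbpedia_data = {"cat": {"kot", "kotka"}, "dog": {"pies"}}

merged, report = compare_and_merge(omegawiki_data, dbpedia_data)
```

Lemmas that appear in only one source show how much that source adds; lemmas in both can be inspected for conflicting or additional translations before they go into the final dictionaries.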

Week 12

  • final amendments
  • create documentation for the project

Project completed:

  • dictionaries created
  • new features in dixtools
  • improved DBPedia extraction framework

The code will be tested throughout the whole project period.

Skills and qualifications

I am a final-year student at the Gdansk University of Technology in Poland (Information Technology, specialization: Distributed Applications and Internet Services). During my studies I have spent a lot of time improving my programming skills. I feel most confident in Java (JSE and JEE), C# and C/C++. I gained knowledge about intelligent information services (general information about ontologies and the Semantic Web) from courses in the previous semester.

I have spent some time on topics related to computational linguistics. During the past year I worked on a student project whose goal was to build a virtual student assistant (more precisely, a chatter-bot that generates its knowledge base from our university Moodle server). It still needs some work, but most of the application's functionality is working fine. Currently I am working on the development and implementation of a word sense disambiguation algorithm using WordNet.

About my experience: I have done some part-time work in C++ and C# on commercial projects, and I gained experience in Java at my university (both projects mentioned earlier are written in Java).

Summer plans

During the summer I could spend 30 hours a week or more on project development. I have spent the last few months dividing my time between my student responsibilities and work, so if I were chosen for the Apertium project it wouldn't be a problem for me to spend the required time on it. This semester I won't have any exams during the Summer of Code period, and I plan to stay in Gdansk during the summer, so I would be available all the time.