Difference between revisions of "User:RomanZegarski/GSoC2011 proposal"

Latest revision as of 19:30, 7 April 2011

Apertium Summer of Code 2011 application:
Dictionary induction from wikis

Name

Roman Zegarski

Contact

Email: Roman.Zegarski@gmail.com

Skype: roman.zegarski

IRC: RomanZegarski (irc.freenode.net)

Phone number: +48 692827146

Why is it you are interested in machine translation?

Working on such an essential part of communication as language is very interesting to me. Finding similarities between languages and creating rules that make it possible to translate from one language to another is an intriguing process.

I am also fascinated by the possibilities of sharing knowledge, and for me machine translation is a kind of bridge that allows information to pass regardless of the language in which it was created and the language known by the person retrieving it. Even if a translation isn't perfect, it gives access to knowledge that could otherwise be really hard to retrieve.

Why is it that you are interested in the Apertium project?

It is important to me that Apertium makes it possible to translate less popular languages. There are still not enough resources for them, and it is good to know that someone is taking care of them. Moreover, it is an open-source project, which allows people to contribute easily and share their knowledge and interests with others. I am also impressed by the community: it is very dynamic, and I feel I can always count on a fast response.

Which of the published tasks are you interested in?

I am interested in the following project: “Dictionary induction from wikis”.

What do you plan to do?

I aim to improve the DBPedia extraction framework so that it is able to extract data from Wiktionaries. Mappings already exist for the German language, but they are far from perfect. I want to create an extensible way of retrieving data from Wiktionary pages, so that adding a new language is limited to adding the appropriate templates. I also want to create templates for the English and Polish languages, both for testing purposes and, in the second part of the project, for creating Apertium dictionaries.

The second part of the project will focus on improving dixtools. I would spend most of that time creating a module that uses the data retrieved from DBPedia to create dictionaries in Apertium format.
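To make the idea of template-driven extraction more concrete, here is a minimal sketch in Python. It is only an illustration and not the DBPedia extraction framework itself (which is written in Scala). The names `TRANSLATION_PATTERNS` and `extract_translations` are hypothetical, and the regular expression assumes that translations on en.wiktionary appear in templates of the form `{{t|pl|pies|m}}` or `{{t+|pl|pies|m}}`; real pages contain more variants than this covers.

```python
import re

# Hypothetical per-edition configuration: one pattern per Wiktionary edition.
# Supporting another edition would mean adding another entry here, without
# touching the extraction function below.
TRANSLATION_PATTERNS = {
    # Assumed en.wiktionary translation templates: {{t|lang|word|...}},
    # {{t+|lang|word|...}} and {{t-|lang|word|...}}.
    "en": re.compile(r"\{\{t[+-]?\|(?P<lang>[a-z-]+)\|(?P<word>[^|}]+)"),
}


def extract_translations(wikitext, edition="en"):
    """Return (target_language, word) pairs found in one page's wikitext."""
    pattern = TRANSLATION_PATTERNS[edition]
    return [(m.group("lang"), m.group("word").strip())
            for m in pattern.finditer(wikitext)]


sample = "* Polish: {{t+|pl|pies|m}}\n* German: {{t|de|Hund|m}}"
print(extract_translations(sample))
```

Keeping the language-specific part in a small table is one way to get the extensibility described above: the extraction logic stays the same and only the patterns change per language.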

Also, since an implementation of OmegaWiki data retrieval already exists in dixtools, I want to gather data from OmegaWiki, compare it with the data obtained from DBPedia, and then create dictionary files covering most words.

Why Google and Apertium should sponsor it?

A new source of data will give the Apertium project the possibility of constantly improving its dictionaries and will make it easier to create new ones. As Wiktionary keeps gaining content, it can be a useful source of information.

The new linguistic data would be published as Linked Data, so it would be accessible to a wider audience.

How and who it will benefit in society?

Users of Apertium will get more accurate translations.

Developers of Apertium will get a new source of constantly improving data, which can make the development of new dictionaries easier. Moreover, new languages can be added simply by creating new templates.

A wide range of DBPedia users will gain access to more linguistic data that they can use for their own purposes.

Work plan

Community Bonding Period

get more familiar with Apertium and its community
learn more about DBPedia
become more familiar with Wiktionary templates
learn more about the ontologies used
read documentation related to the project

Week 1 - 3

improving the DBPedia extraction framework (coding in Scala)
creating basic templates for en.wiktionary
creating basic templates for pl.wiktionary

Week 4

expansion of templates for en.wiktionary and pl.wiktionary

Deliverable #1:

improved DBPedia extraction framework able to retrieve data from Wiktionary.
templates for the English and Polish languages.

Week 5 - 6

create module for dixtools retrieving data from DBPedia
start working on retrieving data from RDF

Week 7

finish work on retrieving data from RDF
create English and Polish monodix as 'proof of concept'

Week 8

create bilingual dictionary
final improvements in dixtools module

Deliverable #2:

completed Apertium-dixtools module creating dictionaries from data extracted from DBPedia
Polish and English dictionaries created from DBPedia data
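
As an illustration of what the dixtools module would ultimately have to produce, the sketch below builds bilingual dictionary entries in the Apertium `.dix` XML layout (`<e><p><l>…</l><r>…</r></p></e>`, with `<s n="…"/>` symbols for tags). It is written in Python for brevity, whereas dixtools itself is a Java project; the function names are hypothetical, and the input triples stand in for whatever the RDF retrieval step would return. Multiword lemmas and inflection paradigms are deliberately left out.

```python
import xml.etree.ElementTree as ET


def bidix_entry(left, right, pos):
    """Build one <e> element pairing a source lemma with a target lemma."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for tag, lemma in (("l", left), ("r", right)):
        side = ET.SubElement(pair, tag)
        side.text = lemma
        # Part-of-speech symbol, e.g. <s n="n"/> for a noun.
        ET.SubElement(side, "s", {"n": pos})
    return entry


def bidix_section(pairs):
    """Wrap (left, right, pos) triples in a <section> element."""
    section = ET.Element("section", {"id": "main", "type": "standard"})
    for left, right, pos in pairs:
        section.append(bidix_entry(left, right, pos))
    return section


# Example data only; real pairs would come from the extracted Wiktionary data.
section = bidix_section([("dog", "pies", "n"), ("cat", "kot", "n")])
print(ET.tostring(section, encoding="unicode"))
```

A monolingual dictionary additionally needs paradigm information for each lemma, which is why the work plan treats the monodix as a separate proof of concept.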

Week 9

update existing OmegaWiki data retriever implemented in apertium-dixtools (if necessary)
retrieve dictionary data from OmegaWiki using dixtools OmegaWiki data retriever

Week 10 - 11

determine whether data from OmegaWiki and DBPedia are complementary
merge complementary data retrieved from OmegaWiki and DBPedia
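
One simple way to check whether the two sources are complementary, and to merge them, is to treat each source as a set of word pairs. The sketch below assumes that both retrievers can be reduced to `(source_word, target_word)` tuples; the function name `merge_sources` and the sample data are hypothetical and only illustrate the comparison.

```python
def merge_sources(dbpedia_pairs, omegawiki_pairs):
    """Combine two collections of (source_word, target_word) pairs.

    Returns the merged set together with the pairs unique to each source,
    which gives a rough picture of how complementary the sources are.
    """
    dbpedia, omegawiki = set(dbpedia_pairs), set(omegawiki_pairs)
    return {
        "merged": dbpedia | omegawiki,
        "shared": dbpedia & omegawiki,
        "only_dbpedia": dbpedia - omegawiki,
        "only_omegawiki": omegawiki - dbpedia,
    }


# Example data only.
result = merge_sources(
    [("dog", "pies"), ("cat", "kot")],
    [("cat", "kot"), ("house", "dom")],
)
print(len(result["merged"]), "pairs after merging")
```

If the "only" sets are large compared with the shared set, the sources complement each other well and merging is worthwhile; if they are nearly empty, one source alone may be enough.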

Week 12

final amendments
create documentation for the project

Project completed:

dictionaries created
new features in dixtools
improved DBPedia extraction framework

Throughout the whole project period, the code will be tested.

Skills and qualifications

I am a final-year student at the Gdansk University of Technology in Poland (Information Technology, specialization: Distributed Applications and Internet Services). During my studies I have spent a lot of time improving my programming skills. I feel most confident in Java (JSE and JEE), C# and C/C++. I gained knowledge of intelligent information services (general information about ontologies and the Semantic Web) from courses in the previous semester.

I have spent some time on topics related to computational linguistics. During the past year I worked on a student project whose goal was to build a virtual student assistant: more precisely, a chatter-bot that generates its knowledge base from our university Moodle server. It still needs some work, but most of the application's functionality works fine. Currently I am working on the development and implementation of a word sense disambiguation algorithm using WordNet.

About my experience: I have done some part-time work in C++ and C# on commercial projects, and I gained experience in Java at my university (both projects mentioned above are written in Java).

Summer plans

During the summer I could spend 30 hours a week or more on project development. I have spent the last few months dividing my time between my student responsibilities and work, so if I were chosen for the Apertium project, it would not be a problem for me to spend the required time on it.

This semester I won't have any exams during the Summer of Code period, and I plan to stay in Gdansk for the summer, so I would be available the whole time.

@@ Line 1: / Line 1: @@
+[[Category:GSoC 2011 Student Proposals]]
+'''Apertium Summer of Code 2011 application:'''<br />
+Dictionary induction from wikis
-== Why is it you are interested in machine translation? ==
+== Name ==
+Roman Zegarski
+== Contact ==
-	Working on such essential part of communication as language is very interesting for me.  Finding similarities between languages and creating rules making possible to translate from one language to another is intriguing process.
+Email: Roman.Zegarski@gmail.com
-Also I am fascinated by possibilities of sharing knowledge and machine translation is for me kind of bridge that allows to pass information regardless of language in which was created and language known by person retrieving it. Even if translation isn't perfect it gives access to the knowledge with in other ways could be really hard to retrieve.
+Skype: roman.zegarski
+IRC: RomanZegarski (irc.freenode.net)
-== Why is it that you are interested in the Apertium project? ==
+Phone number: +48 692827146
-	It's important for me that Apertium is allows to translate less popular languages. There is still not enough resources for them and it is good to know that someone is taking care of them. Moreover it is an open-source project allows people to easy contribute and share they knowledge and interest with others. Also, I am impressed by the community. It's very dynamic and I feel like I always can count on fast response.
-== Which of the published tasks are you interested in? ==
+== Why is it you are interested in machine translation? ==
+Working on such essential part of communication as language is very interesting for me.  Finding similarities between languages and creating rules making possible to translate from one language to another one is intriguing process.
+Also I am fascinated by possibilities of sharing knowledge and machine translation is for me kind of bridge that allows to pass information regardless of language in which was created and language known by person retrieving it. Even if translation isn't perfect it gives access to the knowledge which in other ways could be really hard to retrieve.
-I am interested in project: “Dictionary inductions form wiki”.
+== Why is it that you are interested in the Apertium project? ==
+It is important for me that Apertium is allows to translate less popular languages. There are still not enough resources for them and it is good to know that someone is taking care of them. Moreover it is an open-source project allows people to easy contribute and share they knowledge and interest with others. Also, I am impressed by the community. It is very dynamic and I feel like I always can count on fast response.
+== Which of the published tasks are you interested in? ==
+I am interested in the following project: “Dictionary inductions from wiki”.
 ==  What do you plan to do? ==
+I aim to improve the DBPedia framework so that it is able to extract data from Wiktionaries. There are existing mappings for the German language, but they are far from perfect. I want to create an extensible way of retrieving data from Wiktionary pages (adding a new language should be limited to adding appropriate templates) . I also want to create templates for English and Polish languages (both for testing purposes and in the second part of the project, creating Apertium dictionaries).
+The second part of the project will be focused on improving dixtools. I would spend most of the time creating a module which uses data retrieved from DBPedia to create dictionaries in Apertium format.
+Also as the implementation of OmegaWiki data retrieval is in dixtools I want to gather data from OmegaWiki, compare it with the data obtained from DBPedia and then create dict files covering most words.
-	The idea is to generate new dictionaries with data obtained from DBPedia and OmegaWiki. To achieve this I will  use and (if it will be possible) improve the existing OmegaWiki data retriever and amend DBPedia extraction framework to be able to retrieve more data from Wiktionary. Then with this data source I would like to create a dixtools module able to retrieve data and create dictionaries for Apertium.
 == Why Google and Apertium should sponsor it? ==
-	New source of data will bring to Apertium project possibilities to constantly improve dictionaries and make it easier to create new ones.
+A new source of data will bring to the Apertium project the possibilities to constantly improve dictionaries and make it easier to create new ones. While Wiktionary is still gaining more content it can be a useful source of information.
-New linguistic data would be published as Linked Data, so they would be accessible to bigger publicity.
-Also, I will be able to compare data gathered by OmegaWiki, with data harvested from Wiktionary using DBPedia.
+New linguistic data would be published as Linked Data, so it would be accessible to bigger publicity.
+== How and who it will benefit in society?  ==
+Users of Apertium will get more accurate translation.
+Developers of Apertium will get a new source of constantly improving data which can make the development of new dictionaries easier. Moreover new languages can be easily added by simply creating new templates.
+A wide range of DBPedia users gain access to more linquistic data that can be used for their own purposes.
 == Work plan ==
+=== Community Bonding Period ===
+* get more familiar with Apertium and its community
+* retrieve more information about DBPedia
+* become more familiar with Wiktionary templates
+* get to know more about used ontologies
+* read documentation related to the project
-Community Bonding Period:
-get more familiar with Apertium community
-retrieve more information about DBPedia
-DBPedia mappings
-DBPedia ontology
-get to know GOLD ontology
-get to know the Scala language
-read documentation related to the project
 === Week 1 - 3 ===
-Improving DBPedia extraction framework
+* Improving DBPedia extraction framework (coding in Scala)
+* creating basic templates to en.wiktionary
-creating code in Scala, which could handle more languages
-creating basic templates to English Wiktionary
+* creating basic templates to pl.wiktionary
-Week 4:
+=== Week 4 ===
-expansion of templates for en.wiktionary
+* expansion of templates for en.wiktionary and pl.wiktionary
-create templates for pl.wiktionary
-'''Deliverable #1'''  ←    improved DBPedia extraction framework
+<u>'''Deliverable #1:'''  </u>
-=== Week 5 - 7 ===
+:* improved DBPedia extraction framework able to retrieve data from Wiktionary.
-create module for dixtools retrieving data from DBPedia
+:* templates for English and Polish language.
+=== Week 5 - 6 ===
+* create module for dixtools retrieving data from DBPedia
+* start working on retrieving data from RDF's
+=== Week 7 ===
+* finish work on retrieving data from RDF's
+* create English and Polish monodix as 'proof of concept'
 === Week 8 ===
+* create bilingual dictionary
-create dictionaries in Apertium format
+* final improvements in dixtools module
-'''Deliverable #2''' ← Aperitum-dixtools module creating dictionaries from data extracted from DBPedia
+<u>'''Deliverable #2:'''</u>
+:* completed Apertium-dixtools module creating dictionaries from data extracted from DBPedia
+:* Polish and English dictionaries created from DBPedia data
 === Week 9 ===
-improving existing OmegaWiki data retriever implemented in apertium-dixtools
+* update existing OmegaWiki data retriever implemented in apertium-dixtools (if necessary)
-retrieve dictionaries data from OmegaWiki using dixtools
+* retrieve dictionary data from OmegaWiki using dixtools OmegaWiki data retriever
 === Week 10 - 11 ===
-find if some data from OmegaWiki and DBPedia are complementary
+* find if some data from OmegaWiki and DBPedia are complementary
-merge complementary data retrieved from OmegaWiki and DBPedia
+* merge complementary data retrieved from OmegaWiki and DBPedia
 === Week 12 ===
-final amendments
+* final amendments
-creation of documentation for the project
+* create documentation for the project
-Project completed ← dictionaries created, new features in dixtools, improved DBPedia extraction framework
+<u>'''Project completed'''</u>
+During the whole project period, code will be tested
+:* dictionaries created
+:* new features in dixtools
+:* improved DBPedia extraction framework
 == Skills and qualifications ==
-	I am final year student on the Gdańsk University of Technology in Poland (Informatics, specialization - Distributed Applications and Internet Services).I have spent some time with topics related to computational linguistics. In
+I am a final year student of the Gdansk University of Technology in Poland (Information Technology, specialization - Distributed Applications and Internet Services). During my study period I spent a lot of time improving my programming skills. I feel most confident in Java (JSE and JEE), C# and C/C+. I gained knowledge about intelligent information services (general information about ontologies, semantic webs) from courses in the previous semester.
+I have spent some time with topics related to computational linguistics.
-the past year I worked on student project which target was to build virtual student assistant (precisely chatter-bot, generating base of knowledge from university moodle server. It still need some work, but most application functionality is working fine). Currently I am working on development and implementation of word sense disambiguation
+During the past year I worked on a student project which target was to build a virtual student assistant (precisely a chatter-bot, generating a base of knowledge from our university moodle server. It still needs some work, but most of the application's functionality is working fine). Currently I am working on the development and implementation of a word sense disambiguation algorithm using WordNet.
-algorithm using WordNet.
-About my experience: I have done some part time work in C++ and C# on commercial projects, and I am experienced in Java from university (both projects mentioned earlier are written in Java).
+About my experience: I have done some part time work in C++ and C# on commercial projects, and I gained experience in Java at my university (both projects mentioned earlier are written in Java).
 == Summer plans ==
-	In the summer time I could  spend 30 hours or more on developing project.  I spent the last few months sharing my time between my student responsibilities and work, so if I would participate in Apertium project it won't be a problem for me to spend required time on coding.
+During the summer time I could spend 30 hours a week or more on project development.  I spent the last few months sharing my time between my student responsibilities and work, so if I were chosen to the Apertium project it wouldn't be a problem for me to spend the required time on project.
-At this semester I won't have any exams in the Summer of Code time and I plan to stay in Gdańsk in the summer, so I would be available all the time.
+This semester I won't have any exams in the Summer of Code time and I plan to stay in Gdansk during the summer, so I would be available all the time.


