Difference between revisions of "User:RomanZegarski/GSoC2011 proposal"
Revision as of 18:27, 7 April 2011
Apertium Summer of Code 2011 application:
Dictionary induction from wikis
Contents
- 1 Name
- 2 Contact
- 3 Why is it you are interested in machine translation?
- 4 Why is it that you are interested in the Apertium project?
- 5 Which of the published tasks are you interested in?
- 6 What do you plan to do?
- 7 Why Google and Apertium should sponsor it?
- 8 How and who it will benefit in society?
- 9 Work plan
- 10 Skills and qualifications
- 11 Summer plans
Name
Roman Zegarski
Contact
Email: Roman.Zegarski@gmail.com
Skype: roman.zegarski
IRC: RomanZegarski (irc.freenode.net)
Phone number: +48 692827146
Why is it you are interested in machine translation?
Working on such an essential part of communication as language is very interesting to me. Finding similarities between languages and creating rules that make it possible to translate from one language to another is an intriguing process.
I am also fascinated by the possibilities of sharing knowledge. For me, machine translation is a kind of bridge that lets information pass regardless of the language in which it was created and the language known by the person retrieving it. Even if a translation isn't perfect, it gives access to knowledge which would otherwise be really hard to retrieve.
Why is it that you are interested in the Apertium project?
It is important to me that Apertium makes it possible to translate less popular languages. There are still not enough resources for them, and it is good to know that someone is taking care of them. Moreover, it is an open-source project, which allows people to contribute easily and share their knowledge and interests with others. I am also impressed by the community: it is very dynamic, and I feel I can always count on a fast response.
Which of the published tasks are you interested in?
I am interested in the following project: “Dictionary induction from wikis”.
What do you plan to do?
I aim to improve the DBPedia extraction framework so that it is able to extract data from Wiktionaries. There are existing mappings for the German language, but they are far from perfect. I want to create an extensible way of retrieving data from Wiktionary pages, so that adding a new language is limited to adding the appropriate templates. I also want to create templates for the English and Polish languages, both for testing purposes and, in the second part of the project, for creating Apertium dictionaries.
The second part of the project will focus on improving dixtools. I would spend most of that time creating a module which uses data retrieved from DBPedia to create dictionaries in the Apertium format.
Since dixtools already implements OmegaWiki data retrieval, I also want to gather data from OmegaWiki, compare it with the data obtained from DBPedia, and then create dictionary files covering most words.
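As a rough illustration of the "one template per language" idea, here is a minimal sketch in Python (the real framework is written in Scala, so this only shows the shape of the approach). The `TEMPLATES` table, the `extract_pos` function and the simplified heading patterns are all assumptions made for this example; they are not the actual DBPedia mappings, and real Wiktionary markup is more varied.

```python
import re

# Hypothetical per-language "templates": the patterns that mark the language
# section and the part-of-speech line in one Wiktionary edition.
# These are simplified assumptions, not the real DBPedia mappings.
TEMPLATES = {
    "en": {
        "language_section": re.compile(r"^==\s*English\s*==$"),
        "pos_heading": re.compile(r"^===\s*(Noun|Verb|Adjective)\s*===$"),
        "pos_tags": {"Noun": "n", "Verb": "vblex", "Adjective": "adj"},
    },
    "pl": {
        "language_section": re.compile(r"^==.*\{\{język polski\}\}.*==$"),
        "pos_heading": re.compile(r"^''(rzeczownik|czasownik|przymiotnik)"),
        "pos_tags": {"rzeczownik": "n", "czasownik": "vblex", "przymiotnik": "adj"},
    },
}


def extract_pos(title, wikitext, edition):
    """Return (lemma, pos_tag) pairs found in the edition's own-language section."""
    template = TEMPLATES[edition]
    in_section = False
    entries = []
    for line in wikitext.splitlines():
        line = line.strip()
        if template["language_section"].match(line):
            in_section = True
            continue
        # Any other level-2 heading ends the section we care about.
        if re.match(r"^==[^=].*==$", line):
            in_section = False
            continue
        if in_section:
            found = template["pos_heading"].match(line)
            if found:
                entries.append((title, template["pos_tags"][found.group(1)]))
    return entries
```

With this layout, supporting another edition would mean adding one more entry to `TEMPLATES` and leaving `extract_pos` untouched, which is the property the plan above is aiming for.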
Why Google and Apertium should sponsor it?
A new source of data will give the Apertium project the possibility of constantly improving its dictionaries and will make it easier to create new ones. While Wiktionary is still gaining more content it can become
New linguistic data would be published as Linked Data, so it would be accessible to a wider audience.
How and who it will benefit in society?
Users of Apertium will get more accurate translations.
Developers of Apertium will get a new source of constantly improving data, which can make the development of new dictionaries easier. Moreover, new languages can be added simply by creating new templates.
A wide range of DBPedia users will gain access to more linguistic data that they can use for their own purposes.
Work plan
Community Bonding Period
- get more familiar with Apertium and its community
- retrieve more information about DBPedia
- become more familiar with Wiktionary templates
- get to know more about the ontologies used
- read documentation related to the project
Week 1 - 3
- Improving DBPedia extraction framework (coding in Scala)
- creating basic templates for en.wiktionary
- creating basic templates for pl.wiktionary
Week 4
- expansion of templates for en.wiktionary and pl.wiktionary
Deliverable #1:
- improved DBPedia extraction framework able to retrieve data from Wiktionary.
- templates for the English and Polish languages.
Week 5 - 6
- create a dixtools module that retrieves data from DBPedia
- start working on retrieving data from RDF
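To make the RDF retrieval step more concrete, the following is a minimal sketch in Python of reading translation data from simple N-Triples lines. It is only an illustration: the names `parse_triples` and `translations` are made up for this example, the `example.org` URIs in the usage note are placeholders rather than the real DBPedia vocabulary, and the pattern deliberately ignores escaped quotes, datatyped literals and blank nodes. The actual dixtools module would be written in Java and would most likely rely on an RDF library.

```python
import re

# One simple N-Triples statement: <subject> <predicate> <uri> or "literal"@lang, then "."
# Simplification: no escaped quotes, no datatyped literals, no blank nodes.
TRIPLE = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+'
    r'(?:<(?P<o_uri>[^>]*)>|"(?P<o_lit>[^"]*)"(?:@(?P<lang>[A-Za-z-]+))?)\s*\.\s*$'
)


def parse_triples(text):
    """Yield (subject, predicate, object, language_tag) for each recognised line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = TRIPLE.match(line)
        if match is None:
            continue  # skip statements this simplified pattern does not cover
        obj = match.group("o_uri")
        if obj is None:
            obj = match.group("o_lit")
        yield match.group("s"), match.group("p"), obj, match.group("lang")


def translations(text, predicate):
    """Group language-tagged literals of `predicate` by subject and language."""
    result = {}
    for subject, pred, obj, lang in parse_triples(text):
        if pred == predicate and lang is not None:
            result.setdefault(subject, {}).setdefault(lang, []).append(obj)
    return result
```

For example, given the line `<http://example.org/word/house> <http://example.org/vocab#translation> "dom"@pl .`, calling `translations(text, "http://example.org/vocab#translation")` groups `dom` under the subject URI and the language tag `pl`.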
Week 7
- finish work on retrieving data from RDF
- create Polish monodix
- create English monodix
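A monodix could be generated from the extracted entries roughly as sketched below (in Python, for illustration only). The element names follow the Apertium dictionary format, but `build_monodix` is a hypothetical helper, the paradigm names are placeholders, and two simplifications are assumed: the whole lemma is used as the stem, and the paradigm definitions themselves are left empty, so the output would still need real `pardef` entries before it could be compiled.

```python
import xml.etree.ElementTree as ET


def build_monodix(entries, alphabet):
    """Build a skeleton monodix from (lemma, pos_tag, paradigm_name) triples."""
    entries = sorted(set(entries))  # drop duplicates, keep a stable order
    root = ET.Element("dictionary")
    ET.SubElement(root, "alphabet").text = alphabet

    # One symbol definition per part-of-speech tag that actually occurs.
    sdefs = ET.SubElement(root, "sdefs")
    for tag in sorted({pos for _, pos, _ in entries}):
        ET.SubElement(sdefs, "sdef", n=tag)

    # Assumption: paradigms are written elsewhere; only the container is emitted.
    ET.SubElement(root, "pardefs")

    section = ET.SubElement(root, "section", id="main", type="standard")
    for lemma, _pos, paradigm in entries:
        entry = ET.SubElement(section, "e", lm=lemma)
        ET.SubElement(entry, "i").text = lemma  # simplification: stem == lemma
        ET.SubElement(entry, "par", n=paradigm)
    return ET.tostring(root, encoding="unicode")
```

A call such as `build_monodix([("house", "n", "house__n")], "abcdefghijklmnopqrstuvwxyz")` returns the dictionary as an XML string with a single `<e lm="house">` entry.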
Week 8
- create bilingual dictionary
- final improvements in dixtools module
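The bilingual dictionary step can be sketched in the same way. The Python snippet below is an illustration under stated assumptions: `build_bidix` is a hypothetical helper, both sides of a pair are assumed to share one part-of-speech tag, and multi-word lemmas (which need special handling of spaces in Apertium dictionaries) are not covered.

```python
import xml.etree.ElementTree as ET


def build_bidix(pairs):
    """Build a skeleton bidix from (left_lemma, right_lemma, pos_tag) triples."""
    pairs = sorted(set(pairs))  # drop duplicates, keep a stable order
    root = ET.Element("dictionary")
    ET.SubElement(root, "alphabet")

    sdefs = ET.SubElement(root, "sdefs")
    for tag in sorted({pos for _, _, pos in pairs}):
        ET.SubElement(sdefs, "sdef", n=tag)

    section = ET.SubElement(root, "section", id="main", type="standard")
    for left, right, pos in pairs:
        pair = ET.SubElement(ET.SubElement(section, "e"), "p")
        # Each side carries the lemma followed by its part-of-speech symbol.
        for side, lemma in (("l", left), ("r", right)):
            node = ET.SubElement(pair, side)
            node.text = lemma
            ET.SubElement(node, "s", n=pos)
    return ET.tostring(root, encoding="unicode")
```

For the pair `("house", "dom", "n")` this produces one `<e>` entry whose `<l>` side holds `house` and whose `<r>` side holds `dom`, each followed by `<s n="n"/>`.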
Deliverable #2:
- completed Apertium-dixtools module creating dictionaries from data extracted from DBPedia
- Polish and English dictionaries created from DBPedia data
Week 9
- update existing OmegaWiki data retriever implemented in apertium-dixtools (if necessary)
- retrieve dictionary data from OmegaWiki using dixtools OmegaWiki data retriever
Week 10 - 11
- determine whether data from OmegaWiki and DBPedia are complementary
- merge complementary data retrieved from OmegaWiki and DBPedia
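The comparison and merge steps could look roughly like this, assuming both sources have already been reduced to sets of (source lemma, part-of-speech tag, target lemma) triples. That representation and the function names `compare_sources` and `merge_sources` are assumptions for this Python sketch, not part of dixtools.

```python
def compare_sources(dbpedia, omegawiki):
    """Split two sets of (source_lemma, pos_tag, target_lemma) triples by overlap."""
    return {
        "shared": dbpedia & omegawiki,
        "only_dbpedia": dbpedia - omegawiki,
        "only_omegawiki": omegawiki - dbpedia,
    }


def merge_sources(dbpedia, omegawiki):
    """Return the union of both sources, recording which source supports each entry."""
    merged = {}
    for name, source in (("dbpedia", dbpedia), ("omegawiki", omegawiki)):
        for entry in source:
            merged.setdefault(entry, set()).add(name)
    return merged
```

The sizes of `only_dbpedia` and `only_omegawiki` give a first measure of how complementary the two sources are, and the provenance sets returned by `merge_sources` make it possible to treat entries confirmed by both sources differently from entries found in only one.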
Week 12
- final amendments
- create documentation for the project
During the whole project period, code will be tested.
Project completed
- dictionaries created
- new features in dixtools
- improved DBPedia extraction framework
Skills and qualifications
I am a final-year student at the Gdansk University of Technology in Poland (Information Technology, specialization: Distributed Applications and Internet Services). During my studies I have spent a lot of time improving my programming skills. I feel most confident in Java (JSE and JEE), C# and C/C++. I gained knowledge about intelligent information services (general information about ontologies and the semantic web) from courses in the previous semester.
I have spent some time on topics related to computational linguistics. During the past year I worked on a student project whose goal was to build a virtual student assistant (more precisely, a chatterbot that generates its knowledge base from our university Moodle server). It still needs some work, but most of the application's functionality works fine. Currently I am working on the development and implementation of a word sense disambiguation algorithm using WordNet.
About my experience: I have done some part-time work in C++ and C# on commercial projects, and I gained experience in Java at my university (both projects mentioned earlier are written in Java).
Summer plans
During the summer I could spend 30 hours a week or more on project development. I have spent the last few months dividing my time between my studies and work, so if I were chosen for the Apertium project it wouldn't be a problem for me to spend the required time on the project.
This semester I won't have any exams during the Summer of Code period and I plan to stay in Gdansk for the summer, so I would be available all the time.