User:Oresta/GSoC Proposal
Latest revision as of 12:38, 9 April 2010
Contents
- 1 Name
- 2 Contact information
- 3 Abstract
- 4 Why is it you are interested in machine translation?
- 5 Why is it that you are interested in the Apertium project?
- 6 Why should Google and Apertium sponsor it ?
- 7 How and who will it benefit in society ?
- 8 Which of the published tasks are you interested in? What do you plan to do?
- 9 Work plan
- 10 List your skills and give evidence of your qualifications
- 11 List any non-Summer-of-Code plans you have for the Summer
Name
Oresta Tymchyshyn
Contact information
e-mail: oresta.tymchyshyn@gmail.com
IRC: Oresta in #apertium at irc.freenode.net
Cell phone: +38 067 37 97 836
Abstract
Both Polish and Ukrainian belong to the Slavic language group, which is not yet well represented in Apertium. There are no releases for Polish or Ukrainian, but a bilingual dictionary for this pair already exists, as well as a preliminary version of the Polish monolingual dictionary and a pre-preliminary version of the Ukrainian one. All dictionaries need to be expanded to at least 5,000 words, and paradigms and transfer rules need to be added.
Why is it you are interested in machine translation?
As a Computer Science student, I always tried to get NLP-related tasks for course projects. At first this was subconscious, but for the last four years I have been strongly interested in computational linguistics.
Machine translation is a part of NLP, but that is not the only reason for my interest. I often use machine translation to look up the meaning of unknown words, for example while translating part of the book by Adam Przepiórkowski “The IPI PAN Corpus: Preliminary version” (http://nlp.ipipan.waw.pl/~adamp/Papers/2004-corpus/) or the site of the European Summer School "Culture & Technology" (http://www.culingtec.uni-leipzig.de/ESU/, Ukrainian coming soon). So my own experience shows that MT is really useful.
Why is it that you are interested in the Apertium project?
Apertium is a great example of a fast-developing open source project. It is open for everyone to use and to develop.
In most MT systems, languages like Ukrainian get the least attention. But the Apertium community graciously accepted the idea of Polish-Ukrainian translation.
I like the atmosphere in the Apertium community, which I saw on IRC and the mailing list.
Why should Google and Apertium sponsor it ?
Polish and Ukrainian are closely related, since both belong to the group of Slavic languages.
Ukrainian is my native language. I am fluent in Polish, because I spent one year in Poland, where I studied at the University of Warsaw. I also completed a course of Polish for foreign students. So I am well able to work on this language pair.
New language pairs are one of the main priorities for Apertium. Slavic languages are not yet strongly represented in Apertium, so a new language pair consisting of two Slavic languages should be welcomed by the Apertium community.
Ukrainian is a low-resourced language, so this is a great chance to support it.
How and who will it benefit in society ?
Ukraine and Poland are neighboring countries with close cultural and economic relations. In Western Ukraine most Ukrainians, especially older people, understand Polish, but in the central and eastern parts of Ukraine people do not know Polish at all.
Each language has more than 40 mln native speakers. Polish is a West Slavic language and the official language of Poland. Its written standard is based on the Latin alphabet with a few additions. Ukrainian is an East Slavic language and the official language of Ukraine. Ukrainian is written using a modified version of the Cyrillic alphabet. Both languages are highly inflected.
There are very few computer applications that translate between Polish and Ukrainian.
The oldest and best known is Pragma, rule-based translation software developed in 2000 by Trident Software (Kyiv, Ukraine). Pragma extends the functionality of popular office and Internet applications by adding a translation function to them. Pragma is a closed-source, Windows-only application. It is used by government institutions in Ukraine, large companies and small businesses.
Google Translate is a free multilingual online translation service, known worldwide, for translating a section of text or a whole webpage. Unlike other translation services, which use SYSTRAN rule-based MT technology, Google uses its own translation software based on a statistical approach. Polish and Ukrainian were launched in May and September 2008 respectively.
But still no open source solutions exist :(.
Polish-Ukrainian machine translation is very topical in the context of the European Football Championship, which will take place in Ukraine and Poland in summer 2012. To meet the language needs of the championship, web resources are being created, for example http://www.eurolang2012.com/. EUROLANG is mostly a language textbook; sites like that could be potential users of Polish-Ukrainian MT.
Which of the published tasks are you interested in? What do you plan to do?
I am going to work on the project Apertium-pl-uk: machine translation between Polish and Ukrainian.
Work plan
First of all, I am not starting from scratch. A bilingual Polish-Ukrainian dictionary exists in Apertium format. I have already checked it with a mentor, Jimmy O’Regan. Of course, it needs to be expanded with high-frequency words, and some translations need correcting.
The most convenient way to build any kind of dictionary is to use a well-formed annotated corpus. That’s why for Polish I will use the Polish IPI PAN Corpus (250 mln words) - GNU GPL (http://korpus.pl/index.php?page=download) and the open source corpus manager Poliqarp - Creative Commons License (http://korpus.pl/index.php?page=poliqarp). The IPI PAN Corpus is POS- and morpho-syntactically tagged, so it provides all the information needed to build a Polish monolingual dictionary. The existing Polish monolingual dictionary will be helpful too, because it describes a lot of word inflections.
There is no open source corpus available for Ukrainian :(. That’s why I am going to use the Ukrainian Wikipedia to build a Ukrainian frequency list. I will also use the Aspell Ukrainian Dictionary (http://sourceforge.net/projects/ispell-uk). It is distributed under GNU GPL and contains approximately 100,000 lemmas. The Aspell Dictionary consists of two files:
- dictionary – includes lemmas and affix labels which describe the inflection rules for each lemma
- list of affixes – includes affixes for each type of label.
There is no explicit information about parts of speech in this dictionary, but inflection categories are described. Since different parts of speech have different inflection rules, POS information can be derived from this vocabulary. To sum up, converting the Aspell Dictionary into the Ukrainian monodix for Apertium is a realistic task.
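A minimal sketch of how the first step of such a conversion might look: reading dictionary lines and guessing a part of speech from the affix labels. It assumes entries of the form `lemma/flags` with single-character flags; the flag letters and the `FLAG_TO_POS` table are hypothetical and would have to be derived from the real affix file.

```python
from collections import defaultdict

# Hypothetical mapping from affix flags to a part-of-speech guess.
# The real flags must be read from the affix file of the dictionary.
FLAG_TO_POS = {"a": "n", "b": "n", "V": "adj", "A": "vblex"}


def parse_entry(line):
    """Split a 'lemma/flags' line into (lemma, set of one-character flags)."""
    lemma, _, flags = line.strip().partition("/")
    return lemma, set(flags)


def guess_pos(flags):
    """Return the sorted POS tags suggested by the flags (may be empty)."""
    return sorted({FLAG_TO_POS[f] for f in flags if f in FLAG_TO_POS})


def group_by_pos(lines):
    """Group lemmas by guessed POS; unresolved lemmas go under 'unknown'."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        lemma, flags = parse_entry(line)
        for pos in guess_pos(flags) or ["unknown"]:
            groups[pos].append(lemma)
    return dict(groups)
```

Lemmas without any recognised flag (typically uninflected closed-class words) end up in the `unknown` group and would need manual tagging.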
I intend to bring each dictionary up to at least 5,000 words.
Transcription rules also need to be defined, because the two languages use different alphabets.
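As a rough illustration of what a transcription step could do for words that are not in the dictionaries, here is a character-by-character sketch. The `CYR_TO_LAT` table is a deliberately incomplete, hypothetical example and not a proposed rule set; real rules would need to take context into account.

```python
# Illustrative and incomplete table (an assumption for this sketch only).
CYR_TO_LAT = {
    "а": "a", "в": "w", "д": "d", "е": "e", "и": "y",
    "ї": "ji", "к": "k", "о": "o", "с": "s",
}


def transliterate(text, table=CYR_TO_LAT):
    """Map each Cyrillic letter through the table, keeping capitalisation.

    Characters missing from the table are copied unchanged.
    """
    out = []
    for ch in text:
        lower = ch.lower()
        mapped = table.get(lower)
        if mapped is None:
            out.append(ch)                    # unknown character: keep as is
        elif ch != lower:
            out.append(mapped.capitalize())   # original letter was upper case
        else:
            out.append(mapped)
    return "".join(out)
```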
Community Bonding Period
Learning and comparing Polish and Ukrainian grammar;
Studying Apertium and its documentation;
Evaluating and correcting existing dictionaries.
Week 1
Building a frequency list for Ukrainian from Wikipedia.
Working on the Ukrainian morphological dictionary, focusing on closed-class words.
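The frequency list could be built along these lines. This is only a sketch: it assumes the Wikipedia text has already been extracted to plain text (one iterable of lines), and the tokenisation rule is a simplification.

```python
import re
from collections import Counter

# Runs of letters, allowing an apostrophe inside a word (common in Ukrainian).
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’ʼ][^\W\d_]+)*")


def frequency_list(lines):
    """Return (word, count) pairs sorted from most to least frequent."""
    counts = Counter()
    for line in lines:
        counts.update(tok.lower() for tok in TOKEN_RE.findall(line))
    return counts.most_common()
```

The top of the resulting list would indicate which words to add to the dictionaries first.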
Week 2
Creating paradigms for the Ukrainian morphological dictionary.
Starting to add open-class words.
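Once paradigms exist, open-class entries can be generated mechanically. The sketch below builds one monodix entry element with the standard library; the lemma, stem and paradigm name used in the usage note are hypothetical examples.

```python
import xml.etree.ElementTree as ET


def make_entry(lemma, stem, paradigm):
    """Build a monodix <e> entry: the stem in <i> plus a paradigm reference."""
    entry = ET.Element("e", {"lm": lemma})
    invariant = ET.SubElement(entry, "i")
    invariant.text = stem
    ET.SubElement(entry, "par", {"n": paradigm})
    # encoding="unicode" returns a str and keeps Cyrillic characters readable
    return ET.tostring(entry, encoding="unicode")
```

For example, `make_entry("мова", "мов", "мов/а__n")` would produce an entry for the noun "мова" that points to a paradigm named after a model word, assuming such a paradigm has been defined.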
Week 3
Continuing work on the Ukrainian morphological dictionary by adding open-class words.
Week 4
Working with the existing Polish morphological dictionary, adding closed-class words.
Week 5
Creating paradigms for the Polish morphological dictionary.
Starting to add open-class words.
Week 6
Adding open-class words to the Polish morphological dictionary.
Week 7
Synchronizing the bilingual and monolingual dictionaries.
After that, each dictionary should contain at least 5,000 words.
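One way to check that the dictionaries are in sync is to look for bilingual entries whose lemmas are missing from a monolingual dictionary. The sketch below assumes the lemmas have already been extracted from the dictionary files into plain Python collections; the function name is made up for this example.

```python
def missing_lemmas(bidix_pairs, pl_lemmas, uk_lemmas):
    """Return (missing_pl, missing_uk): lemmas used in the bilingual
    dictionary but absent from the corresponding monolingual dictionary."""
    pl_known, uk_known = set(pl_lemmas), set(uk_lemmas)
    missing_pl = sorted({pl for pl, _ in bidix_pairs if pl not in pl_known})
    missing_uk = sorted({uk for _, uk in bidix_pairs if uk not in uk_known})
    return missing_pl, missing_uk
```

Both returned lists being empty would mean every bilingual entry is covered on both sides.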
Week 8
Working on transfer rules.
Week 9
Continuing work on transfer rules.
Week 10
Testing and manual correction of the dictionaries.
Adding transcription rules.
Week 11
Continuing testing and manual correction of the dictionaries.
Week 12
Evaluating accuracy and documenting the project.
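The plan does not name an evaluation metric. One common choice for MT output is word error rate (WER), the word-level edit distance between the system output and a reference translation divided by the reference length. The sketch below shows it purely as an assumption about how accuracy might be measured.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```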
List your skills and give evidence of your qualifications
I am currently a 2nd-year PhD student in Computer Sciences at Lviv Polytechnic University.
I spent one year (2007-2008) in the International Visegrad Master Program at the University of Warsaw.
I attended the courses "Methods and Tools for Text Processing" (Prof. Janusz Stanisław Bień) and "Linguistic Engineering" (Dr. Adam Przepiórkowski). This made me familiar with various NLP techniques and corpus standards, such as text segmentation, HMM, PoS tagging, XML, TEI, Unicode, etc.
I successfully completed a course of Polish for foreign students (level B2).
I was partly involved in the Polish-Ukrainian parallel corpus project (http://www.domeczek.pl/~polukr/), where I worked on automatic sentence boundary detection.
I also participated in the session “Text corpora in the linguistic research” (University of Warsaw, Poland, June 2007) and the European Summer School "Culture & Technology" (Leipzig University, Germany, July 2009).
As I mentioned before, I have experience in Polish-Ukrainian translation. I translated into Ukrainian part of the book by Adam Przepiórkowski “The IPI PAN Corpus: Preliminary version” (http://nlp.ipipan.waw.pl/~adamp/Papers/2004-corpus/) and made the Ukrainian localization of Poliqarp (an open source corpus manager) using OmegaT, a free translation memory application.
List any non-Summer-of-Code plans you have for the Summer
GSoC is my only plan for the summer. I can work full time on the project.

