Difference between revisions of "User:Aha/GsocApplication"
Latest revision as of 20:56, 9 April 2010
Contents
- 1 Name
- 2 Contact information
- 3 Why are you interested in machine translation?
- 4 Why are you interested in the Apertium project?
- 5 Why Google and Apertium should sponsor it?
- 6 How and who it will benefit in society?
- 7 Which of the published tasks are you interested in? What do you plan to do?
- 8 Work plan
- 9 List your skills and give evidence of your qualifications
- 10 My non-Summer-of-Code plans for the Summer
Name
Joanna Ruth
Contact information
E-mail: joannaruth1@gmail.com
Skype: joanna_ruth
IRC: Aha_ (irc.freenode.net)
Why are you interested in machine translation?
Before I took up Computer Science I had thought about becoming a language teacher, as I have always enjoyed learning foreign languages and exploring cultures. Eventually my interest in programming and technology took the lead, but it turned out that I can still expand my knowledge of linguistics through Natural Language Processing. Machine translation, a sub-field of NLP, makes it possible to explore the grammar of a language and deal with it from a computational perspective. I really like the idea of automatic text translation, especially nowadays, when the Internet is growing so rapidly that it is impossible to translate all the information manually. MT produces translations that are (at present) less accurate than those made by human translators, but in many cases sufficient. I can't wait to see people of different nations communicating with one another without needing to know each other's languages. I'm sure the future has it in store.
Why are you interested in the Apertium project?
I'm strongly convinced that the only chance for a machine translation project to be successful is to realize it through open source. Only within a multinational, motivated community like Apertium's is it possible to cover so many language pairs. The project supports both widely spoken and minority languages. In the age of globalization this is very important, as many languages are in danger of dying out.
Why Google and Apertium should sponsor it?
Incorporating new language pairs is Apertium's top priority. Each newly introduced language might significantly increase the number of people using Apertium. There is very little support for West Slavic languages in Apertium at present: no language from this group appears among the language pairs in the release or stable versions of the project. Polish-English and Czech-Slovenian are currently under development, but a lot remains to be done to make them work. Developing the Polish-Czech pair would help the other pairs mentioned to work better. Apertium has proven to be a very good platform for closely related languages like Polish and Czech; therefore I think bringing this pair to Apertium will be very beneficial and should give high-quality results.
How and who it will benefit in society?
A great number of people use the Internet as a primary source of information. Because of the language barrier, the data that might be of use to them is limited to what is available in the languages they speak. Introducing a Polish-Czech language pair in Apertium might help a lot in this respect. Polish and Czech are very close languages, and thanks to that Polish people can usually understand Czech (and vice versa). Misunderstandings nevertheless occur relatively often because of so-called false friends: words that sound or look similar but differ in meaning. Developing the Polish-Czech language pair might help to solve this problem. It will also bring other benefits: better software localisation and quicker text translation by human translators (who might use Apertium to obtain a preliminary translation).
Which of the published tasks are you interested in? What do you plan to do?
The project I'd like to work on is Polish-Czech language pair machine translation for Apertium.
Some work has already been done for this language pair. I consulted Jimmy O'Regan and found out that most inflection rules for Polish are already covered, so I should focus mainly on expanding the vocabulary. The Czech part is at more or less the same stage of development.
I intend to use the IPI PAN "Frequency dictionary of contemporary Polish" as a source of Polish hand-tagged training corpora (around 500 000 words). It needs, however, to be converted to the format used by apertium-tagger before supervised tagger training can be carried out. The morphological data might be retrieved from Morfologik (a morphological dictionary of Polish) and from Morfo (a Czech morphological analyzer).
Czech, like Polish, is a West Slavic language, and consequently they have a lot in common. Both are highly inflected languages with 7 cases for nouns, pronouns, adjectives and numerals. Genders are almost the same; however, Polish has an additional personal masculine gender. Word order is more or less the same, but Czech allows more freedom, which may pose a challenge in translation. Nevertheless, the translation can be expected to give good results due to the closeness of the two languages.
I have already become quite familiar with the Apertium framework. I added some words and paradigms to the dictionaries and updated the pending tests for the Polish-Czech language pair.
Currently the Polish monodix contains 199 paradigms/518 lemmas and the Czech monodix contains 216 paradigms/1148 lemmas. There are 502 entries in the bilingual dictionary, but only around 50-60 words can be translated correctly in each direction.
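Paradigm and lemma counts like these can be obtained with a short script. The sketch below is only an illustration: it assumes the usual .dix layout, in which paradigms are pardef elements under pardefs and lemmas are e elements carrying an lm attribute inside a section. The sample dictionary and the function name count_dix are invented for the example and are not part of Apertium.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a monolingual dictionary; the entries are invented for illustration.
SAMPLE_DIX = """<dictionary>
  <pardefs>
    <pardef n="dom__n"><e><p><l/><r><s n="n"/></r></p></e></pardef>
    <pardef n="kot__n"><e><p><l/><r><s n="n"/></r></p></e></pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="dom"><i>dom</i><par n="dom__n"/></e>
    <e lm="kot"><i>kot</i><par n="kot__n"/></e>
    <e lm="pies"><i>pies</i><par n="kot__n"/></e>
  </section>
</dictionary>"""


def count_dix(root):
    """Return (number of paradigms, number of distinct lemmas) in a parsed dictionary."""
    paradigms = len(root.findall("./pardefs/pardef"))
    # Only entries directly inside a section count as lemmas;
    # the <e> elements nested in paradigm definitions are skipped.
    lemmas = len({e.get("lm") for e in root.findall("./section/e") if e.get("lm")})
    return paradigms, lemmas


paradigms, lemmas = count_dix(ET.fromstring(SAMPLE_DIX))
print(f"{paradigms} paradigms / {lemmas} lemmas")
```

For a real dictionary file, ET.parse(path).getroot() could be passed to count_dix instead of the sample string. Lemmas are counted as distinct lm values here, so a lemma listed under several paradigms is counted once; whether the figures above were counted that way is not stated.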
Work plan
Community Bonding Period
- set up the work environment (installation and configuration)
- study Polish and Czech language rules thoroughly
- check what has already been done (study the monodices from the Czech-Slovenian and Polish-English language pairs)
- get monolingual and multilingual aligned corpora for further analysis (possibly from JRC Acquis)
- prepare a list of words sorted by frequency of occurrence for both dictionaries (to reach at least 80% coverage)
- learn to use dictionaries and tools (dixtools package, apertium-viewer, poliqarp) in practice
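The frequency-list step can be sketched as follows. This is a minimal illustration, assuming the corpus is already tokenized into a list of word forms; the function names frequency_list and words_needed_for_coverage and the toy token list are hypothetical and only show how a frequency list relates to a coverage target such as 80%.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()


def words_needed_for_coverage(freq_list, target=0.8):
    """Return the most frequent words that together cover `target` of all tokens."""
    total = sum(count for _, count in freq_list)
    covered = 0
    selected = []
    for word, count in freq_list:
        # Stop as soon as the words chosen so far reach the target share.
        if total == 0 or covered / total >= target:
            break
        selected.append(word)
        covered += count
    return selected


# Toy corpus: 10 tokens, 4 distinct words.
tokens = "a a a a b b b c c d".split()
print(words_needed_for_coverage(frequency_list(tokens), target=0.8))
```

On the toy corpus the three most frequent words cover 9 of the 10 tokens, so the rarest word is not needed to pass the 80% mark. On real text the same loop indicates roughly how many dictionary entries are required for a given coverage.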
Week1
- write test scripts (make use of the existing language-pair regression and corpus tests)
- add the missing closed-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries
Week2
- work on the Polish monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list
Week3
- work on the Czech monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list
Week4
- add the rest of the words
Deliverable1: Desired coverage reached for both languages
Week5
- gather translation data using parallel corpora
- add basic transfer rules for testing purposes, verify the tag definition files
- work on the bilingual dictionary
Week6
- work further on the bilingual dictionary
- update the Polish-Czech page of the "False Friends of the Slavist" wikibook
Week7
- prepare a list of word sequences that frequently appear together for both Polish and Czech (use the Apriori algorithm to find frequent sets)
- add multiwords with translations to the dictionaries
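One way the multiword search in Week7 could look is sketched below. It is an adaptation rather than the classic Apriori algorithm over unordered item sets: it works on contiguous word sequences and borrows only the Apriori pruning idea, namely that a sequence of length k can be frequent only if both of its sub-sequences of length k-1 are frequent. The function name frequent_sequences, the thresholds and the sample sentences are assumptions made for the example.

```python
from collections import Counter


def frequent_sequences(sentences, min_support=2, max_len=3):
    """Return {word tuple: count} for contiguous sequences of 2..max_len words
    that occur at least `min_support` times in the tokenized sentences."""
    # Level 1: frequent single words.
    counts = Counter((w,) for s in sentences for w in s)
    frequent = {seq for seq, c in counts.items() if c >= min_support}
    result = {}
    k = 1
    while frequent and k < max_len:
        k += 1
        counts = Counter()
        for s in sentences:
            for i in range(len(s) - k + 1):
                cand = tuple(s[i:i + k])
                # Apriori-style pruning: count a candidate only if both of its
                # (k-1)-word sub-sequences were frequent at the previous level.
                if cand[:-1] in frequent and cand[1:] in frequent:
                    counts[cand] += 1
        frequent = {seq for seq, c in counts.items() if c >= min_support}
        result.update({seq: counts[seq] for seq in frequent})
    return result


sentences = [
    "na pewno to wiem".split(),
    "to jest na pewno dobre".split(),
    "na razie nie".split(),
]
print(frequent_sequences(sentences))
```

Support is counted per occurrence here, not per sentence. On real corpora the threshold would have to be much higher, and the resulting candidates would still need manual review before being added to the dictionaries as multiwords.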
Week8
- bring the dictionaries to a consistent state (successful vocabulary tests)
Deliverable2: Bilingual dictionary completed
Week9
- obtain hand-tagged training corpora
- study the word order rules of Czech and Polish (identify restrictions)
- work on tag definition files
- carry out supervised tagger training (with retraining on untagged text corpora) for both languages
Week10
- extract segments of the parallel corpora that are translated (more or less) literally
- work on transfer rules
Week11
- carry out thorough regression tests
- check dictionaries manually to spot possible errors
Week12
- clean up, evaluation of results
Project completed
Throughout the work the quality of translations will be monitored by means of regression and vocabulary tests. The work will be discussed at every stage and progress will be reported on a dedicated wiki page.
List your skills and give evidence of your qualifications
I'm currently a first-year student in the Master's programme in Computer Science at Gdansk University of Technology, Poland. I follow an Individual Studies Program and have received a scholarship for high academic achievement. During my previous studies I did a lot of programming, mainly in C/C++, Java and C#. I have also completed courses in algorithms and data structures, logic, operating systems (shell scripting, regular expressions), data mining, automata theory and formal languages. I have learned how a compiler works and how to generate simple lexical, syntactic and semantic analyzers for the Pascal and Ada languages using flex, bison and yacc. I also completed a course in artificial intelligence, where I learned about hidden Markov models and neural networks.
So far I haven't participated in an open-source project, but I've been involved in several research projects at my university concerning motion detection and tracking and hand gesture recognition. At present I'm working on a speech interface for a smart medical services system that will enable the user to communicate using a 3D avatar.
I have been working as an intern at the Speednet company (http://www.speednet.pl/home_en.htm) for 1.5 years. During that time I was part of a team that developed an Electronic Health Card System. I was responsible for the mobile part of the system, written in .NET Compact Framework. I became familiar with software localisation and used MT to automate translation between Polish and English. Apart from that, I learned how to use TortoiseSVN and MantisBT.
In my projects I use the PostgreSQL and Microsoft SQL Server DBMSes. Recently I also started a course in Oracle. I know .NET technologies (Windows Forms, Windows Forms CE, WPF, WCF, Silverlight) and the basics of JEE (servlets, JSP/JSF, Facelets, JPA, JAAS, JMS). I'm also familiar with distributed and parallel programming concepts.
I really enjoy learning languages and I consider myself good at it. I know Polish (mother tongue), English (Cambridge CAE Certificate), German (pre-intermediate level) and some basic Croatian. Whenever I go abroad I always remember to take a language guide with me. Although I have never taken Czech lessons, I can understand the language quite well because of its similarity to Polish. I strongly believe I can successfully build a translator for this language pair.
My non-Summer-of-Code plans for the Summer
I have no other plans for the summer than the GSoC program. I intended to apply for a job, but if my application is accepted I'll postpone that until the project is completed. The GSoC program begins before my academic year ends, so I would like to work on the project a bit longer than specified, perhaps until the end of August or even longer. During May and June I will have to combine my studies with developing the project, and then I can fully focus on it when my summer break starts in July. I'm sure there won't be any problems with studying and working on the GSoC project simultaneously, as I have already managed to work during 3 semesters of my studies.