Difference between revisions of "User:Aha/GsocApplication"

From Apertium
Revision as of 21:57, 1 April 2010

Name

Joanna Ruth

Contact information

E-mail: joannaruth1@gmail.com
Skype: joanna_ruth
IRC: Aha_

Why are you interested in machine translation?

Before I took up Computer Science, I had thought about becoming a language teacher, as I have always enjoyed learning foreign languages and exploring cultures. Eventually my interest in programming and technology took the lead, but it turned out that I can still expand my knowledge of linguistics through Natural Language Processing. Machine translation, as a sub-field of NLP, makes it possible to dive deep into the grammar of a language and deal with it from a computational perspective. I really like the idea of automatic text translation, especially nowadays, when the Internet is growing so rapidly. It is impossible to translate all of that information manually, so MT is the only hope that is left.

Why are you interested in the Apertium project?

I'm strongly convinced that the only chance for a machine translation project to succeed is to carry it out as open source. Only within a multinational, motivated community like Apertium's is it possible to cover so many language pairs. The project supports both widely spoken and minority languages, and that makes it stand out from the crowd.

Why should Google and Apertium sponsor it?

Incorporating new language pairs is Apertium's top priority. Each newly introduced language increases the number of people using it. There is very little support for West Slavic languages in Apertium at present: none of the languages from this group appears in a language pair in the release or stable versions of the project. Polish-English and Czech-Slovenian are currently under development, but a lot remains to be done to make them work. Covering the Polish-Czech pair might also help the other two pairs work better.

Which of the published tasks are you interested in? What do you plan to do?

The project I'd like to work on is machine translation for the Polish-Czech language pair in Apertium.

Some work has already been done for this language pair. I consulted Jimmy O'Regan and found out that the inflection rules for Polish are already covered and that I should focus mainly on expanding the vocabulary. The Czech part is at more or less the same stage.

I intend to use the IPI PAN "Frequency dictionary of contemporary Polish" as a source of hand-tagged Polish training corpora (around 500 000 words). It needs, however, to be converted to the format used by apertium-tagger before supervised tagger training can be carried out. The morphological data might be retrieved from Morfologik (a morphological dictionary of Polish) and, perhaps, from Morfo (a Czech morphological analyser).

Czech, like Polish, is a West Slavic language, and consequently the two have a lot in common. Both are highly inflected (7 cases for nouns, pronouns, adjectives and numerals). The genders are almost the same (Polish has an additional masculine personal gender). Word order is more or less the same; however, Czech allows more freedom, which might be a bit problematic. Nevertheless, the translation should give good results thanks to the closeness of these two languages.

How and whom will it benefit in society?

Work plan

Community Bonding Period

  • set up work environment (installation and configuration)
  • study Polish and Czech language rules thoroughly
  • check what has already been done (study monodices from Czech-Slovenian and Polish-English language pairs)
  • get monolingual and multilingual aligned corpora for further analysis (possibly from JRC Acquis)
  • prepare a list of words sorted by frequency of occurrence for both dictionaries (to acquire at least 80% coverage)
  • learn to use dictionaries and tools (dixtools package, apertium-viewer, poliqarp) in practice
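
The frequency list and the 80% coverage target from the list above can be illustrated with a minimal Python sketch. The function names, the naive letter-based tokenisation and the sample sentence are assumptions made for illustration only; a real corpus would need a proper tokeniser, and coverage in Apertium is normally measured against the analyser rather than against raw word forms.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs sorted by descending count, then alphabetically."""
    # Assumption: tokens are lowercase runs of letters (digits and punctuation dropped).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def words_for_coverage(freq_list, target=0.8):
    """Return the shortest prefix of the frequency list whose counts reach `target`."""
    total = sum(count for _, count in freq_list)
    covered = 0
    selected = []
    for word, count in freq_list:
        # Stop as soon as the words picked so far cover the target share of tokens.
        if total and covered / total >= target:
            break
        selected.append(word)
        covered += count
    return selected


if __name__ == "__main__":
    sample = "to jest kot to jest pies to dom"
    freq = frequency_list(sample)
    print(freq)
    print(words_for_coverage(freq, target=0.8))
```

Adding words to the dictionaries in this order means that the most frequent forms are covered first, so the coverage figure rises fastest at the start of the work.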

Week1

  • write test scripts (make use of the existing language-pair regression and corpus tests)
  • add the missing closed-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries
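
A regression test script of the kind mentioned above might look roughly like the following Python sketch. It is not the existing Apertium regression test tooling: `run_regression` and the word-by-word `stub_translate` function are hypothetical stand-ins, and in practice the translator would be a call to the real Polish-Czech pipeline.

```python
def run_regression(cases, translate):
    """Run every (source, expected) pair through `translate`.

    Returns a list of (source, expected, actual) tuples for the failing cases.
    """
    failures = []
    for source, expected in cases:
        actual = translate(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures


# Hypothetical stand-in for the real translator: a tiny word-by-word lookup.
STUB_LEXICON = {"dom": "dům", "pies": "pes"}


def stub_translate(sentence):
    # Unknown words are marked with '*' so that they show up as failures.
    return " ".join(STUB_LEXICON.get(word, "*" + word) for word in sentence.split())


if __name__ == "__main__":
    cases = [("dom", "dům"), ("pies", "pes"), ("kot", "kočka")]
    for source, expected, actual in run_regression(cases, stub_translate):
        print(f"FAIL: {source!r}: expected {expected!r}, got {actual!r}")
```

Keeping the translator as a parameter lets the same test cases run against a stub during development and against the real language pair later.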

Week2

  • work on Polish monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week3

  • work on Czech monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week4

  • add the rest of the words

Deliverable1: Desirable coverage acquired for both languages

Week5

  • gather translational data with the use of parallel corpora
  • add basic transfer rules for the purpose of testing, verify the tag definition files
  • work on bilingual dictionary

Week6

  • work further on bilingual dictionary

Week7

  • prepare a list of word sequences that frequently appear together for both Polish and Czech (use the Apriori algorithm to find frequent sets)
  • add multiwords with translations to the dictionaries
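
One way the search for frequent word sequences could be sketched is shown below. It is an adaptation of the Apriori idea to contiguous word n-grams rather than the classic algorithm over unordered item sets: a sequence is counted only if both of its shorter sub-sequences were already frequent. The function name, the thresholds and the sample sentences are assumptions for illustration, and the support here is a raw occurrence count, not a per-sentence count.

```python
from collections import Counter


def frequent_ngrams(sentences, min_support=2, max_len=3):
    """Find contiguous word sequences (length 2..max_len) seen at least min_support times.

    `sentences` is a list of token lists. Returns a dict mapping each
    frequent n-gram (a tuple of words) to its occurrence count.
    """
    result = {}
    # Level 1: frequent single words, used only to prune longer candidates.
    counts = Counter((token,) for sentence in sentences for token in sentence)
    frequent = {gram: c for gram, c in counts.items() if c >= min_support}
    n = 1
    while frequent and n < max_len:
        n += 1
        counts = Counter()
        for sentence in sentences:
            for i in range(len(sentence) - n + 1):
                gram = tuple(sentence[i:i + n])
                # Apriori pruning: both (n-1)-word sub-sequences must be frequent.
                if gram[:-1] in frequent and gram[1:] in frequent:
                    counts[gram] += 1
        frequent = {gram: c for gram, c in counts.items() if c >= min_support}
        result.update(frequent)
    return result


if __name__ == "__main__":
    corpus = [
        ["na", "pewno", "tak"],
        ["na", "pewno", "nie"],
        ["na", "razie"],
    ]
    print(frequent_ngrams(corpus, min_support=2))
```

The resulting candidates would still need manual review, since a frequent sequence is not necessarily a multiword unit that deserves its own dictionary entry.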

Week8

  • bring the dictionaries to a consistent state (successful vocabulary tests)

Deliverable2: Bilingual dictionary completed

Week9

  • obtain hand-tagged training corpora
  • study the word order rules of Czech and Polish
  • work on tag definition files
  • carry out supervised tagger training (with retraining on untagged text corpora) for both languages

Week10

  • work on transfer rules

Week11

  • carry out thorough regression tests
  • check dictionaries manually to spot possible errors

Week12

  • clean up and evaluate the results

Project completed

Throughout the work, the quality of translations will be monitored by means of regression and vocabulary tests. I will seek feedback at every stage, and progress will be reported on a dedicated Wiki page.

List your skills and give evidence of your qualifications

I'm currently in the first year of a Master's programme in Computer Science at Gdansk University of Technology, Poland. I follow an Individual Studies Program and receive a scholarship for high academic achievement. During my studies I have done a lot of programming (mainly C/C++, Java and C#) and have attended, among others, courses in algorithms and data structures, logic, operating systems (shell scripting, regular expressions), data mining, and automata theory and formal languages. In the last of these I learned how a compiler works and generated simple lexical, syntactic and semantic analysers for Pascal and Ada using flex, bison and yacc. I also attended a course in artificial intelligence, where I learned about hidden Markov models and neural networks.

I haven't participated in an open-source project so far, but I have been involved in several research projects at my university concerning motion detection and tracking and hand gesture recognition. At present I'm working on a speech interface for a smart medical services system that will enable the user to communicate using a 3D avatar.

I have been an intern at the Speednet company for 1.5 years. During this time I was part of a team that developed an Electronic Health Card System, and I was responsible for the mobile part of the system. I became familiar with software localisation and used MT to automate translation between Polish and English.

I really enjoy learning languages and I consider myself good at it. I know Polish (mother tongue), English (Cambridge CAE certificate), German (pre-intermediate level) and a little Croatian. Whenever I go abroad, I always remember to take a language guide with me. Although I have never taken Czech lessons, thanks to its similarity to Polish I can understand it quite well. I think I can successfully build a translator for this language pair.

My non-Summer-of-Code plans for the Summer

I have no plans for the summer other than the GSoC program. I intended to apply for a job, but if my application is accepted I'll postpone that until the project is completed. The GSoC program begins before my academic year ends, so I would like to work on the project a bit longer than specified, perhaps until the end of August or even longer. During May and June I will have to combine my studies with work on the project; once my summer break starts in July I can focus on it fully.