Difference between revisions of "User:Aha/GsocApplication"

Revision as of 16:03, 31 March 2010

Name

Joanna Ruth

Contact information

E-mail: joannaruth1@gmail.com
Skype: joanna_ruth
IRC: Aha_

Why are you interested in machine translation?

Why are you interested in the Apertium project?

Why should Google and Apertium sponsor it?

Which of the published tasks are you interested in?

How and whom will it benefit in society?

Work plan

Some work has already been done for the Polish-Czech language pair. I consulted Jimmy O'Regan about the Polish monodix and found out that the inflection rules are already covered, so I should focus mainly on expanding the vocabulary.

I intend to use the IPI PAN "Frequency dictionary of contemporary Polish" as a source of hand-tagged Polish training corpora (around 500,000 words). It needs, however, to be converted to the format used by apertium-tagger before supervised tagger training can be carried out. The morphological data might be retrieved from Morfologik (a morphological dictionary of Polish) and Morfo (a Czech morphological analyser).

Community Bonding Period

set up work environment (installation and configuration)
study Polish and Czech language rules thoroughly
check what has already been done (study monodices from Czech-Slovenian and Polish-English language pairs)
prepare a detailed list of morphological rules that are missing
get monolingual and multilingual aligned corpora for further analysis (possibly from JRC Acquis)
prepare a list of words sorted by frequency of occurrence for both dictionaries (to achieve at least 80% coverage)
learn to use dictionaries and tools (dixtools package, apertium-viewer, poliqarp) in practice
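
The frequency list and the 80% coverage target mentioned above could be checked with a small script. The following is only a minimal sketch: it assumes the corpus is already tokenized into a list of words, and the function names (frequency_list, words_needed_for_coverage, coverage) are hypothetical helpers, not part of any Apertium tool.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, count) pairs sorted by descending frequency."""
    counts = Counter(t.lower() for t in tokens)
    return counts.most_common()


def words_needed_for_coverage(freq_list, target=0.8):
    """Return the most frequent words whose counts together reach `target`
    (a fraction between 0 and 1) of all tokens in the corpus."""
    total = sum(count for _, count in freq_list)
    covered = 0
    selected = []
    for word, count in freq_list:
        # Stop as soon as the words selected so far reach the target.
        if total and covered / total >= target:
            break
        selected.append(word)
        covered += count
    return selected


def coverage(tokens, known_words):
    """Fraction of corpus tokens that appear in the set `known_words`."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t.lower() in known_words)
    return known / len(tokens)


if __name__ == "__main__":
    # Toy corpus standing in for a real tokenized text.
    corpus = "a a a a b b b c c d".split()
    needed = words_needed_for_coverage(frequency_list(corpus), target=0.8)
    print(needed, coverage(corpus, set(needed)))
```

In practice the set of known words would come from the monodix rather than from the corpus itself, so the same coverage function could be used to track progress week by week.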

Week1

write test scripts (make use of the existing language-pair regression and corpus tests)
add the missing closed-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries

Week2

work on Polish monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week3

work on Czech monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week4

add the remaining words

Deliverable1: desired coverage achieved for both languages

Week5

gather translational data with the use of parallel corpora
add basic transfer rules for the purpose of testing, verify the tag definition files
work on bilingual dictionary

Week6

work further on bilingual dictionary

Week7

prepare a list of word sequences that frequently appear together for both Polish and Czech (perhaps use frequent sets)
add multiwords with translations to the dictionaries
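
One simple way to find word sequences that frequently appear together is to count n-grams in a tokenized corpus. The sketch below is only an illustration of that idea, not a frequent-set mining algorithm; the function name frequent_ngrams and the min_count threshold are assumptions made for the example.

```python
from collections import Counter


def frequent_ngrams(tokens, n=2, min_count=2):
    """Return (ngram, count) pairs for all n-grams that occur at least
    `min_count` times, most frequent first."""
    # zip over n shifted copies of the token list yields consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    return [(g, c) for g, c in counts.most_common() if c >= min_count]


if __name__ == "__main__":
    # Made-up toy sentence; a real run would use a large monolingual corpus.
    tokens = "na przykład to jest na przykład test".split()
    for gram, count in frequent_ngrams(tokens):
        print(" ".join(gram), count)
```

The resulting candidates would still need manual review, since frequent sequences are not always true multiword units.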

Week8

bring the dictionaries to a consistent state (successful vocabulary tests)

Deliverable2: Bilingual dictionary completed

Week9

obtain hand-tagged training corpora
study the word order rules of Czech and Polish
work on tag definition files
carry out supervised tagger training (with retraining on untagged text corpora) for both languages

Week10

work on transfer rules

Week11

carry out thorough regression tests
check dictionaries manually to spot possible errors

Week12

clean up, evaluation of results

Project completed

Throughout the project, the quality of translations will be controlled by means of regression and vocabulary tests. I will consult on the work at every stage and report progress on a dedicated wiki page.

List your skills and give evidence of your qualifications

I am currently in the first year of a Master's programme in Computer Science at Gdansk University of Technology, Poland. I follow an Individual Studies Program and receive a scholarship for high academic achievement. During my studies I have taken (among others) the following subjects: algorithms and data structures, logic, programming (C/C++, Java, C#, Prolog), artificial intelligence, operating systems (shell scripting, regular expressions), methods of information representation (XML, DTD, XSLT), automata theory and formal languages (flex analyser, bison, yacc). In the last of these I implemented some simple lexical, syntactic and semantic analysers. I have not participated in an open-source project so far, but I have been involved in several research projects at my university concerning motion detection and tracking and hand gesture recognition. At present I am working on a speech interface for a smart medical services system that will enable the user to communicate using a 3D avatar.

I really enjoy learning languages and I consider myself good at it. I know Polish (mother tongue), English (Cambridge CAE Certificate), German (pre-intermediate level) and a little Croatian. Whenever I go abroad I always remember to take a language guide with me. Although I have never taken Czech lessons, thanks to its similarity to Polish I can understand it quite well. I think I can successfully build a translator for this language pair.