User:Aha/GsocApplication

Name

Joanna Ruth

Contact information

E-mail: joannaruth1@gmail.com
Skype: joanna_ruth
IRC: Aha_

Why are you interested in machine translation?

Why are you interested in the Apertium project?

Why should Google and Apertium sponsor it?

Which of the published tasks are you interested in?

How and whom will it benefit in society?

Work plan

Some work has already been done for the Polish-Czech language pair. I consulted Jimmy O'Regan about the Polish monodix and learned that the inflection rules are already covered, so I should focus mainly on expanding the vocabulary.

I intend to use the IPI PAN "Frequency dictionary of contemporary Polish" as a source of hand-tagged Polish training corpora (around 500 000 words). However, it must be converted to the format used by apertium-tagger before supervised tagger training can be carried out. The morphological data might be retrieved from Morfologik (a morphological dictionary of Polish) and Morfo (a Czech morphological analyser).

Community Bonding Period

set up work environment (installation and configuration)
study Polish and Czech language rules thoroughly
check what has already been done (study monodices from Czech-Slovenian and Polish-English language pairs)
prepare a detailed list of morphological rules that are missing
get monolingual and multilingual aligned corpora for further analysis (possibly from JRC Acquis)
prepare a list of words sorted by frequency of occurrence for both dictionaries (to acquire at least 80% coverage)
learn to use dictionaries and tools (dixtools package, apertium-viewer, poliqarp) in practice
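
The frequency list and the 80% coverage target mentioned above could be prepared roughly as follows. This is only a minimal sketch: it assumes the corpus has already been tokenised into a list of words, and the function names `frequency_list` and `words_for_coverage` are hypothetical helpers, not part of any Apertium tool.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first (ties alphabetical)."""
    counts = Counter(token.lower() for token in tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def words_for_coverage(freq_list, target=0.80):
    """Return the shortest prefix of the frequency list whose counts
    add up to at least `target` (a fraction) of all tokens."""
    total = sum(count for _, count in freq_list)
    chosen, covered = [], 0
    for word, count in freq_list:
        if covered >= target * total:
            break
        chosen.append(word)
        covered += count
    return chosen


if __name__ == "__main__":
    # Toy input standing in for a real tokenised corpus.
    sample = "a a a a b b b c c d".split()
    print(words_for_coverage(frequency_list(sample)))
```

The words returned by `words_for_coverage` would be the ones to add to the monodix first; the remaining low-frequency words can be handled later.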

Week1

write test scripts (make use of the existing language-pair regression and corpus tests)
add the missing closed-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries

Week2

work on Polish monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week3

work on Czech monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week4

add the remaining words

Deliverable1: desirable coverage acquired for both languages
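
One way to check this deliverable is to run a corpus through the morphological analyser and count how many tokens received an analysis. The sketch below assumes the analyser output is in the Apertium stream format, where each lexical unit looks like `^surface/analysis$` and unknown words are marked with a leading asterisk (`^word/*word$`). The function name `coverage` and the tags in the sample are illustrative assumptions only, and escaped characters inside lexical units are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the fraction of lexical units that have at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An analysis starting with '*' means the word is not in the dictionary.
    known = sum(1 for _, analyses in units if not analyses.startswith("*"))
    return known / len(units)


if __name__ == "__main__":
    # Hypothetical analyser output; the tags are for illustration only.
    sample = "^dom/dom<n><sg>$ ^xyzzy/*xyzzy$ ^i/i<cnjcoo>$ ^kot/kot<n><sg>$"
    print(f"coverage: {coverage(sample):.2%}")
```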

Week5

gather translational data with the use of parallel corpora
add basic transfer rules for the purpose of testing, verify the tag definition files
work on bilingual dictionary

Week6

work further on bilingual dictionary

Week7

prepare a list of word sequences that frequently appear together for both Polish and Czech (perhaps use frequent sets)
add multiwords with translations to the dictionaries
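
As a starting point for finding multiword candidates, adjacent word pairs can simply be counted and filtered by a minimum frequency. This is a rough sketch rather than a full frequent-set miner: it assumes whitespace-tokenised sentences, and the function name `frequent_bigrams` and the `min_count` threshold are hypothetical choices.

```python
from collections import Counter


def frequent_bigrams(sentences, min_count=2):
    """Return ((word1, word2), count) pairs that occur at least
    `min_count` times, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair each token with its right-hand neighbour.
        counts.update(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Toy sentences standing in for a real corpus.
    sample = ["w tym roku", "w tym domu", "tym razem"]
    for pair, n in frequent_bigrams(sample):
        print(" ".join(pair), n)
```

The resulting list would still need manual review, since frequent pairs are not necessarily multiword units that require their own dictionary entries.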

Week8

bring the dictionaries to a consistent state (successful vocabulary tests)

Deliverable2: Bilingual dictionary completed

Week9

obtain hand-tagged training corpora
study the word order rules of Czech and Polish
work on tag definition files
carry out supervised tagger training (with retraining on untagged text corpora) for both languages

Week10

work on transfer rules

Week11

carry out thorough regression tests
check dictionaries manually to spot possible errors

Week12

clean up and evaluate the results

Project completed
