User:Aha/GsocApplication
Name
Joanna Ruth
Contact information
E-mail: joannaruth1@gmail.com
Skype: joanna_ruth
IRC: Aha_
Why are you interested in machine translation?
Why are you interested in the Apertium project?
Why should Google and Apertium sponsor it?
Which of the published tasks are you interested in?
How and whom will it benefit in society?
Work plan
Some work has already been done for the Polish-Czech language pair. I consulted Jimmy O'Regan about the Polish monodix and learned that the inflection rules are already covered, so I should focus mainly on expanding the vocabulary.
I intend to use the IPI PAN "Frequency dictionary of contemporary Polish" as a source of hand-tagged Polish training corpora (around 500 000 words). It needs to be converted to the format used by apertium-tagger before supervised tagger training can be carried out. The morphological data might be retrieved from Morfologik (a morphological dictionary of Polish) and Morfo (a Czech morphological analyser).
Community Bonding Period
- set up work environment (installation and configuration)
- study Polish and Czech language rules thoroughly
- check what has already been done (study monodices from Czech-Slovenian and Polish-English language pairs)
- prepare a detailed list of morphological rules that are missing
- get monolingual and multilingual aligned corpora for further analysis (possibly from JRC Acquis)
- prepare a list of words sorted by frequency of occurrence for both dictionaries (to reach at least 80% coverage)
- learn to use dictionaries and tools (dixtools package, apertium-viewer, poliqarp) in practice
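The frequency-list step above can be sketched roughly as follows. This is only an illustration, assuming a plain-text corpus and a naive tokenizer; the function names `frequency_list` and `words_for_coverage` are hypothetical and not part of any Apertium tool.

```python
from collections import Counter
import re


def frequency_list(text):
    """Return (word, count) pairs sorted by descending frequency.

    Tokenization is deliberately naive: \\w+ matches Unicode letters
    (so Polish and Czech diacritics are kept), but also digits and "_".
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common()


def words_for_coverage(freq, target=0.8):
    """Pick the most frequent words until `target` of all tokens is covered."""
    total = sum(count for _, count in freq)
    if total == 0:
        return []
    covered = 0
    selected = []
    for word, count in freq:
        if covered / total >= target:
            break
        selected.append(word)
        covered += count
    return selected


if __name__ == "__main__":
    sample = "a a a a b b b c c d"  # toy corpus, not real data
    print(words_for_coverage(frequency_list(sample), target=0.8))
```

The selected words would then be the first candidates to add to each monodix. A real run would need a proper tokenizer and lemmatization, since both languages are highly inflected.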
Week1
- write test scripts (make use of the existing language-pair regression and corpus tests)
- add the missing closed-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries
Week2
- work on Polish monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list
Week3
- work on Czech monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list
Week4
- add other words that are left
Deliverable1: target coverage reached for both languages
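One way to check this deliverable is a naive coverage count over the output of the morphological analyser. The sketch below assumes the Apertium stream format, in which each lexical unit looks like `^form/analysis$` and an unknown word is marked with `*` (for example `^xyzzy/*xyzzy$`). The helper name `naive_coverage` is hypothetical, the sample input is illustrative rather than real analyser output, and escaped characters inside lexical units are not handled.

```python
import re

# One lexical unit: ^surface_form/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return (coverage, unknown_forms) for analyser output.

    Coverage is the share of lexical units whose analysis does not
    start with "*", i.e. the share of tokens known to the dictionary.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0, []
    unknown = [form for form, analyses in units if analyses.startswith("*")]
    return (len(units) - len(unknown)) / len(units), unknown


if __name__ == "__main__":
    # Illustrative input only, not output of a real dictionary.
    sample = "^dom/dom<n><sg>$ ^xyzzy/*xyzzy$ ^i/i<cnjcoo>$"
    coverage, unknown = naive_coverage(sample)
    print(f"coverage: {coverage:.1%}, unknown: {unknown}")
```

The list of unknown forms, sorted by frequency, shows which words to add next.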
Week5
- gather translational data with the use of parallel corpora
- add basic transfer rules for the purpose of testing, verify the tag definition files
- work on bilingual dictionary
Week6
- work further on bilingual dictionary
Week7
- prepare a list of word sequences that frequently appear together for both Polish and Czech (perhaps use frequent sets)
- add multiwords with translations to the dictionaries
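A first list of multiword candidates could come from simple bigram counting, as sketched below. This is a simpler stand-in for the frequent-set approach mentioned above, not an implementation of it. The function name `frequent_bigrams` and the `min_count` threshold are assumptions made for illustration.

```python
from collections import Counter
import re


def frequent_bigrams(sentences, min_count=2):
    """Count adjacent word pairs and keep those seen at least `min_count` times.

    Pairs are counted within a sentence only, so no pair spans a
    sentence boundary. Results are sorted by descending count.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        counts.update(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Toy sentences, not corpus data.
    sample = ["na przykład to działa", "to jest na przykład dobre", "na pewno"]
    for pair, n in frequent_bigrams(sample):
        print(" ".join(pair), n)
```

Raw counts favour pairs of very common words, so the candidates would still need manual review before going into the dictionaries.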
Week8
- bring the dictionaries to a consistent state (successful vocabulary tests)
Deliverable2: Bilingual dictionary completed
Week9
- obtain hand-tagged training corpora
- study the word order rules of Czech and Polish
- work on tag definition files
- carry out supervised tagger training (with retraining on untagged text corpora) for both languages
Week10
- work on transfer rules
Week11
- carry out thorough regression tests
- check dictionaries manually to spot possible errors
Week12
- clean up and evaluate the results
Project completed