User:Aha/GsocApplication

Revision as of 15:32, 31 March 2010 by Aha

Name

Joanna Ruth

Contact information

E-mail: joannaruth1@gmail.com
Skype: joanna_ruth
IRC: Aha_

Why are you interested in machine translation?

Why are you interested in the Apertium project?

Why should Google and Apertium sponsor it?

Which of the published tasks are you interested in?

How and whom will it benefit in society?

Work plan

Some work has already been done for the Polish-Czech language pair. I consulted Jimmy O'Regan about the Polish monodix and found out that the inflection rules are already covered, so I should focus mainly on expanding the vocabulary.

I intend to use the IPI PAN "Frequency dictionary of contemporary Polish" as a source of hand-tagged Polish training data (around 500 000 words). It must first be converted to the format used by apertium-tagger before supervised tagger training can be carried out. The morphological data can be retrieved from Morfologik (a morphological dictionary of Polish) and Morfo (a Czech morphological analyser).
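
The conversion step could look roughly like the sketch below. It assumes the hand-tagged corpus has already been read into (surface form, lemma, tags) triples; that input shape, the TAG_MAP table and the function names are hypothetical and only illustrate the idea. The output uses the Apertium stream notation ^surface/lemma<tag>$.

```python
# Hypothetical mapping from source-corpus tag names to Apertium-style tags.
# The real mapping would have to be worked out from both tagsets.
TAG_MAP = {"subst": "n", "adj": "adj", "fin": "vblex"}


def to_stream_unit(surface, lemma, tags, tag_map=TAG_MAP):
    """Render one disambiguated token as ^surface/lemma<tag1><tag2>$."""
    # Tags missing from the mapping are passed through unchanged.
    mapped = "".join(f"<{tag_map.get(tag, tag)}>" for tag in tags)
    return f"^{surface}/{lemma}{mapped}$"


def convert_sentence(tokens):
    """Join the stream units of one sentence with single spaces."""
    return " ".join(to_stream_unit(*token) for token in tokens)


if __name__ == "__main__":
    sentence = [
        ("domu", "dom", ["subst", "sg", "gen"]),
        ("nowego", "nowy", ["adj", "sg", "gen"]),
    ]
    print(convert_sentence(sentence))
```

Passing unknown tags through unchanged makes gaps in the mapping easy to spot in the converted output.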

Community Bonding Period

  • set up work environment (installation and configuration)
  • study Polish and Czech language rules thoroughly
  • check what has already been done (study monodices from Czech-Slovenian and Polish-English language pairs)
  • prepare a detailed list of morphological rules that are missing
  • get monolingual and multilingual aligned corpora for further analysis (possibly from JRC Acquis)
  • prepare a list of words sorted by frequency of occurrence for both dictionaries (to achieve at least 80% coverage)
  • learn to use dictionaries and tools (dixtools package, apertium-viewer, poliqarp) in practice
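
The frequency list and the 80% coverage target from the list above could be computed along these lines. This is a minimal sketch: it assumes plain-text corpora and a naive regular-expression tokeniser, and the function names are made up for illustration.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs sorted by descending frequency."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common()


def words_for_coverage(freq_list, target=0.8):
    """Pick the most frequent words until they cover `target` of all tokens."""
    total = sum(count for _, count in freq_list)
    covered = 0
    selected = []
    for word, count in freq_list:
        if covered >= target * total:
            break
        selected.append(word)
        covered += count
    return selected


if __name__ == "__main__":
    sample = "a a a a b b c d e f"
    print(words_for_coverage(frequency_list(sample), target=0.8))
```

Because word frequencies are heavily skewed, a fairly short list of the most frequent words usually accounts for most running text, which is why the dictionaries are filled in frequency order.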

Week1

  • write test scripts (make use of the existing language-pair regression and corpus tests)
  • add the missing closed-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries

Week2

  • work on Polish monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week3

  • work on Czech monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week4

  • add the remaining words

Deliverable1: desired coverage achieved for both languages

Week5

  • gather translational data with the use of parallel corpora
  • add basic transfer rules for the purpose of testing, verify the tag definition files
  • work on bilingual dictionary

Week6

  • work further on bilingual dictionary

Week7

  • prepare a list of word sequences that frequently appear together for both Polish and Czech (perhaps use frequent sets)
  • add multiwords with translations to the dictionaries
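
One simple way to collect multiword candidates is to count adjacent word pairs, as sketched below. This uses plain bigram counts as a stand-in for frequent-set mining; the whitespace tokenisation, the min_count threshold and the function name are assumptions made for illustration.

```python
from collections import Counter


def frequent_bigrams(sentences, min_count=2):
    """Return ((word1, word2), count) pairs seen at least `min_count` times."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair every token with its right-hand neighbour.
        counts.update(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    corpus = ["na pewno tak", "na pewno nie", "tak na razie"]
    for pair, n in frequent_bigrams(corpus):
        print(" ".join(pair), n)
```

The resulting candidates would still need manual review, since frequent pairs are not necessarily true multiword expressions.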

Week8

  • bring the dictionaries to a consistent state (successful vocabulary tests)

Deliverable2: Bilingual dictionary completed

Week9

  • obtain hand-tagged training corpora
  • study the word order rules of Czech and Polish
  • work on tag definition files
  • carry out supervised tagger training (with retraining on untagged text corpora) for both languages

Week10

  • work on transfer rules

Week11

  • carry out thorough regression tests
  • check dictionaries manually to spot possible errors

Week12

  • clean up and evaluate the results

Project completed