User:Fklubicka/Application2013-ces-hbs


GSoC application: apertium sh-sl, implementation of a new language pair

Name: Filip Klubička

E-mail address: fklubicka@gmail.com

Other contact information: fklubicka on IRC: #apertium

Facebook

Why is it you are interested in machine translation?

As a student of both English and Information science, with interests in translation and natural language processing in each respective field, I feel that machine translation is a perfect blend of both. I am a big fan of language and I enjoy studying it from a linguistic perspective. I also feel that using ICT, whether to help people understand each other better or simply to search for information, is the key to a simpler day-to-day life. I am immensely interested in the field of machine translation and would like to delve deeper into it and learn more about it.

Why is it that you are interested in the Apertium project?

As a native speaker of Croatian, a small and marginal, yet linguistically quite interesting language, I like the focus Apertium puts on languages like mine. There are not many language tools developed for Croatian and other similar languages, and those that do exist are seriously lacking in quality.

I also like the idea of Apertium being open-source – perhaps it seems like the only way a project aimed at marginalized languages could work at all, but I am sure it is the right way to do it – it gives people who genuinely care about their language the opportunity to help develop a tool all its speakers might use and that way keep it fresh and breathing.

Which of the published tasks are you interested in and why? What do you plan to do and why should Google and Apertium sponsor it - who and how will it benefit in society?

I have opted for the ces-hbs combination because the languages are grammatically close, which makes them well suited to a rule-based MT pair - a practically useful result can be obtained with relatively uncomplicated rules. What is important is that this particular language combination really can benefit someone - I can vouch for this first-hand, as I come from a town where 20% of the population belongs to the Czech minority. I am thus quite familiar with the language and the language barriers people face every day. The use for such a translation system is quite obvious, not to mention that putting such languages to the fore, rather than neglecting them, helps keep them from being consumed by other, more prominent languages.

Education, skills and qualification:

As for my education and experience, I am an undergraduate student at the Faculty of Humanities and Social Sciences, University of Zagreb and, as already mentioned, I am majoring in English language and literature and Information science. The former has given me, alongside practical language proficiency, a strong theoretical grounding in linguistics, while the latter has introduced me to the world of natural language processing. During my ICT courses I have been taught about machine translation, language engineering and formal languages, and have come into contact with Pascal, Delphi, SQL, ActionScript, JavaScript, HTML, CSS, XML and, most notably, Python, with which I have become very familiar.

My experience with open-source development began with Apertium some two months ago, when I started committing changes and fixes to the hbs-slv language pair. I have also begun the coding challenge for ces-hbs and have created a bidix based on the story provided for the coding challenge. I had my Bohemian friends proofread and translate the original text and have filled the bidix with lemmas derived from the texts. The work in progress can be found here.

Work plan:

I would roughly outline my plan and possible sources for further developing the language pair during the summer as follows:

      • Week 1-2: Even up the closed word categories and include them in the bidix
      • Week 3: Even up the coverage of nouns (HBS: 4961 / CES: 6619), build the bidix
      • Week 4: Even up the coverage of proper nouns (HBS: 2369 / CES: 314), build the bidix
      • Week 5: Even up the coverage of adjectives (HBS: 6830 / CES: 2492) and adverbs (HBS: 463 / CES: 2705), build the bidix
      • Week 6: Even up the coverage of verbs (HBS: 2872 / CES: 636), build the bidix
      • Deliverable #1: Evened up morphological lexicons and bidix covering them
      • Week 7-8: Train a statistical tagger on manually tagged corpora and write disambiguation rules alongside it
      • Week 9-10: Write transfer rules
      • Deliverable #2: Disambiguation and transfer rules
      • Week 11-12: Final cleanup, testing on a corpus, final documentation
      • Deliverable #3: Final report of a complete HBS-CES bidirectional translation system

As resources I would use the available monolingual and bilingual lexicons (e.g. Eudict, which has 10k-30k entries for this particular language pair), as well as monolingual and parallel corpora.

Plans for the summer:

My obligations during the summer are the exams I have to take between mid-June and early July. Luckily, there are not many of them and they should not hinder the initial stages of the project too much. Still, I will attempt to clear them as early as possible so that I can fully commit to the project. I also have to write two final papers before July. Other than that, I will have no classes and no binding plans.