
User:Mary.szmary/proposal


Contact information

Name: Maria Sheyanova
E-mail: masha.shejanova@gmail.com
IRC: mary-szmary
SourceForge: maryszmary
Phone number: +79169223114
Timezone: UTC+3

Why is it you are interested in machine translation?

On the one hand, I am a student at a linguistics faculty, so working with language material and learning more about language structure while contributing to machine translation systems is one of my primary interests as a linguist. On the other hand, I am also very interested in coding, and I would like to use every opportunity to code better and to learn new things about programming. Working on machine translation fits both interests ideally.


Why is it that you are interested in the Apertium project?

Apertium does rule-based machine translation. In contrast to corpus-based translation, which seems to be more widespread nowadays, it requires working with language structure, which attracts me as a linguist. I also find it very attractive that Apertium supports minority languages, because working with them is much more interesting than working with the big ones.


Which of the published tasks are you interested in? What do you plan to do?

Adopt Polish -> Russian language pair

Reasons why Google and Apertium should sponsor it

Currently the pol-rus language pair is at an initial stage (in the incubator). There are very few words in the bilingual dictionary and no rules. The Polish (pol) dictionary also does not have enough words. My goal is to fill the dictionaries, write as many rules as I can, and bring the pair close to release quality.

A description of how and who it will benefit in society

The result of this work is going to be a free and open source Polish-Russian translation system.


Field of work and available resources

Morphological dictionaries

There is already a good and complete dictionary for Russian and a less complete Polish dictionary, which needs to be improved. For this purpose the Polish version of Wiktionary, SGJP or SJP PWN can be used.

Bilingual dictionary

Before I started working on it for the coding challenge, there was almost nothing in the dictionary, so a lot of work remains to be done. There are a number of online dictionaries which will be useful for this: Wiktionary, Glosbe and a number of dictionaries here.

Parallel corpora

To get data for rules of all kinds I am going to use pol-rus corpora. There are a number of corpora available: a pol-rus section on Ruscorpora, and some others here.


Work plan

Overview

post application period

  • Getting better acquainted with Apertium, reading more documentation about its systems and tools
  • Working on the 'James and Mary' translation (translating the story with full coverage, writing some lexical choice rules for it, getting the baseline word error rate)
  • Improving my knowledge of bash
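
The baseline word error rate mentioned in the list above can be illustrated with a short Python sketch. It is only a minimal illustration, assuming whitespace tokenisation and a single reference translation; the function name `word_error_rate` is hypothetical, and in practice an existing evaluation tool would probably be used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: both strings are tokenised by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

A lower value means the machine output needs fewer word-level edits to match the post-edited reference.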

community bonding period

  • Closer examination and evaluation of the language data resources that can be used, i.e.:
    • looking for other possible resources;
    • evaluating the usefulness of the resources;
  • Learning more about the possible problems

work period

  • 1st month: filling and testing the Polish and bilingual dictionaries
  • 2nd month: writing lexical choice and transfer rules
  • 3rd month: writing transfer rules, evaluating, testing, last fixes

Schedule

week 1: write scripts to get missing words for the Polish dictionary (using mostly wikisłownik and PWN, but maybe also some downloadable dictionaries)
weeks 2-3: write scripts to get translations for the bilingual dictionary (using mostly wikisłownik and online dictionaries)
week 4: check the completeness of the dictionaries (I think I can use Russian and Polish corpora for that)
Deliverable #1
weeks 5-6: write the lexical choice rules (consider generating them automatically using corpora I have access to)
week 7: estimate the validity of the rules
week 8: start writing the transfer rules
27 June: midterm evaluations deadline
Deliverable #2
weeks 9-10: write the transfer rules
week 11: evaluating, testing
week 12: clean up the code, last fixes, writing documentation
Project completed: a language pair of release quality or close to it
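
The completeness check planned for week 4 could be approximated as a corpus coverage count. The Python sketch below is only an illustration under stated assumptions: `known_forms` is a hypothetical set of surface forms standing in for whatever the morphological analyser recognises, and tokenisation is a simple regular expression rather than Apertium's own.

```python
import re
from collections import Counter

# Assumption: a token is any run of word characters (letters, digits, underscore).
TOKEN = re.compile(r"\w+")


def naive_coverage(corpus_text, known_forms):
    """Return (share of known tokens, Counter of unknown tokens)."""
    tokens = [t.lower() for t in TOKEN.findall(corpus_text)]
    if not tokens:
        return 0.0, Counter()
    unknown = Counter(t for t in tokens if t not in known_forms)
    covered = len(tokens) - sum(unknown.values())
    return covered / len(tokens), unknown


# Usage: the most frequent unknown words are the best candidates to add first.
coverage, unknown = naive_coverage(
    "Ala ma kota. Ala ma psa.", {"ala", "ma", "kota"}
)
print(round(coverage, 3), unknown.most_common(5))
```

Sorting the unknown words by frequency gives a simple priority list for the dictionary work in weeks 1-3.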

List your skills and give evidence of your qualifications

I'm a third-year bachelor's student at the Linguistic Faculty of NRU HSE (Russia).
Languages: Russian (native), Polish, English, Toki Pona :), German, basic knowledge of Indonesian.
Programming skills: Python (both 2 and 3), R, basic knowledge of bash.
Other computer skills: HTML, XML, CSS.

As a part of the coding challenge, I’ve done the following:

  • added prepositions to bidix using the Polish version of Wiktionary (30 entries)
  • added adverbs (about 1100 entries), adjectives (about 7500 entries), conjunctions (about 150 entries), numerals and nouns (about 12 000 entries) to bidix by automatic requests to an online dictionary
  • wrote a couple of lexical choice rules
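
As an illustration of what adding bidix entries automatically can look like, here is a minimal Python sketch that builds one bilingual dictionary entry from a word pair. It is not the script used for the coding challenge; the function name `bidix_entry` and the example word pair are hypothetical, and the sketch assumes single-word lemmas with the same part-of-speech tags on both sides unless stated otherwise.

```python
import xml.etree.ElementTree as ET


def bidix_entry(left, right, left_tags, right_tags=None):
    """Build one <e><p><l>..</l><r>..</r></p></e> entry as a string.

    Assumption: lemmas contain no spaces (multiword entries need extra markup).
    """
    if right_tags is None:
        right_tags = left_tags
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma, tags in (("l", left, left_tags), ("r", right, right_tags)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        for tag in tags:
            # Each grammatical symbol becomes an <s n="..."/> element.
            ET.SubElement(node, "s", {"n": tag})
    return ET.tostring(entry, encoding="unicode")


# Usage: a hypothetical Polish-Russian adverb pair.
print(bidix_entry("szybko", "быстро", ["adv"]))
```

Building the entry through an XML library rather than string concatenation keeps special characters in lemmas correctly escaped.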

All scripts and materials for the coding challenge are here.

List any non-Summer-of-Code plans you have for the Summer

I have exams until the third or fourth week of June, so I won't be able to work full-time during this period, but I can spend 20-25 hours per week on the task. After the exams end I'm going to visit my parents for some 4-5 days, and during that time I will only be able to spend 25-30 hours per week on the task. After that I'm ready to work full time and spend up to 45-50 hours per week on the task.
