User:Sereni

From Apertium
Revision as of 06:34, 13 May 2014 by Sereni

Personal info

Name: Ekaterina Ageeva

Email: sereni.nm@gmail.com

IRC: Sereni

Why is it you are interested in machine translation?

I would like to work in computational linguistics, so I view machine translation as one of the possible areas. MT appeals to me because it is challenging for developers and useful for everyone. It also speaks to the idea of free and accessible information, with texts in one language instantly understandable for speakers of others. I believe MT can contribute to multicultural understanding, which is something I value.

Why is it that you are interested in the Apertium project?

Apertium does work related to linguistics, an area that interests me and in which I have some background. I have participated in several linguistics and programming projects, but this is an opportunity to write a complete piece of software that will become part of a larger system and be useful. As to why this particular project, I believe it matches my skills, and I have a fairly good idea of what should be done. I think it is a good place to start, because I am interested in gaining the skills and experience needed for more complex projects in Apertium after this summer.

My professional interests in this project are to (1) practice writing quality code, (2) learn to make a finished piece of software that integrates into a larger system, (3) practice conducting linguistic experiments, (4) (possibly) write a follow-up paper.

My personal interests include (1) becoming a part of the open-source community, (2) contributing to an open-source project (Apertium in particular, because it matches my area of expertise), (3) spending the summer break on an activity that benefits more people than just me.

Proposal

I am interested in building a toolkit for gisting evaluation of Apertium language pairs. As stated in [1], the main purpose of online machine translation is gisting: users try to understand the main sense of a text rather than obtain an editable translation. It would be useful to have a tool for evaluating how well Apertium language pairs perform in such contexts. Such an evaluation would identify the pairs that are ready for release, thus increasing language coverage, and it would also provide a quantitative scale for measuring quality. Since the evaluation is human-based, a framework is needed that allows individual evaluations to be compared objectively. I propose to create a toolkit that, given a language pair and parallel texts in these languages, generates tests for human evaluators, checks their answers and calculates the success rate depending on the kind of information provided to users. The system will include text and web-based interfaces. It will also offer different testing methods, which vary in the form of the questions and the amount of information provided to users.

Amount of information:

• Original sentence + reference sentence (used for baseline score)

• Original sentence, reference sentence + machine translation (for evaluation)
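As an illustration of how the two conditions might be compared, here is a minimal sketch in Python. It assumes that an answer counts as correct on a case-insensitive exact match and that unanswered gaps count as wrong. The function name `success_rate` and the sample data are hypothetical, not part of any existing Apertium tool.

```python
def success_rate(answers, key):
    """Return the fraction of gaps filled correctly.

    Assumption: an answer is correct if it matches the key after
    stripping whitespace and lowercasing. Missing answers count as wrong.
    """
    if not key:
        raise ValueError("answer key is empty")
    correct = sum(
        1
        for given, expected in zip(answers, key)
        if given.strip().lower() == expected.strip().lower()
    )
    return correct / len(key)


# Hypothetical answer key and evaluator answers for one test.
key = ["house", "river", "walked"]
baseline = success_rate(["house", "lake", "ran"], key)   # no machine translation shown
with_mt = success_rate(["house", "river", "ran"], key)   # machine translation shown
gain = with_mt - baseline                                # improvement attributable to MT
```

The difference between the two rates gives a rough measure of how much the machine translation helps evaluators fill the gaps.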

Types of questions:

• Simple gaps (an omitted keyword)

• Gaps with multiple choice (users are provided with words to choose from)

• Gaps with lemmas (a keyword lemma is shown, the user is required to enter the correct grammatical form)
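The three question types could be rendered roughly as follows. This is only a sketch: the gap marker `_____`, the function names and the output layout are assumptions made for illustration, and the keyword position, distractors and lemma are taken as given rather than computed.

```python
import random

GAP = "_____"  # hypothetical gap marker


def simple_gap(tokens, i):
    """Replace token i with a bare gap; return (task text, answer)."""
    out = list(tokens)
    out[i] = GAP
    return " ".join(out), tokens[i]


def multiple_choice_gap(tokens, i, distractors, rng=None):
    """Replace token i with a gap followed by shuffled options."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    options = [tokens[i]] + list(distractors)
    rng.shuffle(options)
    out = list(tokens)
    out[i] = f"{GAP} ({' / '.join(options)})"
    return " ".join(out), tokens[i]


def lemma_gap(tokens, i, lemma):
    """Replace token i with a gap and show its lemma in brackets."""
    out = list(tokens)
    out[i] = f"{GAP} [{lemma}]"
    return " ".join(out), tokens[i]


sentence = ["The", "dog", "barked", "loudly"]
print(simple_gap(sentence, 2)[0])
print(multiple_choice_gap(sentence, 2, ["slept", "jumped"])[0])
print(lemma_gap(sentence, 2, "bark")[0])
```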

Different types of questions will require different keyword selection techniques. For simple gaps, keywords are determined by co-occurrence (as in [2], for example) and part-of-speech tags. For multiple choice, the candidate words are extracted from the same text by grammar tags and also by length, as described in [3]. For lemmas, the algorithm is yet to be discussed, with the proviso that verbs rather than nouns will be removed in this case.
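To make the multiple-choice case concrete, the sketch below picks distractors from the same text that share the keyword's tag and are closest to it in length. It is a simplification written for illustration, not the procedure from [3]. The function name `pick_distractors`, the tag strings and the sample sentence are hypothetical.

```python
def pick_distractors(tagged_tokens, target_index, n=3):
    """Choose up to n distractors for the word at target_index.

    tagged_tokens is a list of (word, tag) pairs from one text.
    Candidates must share the target's tag; they are ranked by how
    close their length is to the target's, then alphabetically.
    """
    target_word, target_tag = tagged_tokens[target_index]
    seen = {target_word.lower()}
    candidates = []
    for word, tag in tagged_tokens:
        if tag == target_tag and word.lower() not in seen:
            seen.add(word.lower())  # skip repeated words
            candidates.append(word)
    candidates.sort(key=lambda w: (abs(len(w) - len(target_word)), w))
    return candidates[:n]


tagged = [
    ("cat", "n"), ("sat", "vblex"), ("mat", "n"),
    ("elephant", "n"), ("dog", "n"), ("ran", "vblex"),
]
print(pick_distractors(tagged, 0, n=2))
```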

In order to test the toolkit, I propose to run evaluations on several language pairs, possibly the ones being developed or improved as part of other GSoC projects.

Work plan

Pre-work period (1-21 April)

Familiarise myself with Apertium. Get accustomed to working in Unix to ensure a seamless workflow. Explore the existing work on gisting evaluation.

Community bonding period (21 April – 19 May)

Discuss keyword selection for gaps with lemmas; draft the selection methods. Learn how to integrate Apertium with Python applications. Discuss interface features with mentors.
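One simple way to call Apertium from Python is to run the `apertium` command-line tool as a subprocess and pass the text on standard input. The sketch below assumes that `apertium` is installed and on the PATH and that the requested pair is installed. The function name `translate`, the `runner` parameter (added so the call can be replaced in tests) and the pair code in the usage comment are illustrative assumptions.

```python
import subprocess


def translate(text, pair, runner=subprocess.run):
    """Translate text with the apertium command-line tool.

    pair is a direction code understood by the local installation.
    Raises subprocess.CalledProcessError if apertium exits with an error.
    """
    result = runner(
        ["apertium", pair],
        input=text,            # text is sent on standard input
        capture_output=True,   # collect stdout and stderr
        encoding="utf-8",
        check=True,
    )
    return result.stdout.strip()


# Usage, assuming the pair is installed locally:
# print(translate("Hello world", "en-es"))
```

Other integration routes may be preferable for large volumes of text; this is only the most direct one.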

Work period

NB: work periods include writing documentation on the wiki as I go.

Week 1. Create an algorithm for keyword extraction in simple gaps. Test it on Russian and English data by comparing it to results obtained using corpora and tf-idf. Write base code that creates sets of {original sentence, machine translation, reference translation with gaps, answer key} from text files.
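For reference, the tf-idf comparison could look something like the sketch below. It assumes pre-tokenised texts and uses a smoothed idf, which is one common variant rather than the only definition. The function name `tfidf_keywords` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_n=3):
    """Rank the terms of one document by tf-idf against a corpus.

    corpus is a list of token lists. idf is smoothed so that terms
    absent from the corpus do not cause a division by zero.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "home"],
]
doc = ["the", "cat", "chased", "the", "mouse"]
print(tfidf_keywords(doc, corpus, top_n=2))
```

In practice function words would also be filtered out by part-of-speech tag, as frequent words can still score well on a small corpus.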

Week 2. Create a method to determine significant grammatical features in different languages for multiple choice gaps. Create an algorithm that selects words for multiple choice gaps. Test it on Russian and English. Possibly find a speaker of a non-Indo-European language for testing.

Week 3. Design rules for gaps with lemmas. Update code to create multiple choice gaps and gaps with lemmas. Include the possibility to adjust gap density and parts of speech to be removed.
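The adjustable gap density and part-of-speech filter could work along these lines. This is a sketch under stated assumptions: density is taken to mean gaps per token, eligible positions are spread evenly across the text, and the function name `choose_gap_positions` and the tag strings are hypothetical.

```python
def choose_gap_positions(tagged_tokens, allowed_tags, density):
    """Pick token positions to turn into gaps.

    tagged_tokens is a list of (word, tag) pairs; only tokens whose tag
    is in allowed_tags are eligible. density is the target share of all
    tokens to remove (0 < density <= 1). Positions are spread evenly
    over the eligible tokens.
    """
    if not 0 < density <= 1:
        raise ValueError("density must be in (0, 1]")
    eligible = [i for i, (_, tag) in enumerate(tagged_tokens) if tag in allowed_tags]
    if not eligible:
        return []
    wanted = min(len(eligible), max(1, round(density * len(tagged_tokens))))
    step = len(eligible) / wanted
    return [eligible[int(k * step)] for k in range(wanted)]


tagged = [
    ("w0", "det"), ("w1", "n"), ("w2", "pr"), ("w3", "n"), ("w4", "vblex"),
    ("w5", "det"), ("w6", "adj"), ("w7", "n"), ("w8", "pr"), ("w9", "vblex"),
]
print(choose_gap_positions(tagged, {"n", "vblex"}, density=0.2))
```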

Week 4. Develop the text-based interface: create text files with tasks, extract answers from returned text files.
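A text-based round trip might be sketched as follows: one function writes a task with numbered answer lines, and another reads the answers back from the returned text. The file layout (`SOURCE:`, `MT:`, `TASK:` prefixes, `_____` gaps and `1:` answer lines) and the function names are assumptions made for this example.

```python
import re

GAP = "_____"  # hypothetical gap marker
ANSWER_LINE = re.compile(r"^(\d+):\s*(.*)$")


def render_task(source, gapped_reference, machine_translation=None):
    """Build the text of one task, with an empty answer line per gap."""
    lines = ["SOURCE: " + source]
    if machine_translation is not None:
        lines.append("MT: " + machine_translation)
    lines.append("TASK: " + gapped_reference)
    n_gaps = gapped_reference.count(GAP)
    lines.extend(f"{k}: " for k in range(1, n_gaps + 1))
    return "\n".join(lines) + "\n"


def extract_answers(filled_text):
    """Read the numbered answer lines back, in gap order."""
    answers = {}
    for line in filled_text.splitlines():
        match = ANSWER_LINE.match(line.strip())
        if match:
            answers[int(match.group(1))] = match.group(2).strip()
    return [answers[k] for k in sorted(answers)]


task = render_task("Собака лаяла", "The _____ barked", "The dog barked")
filled = task.replace("1: ", "1: dog")  # simulate an evaluator's reply
print(extract_answers(filled))
```

Omitting `machine_translation` produces the baseline variant of the same task.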

Deliverable 1: A program in Python that creates three varieties of tests given parallel texts and a language pair.

Week 5. Get familiar with command line interface creation. Develop the command line interface to wrap text generation.
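A command-line wrapper could be built with Python's standard `argparse` module. The sketch below only defines and parses the options. The program name `gist-eval` and every argument and flag name are hypothetical placeholders, not an agreed interface.

```python
import argparse


def build_parser():
    """Define the (hypothetical) command-line options of the test generator."""
    parser = argparse.ArgumentParser(
        prog="gist-eval",
        description="Generate gisting evaluation tasks from parallel texts.",
    )
    parser.add_argument("pair", help="language pair code")
    parser.add_argument("source", help="path to the source-language text")
    parser.add_argument("reference", help="path to the reference translation")
    parser.add_argument(
        "--mode",
        choices=["simple", "choice", "lemma"],
        default="simple",
        help="type of gap question to generate",
    )
    parser.add_argument(
        "--density", type=float, default=0.2, help="share of tokens to remove"
    )
    parser.add_argument(
        "--with-mt",
        action="store_true",
        help="include the machine translation in the task",
    )
    return parser


args = build_parser().parse_args(
    ["en-es", "src.txt", "ref.txt", "--mode", "lemma", "--density", "0.3"]
)
print(args.pair, args.mode, args.density, args.with_mt)
```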

Week 6. Develop the web-based interface. It will include a landing page for evaluators with a choice of language pair and testing method (these can also be randomly assigned), an admin page for managing the database and a pretty (public?) stats page. Host it on the web.

Deliverable 2: Text-based interface with command line wrapper and web-based interface for the toolkit.

Weeks 7-10. Improving the Avar -> Russian language pair (details to be discussed).

Week 11. User acceptance testing: perform gisting evaluation of English -> Kazakh and Tatar -> Russian language pairs. Find texts and informants. Ensure the balance of testing methods.

Week 12. User acceptance testing: perform gisting evaluation of 2-3 existing language pairs (continued). Analyze and summarise the results.

End product: Gisting evaluation toolbox with text and web interfaces

References

[1] Jim O'Regan, Mikel L. Forcada: Peeking through the language barrier: the development of a free/open-source gisting system for Basque to English based on apertium.org. Procesamiento del Lenguaje Natural 51: 15-22 (2013)

[2] Yutaka Matsuo, Mitsuru Ishizuka: Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools 13(1): 157-169 (2004)

[3] Trond Trosterud, Kevin Brubeck Unhammer: Evaluating North Sámi to Norwegian assimilation RBMT. Proceedings of the Third International Workshop on Free/Open-Source Rule-Based Machine Translation (FreeRBMT 2012), June 2012

Skills

I am a second-year student of Computational Linguistics at the Higher School of Economics, Moscow. I have a good understanding of linguistic theory: phonetics, morphology, syntax and semantics. I am a native speaker of Russian, have a good command of English and some basic knowledge of French. My linguistics experience includes corpus studies of Russian morphology and small-scale projects in sociolinguistics and lexical typology. I program in Python (about 2 years' experience, including a 1-year formal course) and have worked with Django. I have written a parser for Russian verbs and developed a testing system for one of the courses at my university (web interface found here, in Russian: [1]). I am currently developing a corpus for computer-aided musicology. I have also completed the coding challenge for the project. The web version can be found at [2].

Availability

I have no employment or vacation plans for the summer. My school year, however, ends in the middle of June, so between May 19 and June 15 I will spend 20 to 25 hours a week on the project. I am able and willing to compensate for this after June 15 by working 40 hours a week. During the community bonding period, I may not be available online from April 28 to May 10 because I will be taking part in a linguistic expedition to Daghestan. I therefore plan to start familiarising myself with Apertium immediately.