User:Debeck

From Apertium
Revision as of 05:22, 8 April 2010 by Debeck (talk | contribs)

Name: Daniel Emilio Beck

daniel.e.beck@gmail.com


http://debeck.wikidot.com (under construction)

Why is it you are interested in machine translation?

I was introduced to the field of Natural Language Processing (NLP) in March 2007, when I became a research assistant under the supervision of Prof. Aline Villavicencio at Universidade Federal do Rio Grande do Sul (UFRGS). Since then, I have learned about most of the “subfields” of NLP, including Machine Translation (MT).

Even though my work at UFRGS focused on Multiword Extraction, MT systems have always piqued my interest because they are a very good example of well-established systems that use most of the NLP concepts and resources I’ve seen in theory (tokenization, tagging, parsing, word sense disambiguation, etc.). Also, since the translations currently produced by those systems aren’t perfect (and probably never will be), this is a field with plenty of room for research and improvement. Finally, being an open source software enthusiast, I believe knowledge in general (not just source code) should be spread around, and translations are essential to do that. Thus, being able to contribute to the construction of machine translation engines would be very gratifying work.

Because of these factors, I decided to work with MT in my research: I did my graduation thesis in that field and am currently doing my master’s in it.


Why is it that you are interested in the Apertium project?

I believe machine translation is a field that can benefit greatly from open source development. Code reuse, combined with modularity and the possibility of getting help from a wider range of people around the world (and therefore from speakers of different languages), could make the construction of new language pairs easier and faster.

Also, I find Apertium’s focus on less widely used languages very interesting because it helps to create NLP resources for these languages. I have sometimes had to deal with this lack of resources for non-English languages in my research, so it would be nice to be able to help build these resources.

Finally, I worked with Apertium in my graduation thesis and will probably work with it in my master’s. It would be nice to contribute to a project that I have used in my research, because doing so can help other people (and probably myself) do their research too.


Which of the published tasks are you interested in? What do you plan to do?

I am interested in the Complex Multiwords task, as announced on the ideas site. In my grad thesis, I started to build a multiword module for this task, but it was restricted to multiwords with 2 components and it didn’t use finite-state transducers (which are used by the analysis and generation modules in Apertium). So, the idea is to improve this module by allowing multiwords with any number of components and by using transducers in both analysis and generation.

Proposal

The proposal consists of a multiword processing module, a specialized multiword dictionary template and a dictionary compiler. The module will be placed between the analyser and the POS tagger. Its purpose is to join multiwords by analysing the lexical forms of their single-word components. The process is described below:


First, it will detect multiword candidates by checking the lemmata of the lexical forms and matching them with the entries present in the multiword dictionary.

Then it will confirm that the candidate is a multiword by checking each combination of tag groups. If at least one combination matches an agreement template (stored in the multiword dictionary), the candidate is confirmed.

Finally, it will generate the output. If the candidate is a multiword, the output will be a single lexical form carrying the tag groups generated by the matched agreement templates. If it isn’t, the output will be a copy of the input (separate lexical forms).
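The three steps above can be sketched in C++ as follows. This is only a minimal illustration under stated assumptions: the type names (LexicalForm, AgreementTemplate, MultiwordEntry, MultiwordDictionary), the function joinMultiword and the STL-map dictionary layout are hypothetical and are not taken from the prototype or from Apertium itself. A template is treated as matched when each component has the required tag group among its analyses, which is equivalent to finding one matching combination.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration; not Apertium's actual data structures.
struct LexicalForm {
    std::string lemma;
    std::vector<std::string> tagGroups;  // one entry per analysis, e.g. "<n><sg>"
};

struct AgreementTemplate {
    std::vector<std::string> componentTags;  // required tag group per component
    std::string resultTags;                  // tag group of the joined multiword
};

struct MultiwordEntry {
    std::string lemma;                        // lemma of the multiword itself
    std::vector<AgreementTemplate> templates; // one or more agreement templates
};

// Dictionary keyed by the sequence of component lemmata (assumed layout).
using MultiwordDictionary = std::map<std::vector<std::string>, MultiwordEntry>;

// Returns one joined lexical form if the window is a confirmed multiword,
// otherwise a copy of the input.
std::vector<LexicalForm> joinMultiword(const std::vector<LexicalForm>& window,
                                       const MultiwordDictionary& dict) {
    assert(!window.empty());

    // Step 1: detect a candidate by matching the lemmata against the dictionary.
    std::vector<std::string> lemmata;
    for (const LexicalForm& lf : window) lemmata.push_back(lf.lemma);
    auto it = dict.find(lemmata);
    if (it == dict.end()) return window;

    // Step 2: confirm the candidate by testing every agreement template.
    LexicalForm joined{it->second.lemma, {}};
    for (const AgreementTemplate& t : it->second.templates) {
        if (t.componentTags.size() != window.size()) continue;
        bool matches = true;
        for (std::size_t i = 0; i < window.size() && matches; ++i) {
            const std::vector<std::string>& groups = window[i].tagGroups;
            matches = std::find(groups.begin(), groups.end(),
                                t.componentTags[i]) != groups.end();
        }
        // Step 3: each matched template contributes one output tag group.
        if (matches) joined.tagGroups.push_back(t.resultTags);
    }
    if (joined.tagGroups.empty()) return window;  // candidate not confirmed
    return {joined};
}
```

Because the dictionary key is a lemma sequence of any length, the same lookup works for multiwords with more than 2 components. Choosing which windows of the input to test is left out of this sketch.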


Also, the module will perform the reverse process after transfer: it will detect multiword lexical forms and output the lexical forms of their single-word components by checking the multiword dictionary.
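A rough sketch of this reverse step is shown below, assuming that after transfer each lexical form carries a single tag group. The names TaggedForm, SplitRule, SplitDictionary and splitMultiword are hypothetical, as is the idea of keying the lookup on the pair (multiword lemma, tag group).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not Apertium's actual data structures.
struct TaggedForm {
    std::string lemma;
    std::string tags;  // a single tag group, e.g. "<vblex><pri>"
};

// What one multiword reading expands to, component by component.
struct SplitRule {
    std::vector<std::string> lemmata;
    std::vector<std::string> tags;
};

// Keyed by (multiword lemma, multiword tag group); an assumed layout.
using SplitDictionary =
    std::map<std::pair<std::string, std::string>, SplitRule>;

// Returns the single-word components of a multiword lexical form, or a
// one-element copy of the input when the form is not in the dictionary.
std::vector<TaggedForm> splitMultiword(const TaggedForm& form,
                                       const SplitDictionary& dict) {
    auto it = dict.find(std::make_pair(form.lemma, form.tags));
    if (it == dict.end()) return {form};

    const SplitRule& rule = it->second;
    assert(rule.lemmata.size() == rule.tags.size());
    std::vector<TaggedForm> components;
    for (std::size_t i = 0; i < rule.lemmata.size(); ++i) {
        components.push_back(TaggedForm{rule.lemmata[i], rule.tags[i]});
    }
    return components;
}
```

Forms that are not found pass through unchanged, so the step could be applied to every lexical form leaving transfer without a separate multiword check.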

As noted above, there is already a prototype of this module, hosted in the Apertium repository, that I made for my grad thesis, but it doesn’t solve all the issues described in the task. This prototype will be used as the starting point for the development of the project.

The formal multiword dictionary template will consist of:


A set of agreement templates, which will map a set of tag groups into a single tag group.

A set of multiword entries, where each one will have one or more agreement templates.


The dictionaries will be compiled into finite-state transducers (FSTs) before being used by the multiword module, in a similar way to how the monolingual and bilingual dictionaries are used in the current Apertium implementation. A compiler for those dictionaries will also be developed in this project; it will read the XML files and output them in an FST format.
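To give an idea of what the compiled form could look like, the sketch below builds a very small transducer in memory. Its transitions are labelled with component lemmata and its final states emit the multiword lemma. It is a simplified stand-in (essentially a trie with outputs), not the FST format actually used by Apertium. The Transducer type and its addEntry and lookup members are hypothetical, and XML parsing and serialization are left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory transducer: states are indices, transitions are
// labelled with component lemmata, and final states emit a multiword lemma.
struct Transducer {
    struct State {
        std::map<std::string, int> next;  // input lemma -> target state
        std::string output;               // non-empty only on final states
    };
    std::vector<State> states;

    Transducer() : states(1) {}           // state 0 is the initial state

    // Adds one dictionary entry as a path through the transducer.
    void addEntry(const std::vector<std::string>& components,
                  const std::string& multiwordLemma) {
        assert(!components.empty() && !multiwordLemma.empty());
        int current = 0;
        for (const std::string& lemma : components) {
            auto it = states[current].next.find(lemma);
            if (it != states[current].next.end()) {
                current = it->second;     // reuse the shared prefix
                continue;
            }
            states.push_back(State{});
            int created = static_cast<int>(states.size()) - 1;
            states[current].next[lemma] = created;
            current = created;
        }
        states[current].output = multiwordLemma;
    }

    // Returns the multiword lemma for a full match, or "" if there is none.
    std::string lookup(const std::vector<std::string>& components) const {
        int current = 0;
        for (const std::string& lemma : components) {
            auto it = states[current].next.find(lemma);
            if (it == states[current].next.end()) return "";
            current = it->second;
        }
        return states[current].output;
    }
};
```

Entries that share their first components share states, which is one reason a transducer can be more compact than a flat map of complete lemma sequences.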


Benefits

Implementing this module and including it in the Apertium engine should directly improve translations, since it would enable the engine to correctly process a wider range of sentences. This can speed up the revision process and, as a consequence, the entire translation work done by human translators.


Qualifications

Because of my grad thesis, I’m already familiar with Apertium, since I needed to study its code and documentation. I’m also very familiar with the prototype code, since I wrote most of it myself. This would allow more planning during the community bonding period, which will make the coding task easier (I could even start coding during that period).

I also have experience in C++ and XML, which I acquired not only while working with Apertium but also in other work such as class projects.

Finally, I’m very interested in research and I see this project as a good topic for it. After (or even during) the development of the module, experiments can be run to measure the translation improvement, and the results can lead to papers and/or reports. Even though GSoC isn’t directly related to research, the Apertium project is, so I think I can contribute to it in that way too.


Work plan

I have just started my master’s, which means I don’t have to focus on research yet and only have to attend classes. Currently I’m taking three classes, which keep me busy for about 10 hours per week. I don’t have any other commitments for the rest of the summer.


Community Bonding Period:

Discuss an appropriate format for the multiword dictionaries.

Analyse the prototype code to find what needs (re-)implementation.

Plan the module and compiler implementation.

Week 1: write the DTD for multiword dictionaries.

Weeks 2-4: expand the prototype to work with multiwords with more than 2 components.

Deliverable 1: DTD for the multiword dictionaries and the multiword module with analysis/generation implemented with STL maps (as in the prototype).

Weeks 5-7: implement the compiler for the multiword dictionaries.

Week 8: test and document the compiler.

Deliverable 2: Compiler for multiword dictionaries.

Weeks 9-11: implement analysis and generation in the multiword module using FSTs.

Week 12: test and document the multiword module.

Project completed: Multiword module with analysis/generation implemented with FSTs, tested and documented.