User:Fpetkovski/GSOC 2013 Application - Improving the lexical selection module

The lexical selection module in Apertium is currently a prototype. Many optimisations could be made to make it faster and more efficient. A number of scripts can be used for learning lexical-selection rules, but they are not particularly well written. Part of the task will be to rewrite these scripts, taking all possible corner cases into account.

The project idea is located here.

Personal Info

First name: Filip
Last name: Petkovski
Email: filip.petkovsky@gmail.com
IRC: fpetkovski on #apertium

Why are you interested in machine translation?

Machine translation can be thought of as one of the greatest challenges in natural language processing. It is the single most useful application of NLP, and building a good MT system requires a blend of numerous techniques from both computer science and linguistics.

Why are you interested in the Apertium project?

Apertium is a great project. It is obvious that a ton of work has been put into both developing the platform and creating the language resources. Apertium is a nice blend of rule-based and corpus-based machine translation. It also allows me to easily work with my native language, as well as with other closely related languages from the Slavic language group.

Why should Google and Apertium sponsor it?

Lexical selection is the task of deciding which translation of a word to use in a given context, for example choosing the right target-language equivalent for an ambiguous source word such as English "bank". A good lexical selection module can significantly increase translation quality, and give machine translation a more human-like feel.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in "Improving the lexical selection module".

I intend to improve the existing scripts and programs, and merge them into a release-ready package. I also intend to extend the existing functionality of the module.

Work already done

  • Generated lexical selection rules for the sh-mk language pair (committed to SVN)
  • Generated additional bidix entries for the sh-mk language pair (committed to SVN)
  • Participated in last year's GSoC (corpus-based feature transfer)

Work to do

TODO list:

  • Merge the four different implementations of irstlm_ranker into a single implementation
  • Move lex-learner to lex-tools
  • Script/program for finding possibly missing bidix entries in an aligned parallel corpus (a sketch follows this list).
  • Do proper processing of tags in all scripts.
  • Remove unused and redundant scripts.
  • Work on a way to trim non-significant features from the maximum-entropy models (a weight-based trimming sketch follows this list).
  • Rewrite the LRXProcessor::processME and LRXProcessor::process methods so that they share more code and are more modular; a 650-line method is unmaintainable.
  • Make sure that capitalisation, "any tag" and "any character" patterns work as expected.
  • Ensure that all scripts process escaped characters correctly, e.g. ^ \ / $ < > (an escaping helper sketch follows this list)
  • Make sure that <match lemma="*" tags="*"/> works the same as <match/>
  • Update the instructions on the wiki
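
For the missing-bidix-entries item, a rough idea of such a script is sketched below in Python. It assumes each corpus line holds the source sentence, the target sentence and Pharaoh-style word alignments separated by "|||", and that the existing bidix entries are available as tab-separated lemma pairs; the file names, formats and the cut-off of 50 candidates are assumptions, not a final design.

 # Sketch: propose possibly missing bidix entries from a word-aligned corpus.
 # Assumed inputs (not final):
 #   bidix.tsv      - existing entries, one "source_lemma<TAB>target_lemma" per line
 #   corpus.aligned - one "src tokens ||| trg tokens ||| 0-0 1-2 ..." per line
 from collections import Counter
 import sys

 def load_bidix(path):
     pairs = set()
     with open(path, encoding="utf-8") as f:
         for line in f:
             fields = line.rstrip("\n").split("\t")
             if len(fields) >= 2:
                 pairs.add((fields[0], fields[1]))
     return pairs

 def candidate_pairs(corpus_path, bidix):
     counts = Counter()
     with open(corpus_path, encoding="utf-8") as f:
         for line in f:
             if line.count("|||") != 2:
                 continue
             src, trg, alig = (part.strip() for part in line.split("|||"))
             src_toks, trg_toks = src.split(), trg.split()
             for link in alig.split():
                 i, j = (int(x) for x in link.split("-"))
                 if i < len(src_toks) and j < len(trg_toks):
                     pair = (src_toks[i], trg_toks[j])
                     if pair not in bidix:
                         counts[pair] += 1
     return counts

 if __name__ == "__main__":
     bidix = load_bidix(sys.argv[1])
     for (src, trg), n in candidate_pairs(sys.argv[2], bidix).most_common(50):
         print("%d\t%s\t%s" % (n, src, trg))

The frequency counts only serve to rank the candidates; anything proposed this way would still need to be reviewed by hand before being added to the bidix.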
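
For trimming non-significant features from the maximum-entropy models, the most straightforward approach is to drop features whose absolute weight falls below a threshold. The Python sketch below assumes the model is serialised as plain "feature<TAB>weight" lines, which may well differ from the format the module actually uses; it is meant only to illustrate the idea.

 import sys

 def trim_model(in_path, out_path, min_abs_weight=0.001):
     """Copy a maxent model, keeping only features with |weight| >= min_abs_weight.

     Assumes one "feature<TAB>weight" pair per line; adjust for the real format.
     """
     kept = total = 0
     with open(in_path, encoding="utf-8") as src:
         with open(out_path, "w", encoding="utf-8") as dst:
             for line in src:
                 if not line.strip():
                     continue
                 feature, weight = line.rstrip("\n").rsplit("\t", 1)
                 total += 1
                 if abs(float(weight)) >= min_abs_weight:
                     dst.write(line)
                     kept += 1
     print("kept %d/%d features" % (kept, total), file=sys.stderr)

 if __name__ == "__main__":
     threshold = float(sys.argv[3]) if len(sys.argv) > 3 else 0.001
     trim_model(sys.argv[1], sys.argv[2], threshold)

A fixed threshold is only one option; keeping the top-k features by magnitude, or pruning by held-out accuracy, would be worth comparing on real data.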
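
The escaping item refers to the characters that are reserved in the Apertium stream format. A small helper shared by all the scripts could keep that behaviour consistent; the sketch below hard-codes exactly the characters listed in the item above and should be treated as an illustration rather than the final implementation.

 import re

 # Reserved characters from the TODO item above: ^ \ / $ < >
 _RESERVED = "^\\/$<>"

 _ESC_RE = re.compile("([" + re.escape(_RESERVED) + "])")
 _UNESC_RE = re.compile(r"\\(.)")

 def escape(text):
     """Backslash-escape every reserved character (including the backslash)."""
     return _ESC_RE.sub(r"\\\1", text)

 def unescape(text):
     """Remove one level of backslash escaping."""
     return _UNESC_RE.sub(r"\1", text)

 assert unescape(escape("a^b/c$d<e>f\\g")) == "a^b/c$d<e>f\\g"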

Skills, qualifications and field of study

I am a graduate student of Computer Science, holding a Bachelor's degree in Computing. I have excellent knowledge of Java and C#, and I'm fairly comfortable with C/C++ and scripting languages.


Machine learning is one of my strongest skills. I have worked on quite a few ML projects involving named-entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimising models, and with feature selection and feature extraction for classification. I did my bachelor's thesis in the field of computer vision, and my master's thesis is in the field of natural language processing.

I also took part in last year's GSoC, and have additionally worked on the sh-mk and sh-en language pairs.

Non-GSoC activities

My Master's thesis is due on 28.6, but I intend to focus on it intensively before the coding period starts (27.5).

I might also be moving to the United States for a few months as part of a work and travel programme, so I might be offline for a couple of days around 10.6.