User:Fpetkovski/GSOC 2013 Application - Improving the lexical selection module

The lexical selection module in Apertium is currently a prototype, and many optimisations could make it faster and more efficient. A number of scripts exist for learning lexical-selection rules, but they are not particularly well written. Part of the task will be to rewrite these scripts, taking all possible corner cases into account.

The project idea is located here:

TODO list:

  • Merge the four different implementations of irstlm_ranker into a single implementation
  • Move lex-learner to lex-tools
  • Script/program for finding possibly missing bidix entries from an aligned parallel corpus.
  • Do proper processing of tags in all scripts.
  • Remove unused and redundant scripts.
  • Work on a way to trim non-significant features from the maximum-entropy models.
  • Rewrite the LRXProcessor::processME and LRXProcessor::process methods so that they share more code and are more modular. A 650-line method is too long to maintain.
  • Make sure that capitalisation, any tag and any character work as expected.
  • Ensure that all scripts process escaped characters correctly, e.g. ^ \ / $ < >
  • Make sure that <match lemma="*" tags="*"/> works the same as <match/>
  • Update the instructions on the wiki
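
As an illustration of the escaped-character item above, here is a minimal sketch of how a script might split an Apertium stream lexical unit without being confused by backslash-escaped reserved characters. The function names (split_unescaped, split_lemma_tags, unescape) are hypothetical and are not taken from the existing lex-tools scripts. The sketch assumes the usual stream conventions: analyses are separated by /, tags are wrapped in < >, reserved characters are escaped with a backslash, and all tags follow the lemma.

```python
def split_unescaped(text, sep):
    """Split text on sep, skipping any sep that is escaped with a backslash."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Keep the escape sequence intact so later stages still see it.
            current.append(text[i:i + 2])
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def split_lemma_tags(analysis):
    """Turn 'house<n><sg>' into ('house', ['n', 'sg']).

    Assumes all tags follow the lemma (no tags in the middle of a multiword).
    """
    lemma = split_unescaped(analysis, "<")[0]
    rest = analysis[len(lemma):]          # e.g. '<n><sg>'
    tags = [piece[1:] for piece in split_unescaped(rest, ">") if piece]
    return lemma, tags


def unescape(text):
    """Remove the escaping backslashes, e.g. 'a\\/b' becomes 'a/b'."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    unit = r"a\/b<n><sg>/c<n><pl>"        # contents between ^ and $
    for analysis in split_unescaped(unit, "/"):
        lemma, tags = split_lemma_tags(analysis)
        print(unescape(lemma), tags)
```

Splitting first and unescaping last keeps an escaped / or < in a lemma from being read as a delimiter, which is the kind of corner case a naive str.split("/") would get wrong.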

Personal Info

First name: Filip
Last name: Petkovski
IRC: fpetkovski on #apertium

Why are you interested in machine translation?

Machine translation is one of the greatest challenges in natural language processing. It is arguably the single most useful application of NLP, and building a good MT system requires a blend of techniques from both computer science and linguistics.

Why is it that you are interested in the Apertium project?

Apertium is a great project. It is obvious that a ton of work has been put into both developing the platform and creating the language resources. Apertium is a nice blend of rule-based and corpus-based machine translation. It also allows me to work easily with my native language, as well as with other closely related languages from the Slavic language group.

Why should Google and Apertium sponsor it?

Lexical selection is the task of deciding which word to use in a given context. A good lexical selection module can significantly increase translation quality, and give machine translation a more human-like feel.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in Improving the lexical selection module.

I intend to improve the existing scripts and programs, and merge them into a release-ready package. I also intend to extend the existing functionality of the module.

Work already done

  • Generated lexical selection rules for the sh-mk language pair (submitted on svn)
  • Generated additional bidix entries for the sh-mk language pair (submitted on svn)
  • Participated in the previous GSoC (corpus-based feature transfer)

Work to do

Skills, qualifications and field of study

I am a graduate student of Computer Science, holding a Bachelor's degree in Computing. I have an excellent knowledge of Java and C#, and I'm fairly comfortable with C/C++ and scripting languages.

Non-GSoC activities