Difference between revisions of "User:Fpetkovski/GSOC 2013 Application - Improving the lexical selection module"

Revision as of 11:46, 30 April 2013

The lexical selection module in Apertium is currently a prototype. Many optimisations could make it faster and more efficient. A number of scripts can be used for learning lexical-selection rules, but they are not particularly well written; part of the task will be to rewrite them, taking all possible corner cases into account.

The project idea is located here.

Personal Info

First name: Filip
Last name: Petkovski
email: filip.petkovsky@gmail.com
IRC: fpetkovski on #apertium

Why are you interested in machine translation?

Machine translation can be thought of as one of the greatest challenges in natural language processing. It is the single most useful application of NLP and building a good MT system requires a blend of numerous different techniques from both computer science and linguistics.

Why are you interested in the Apertium project?

Apertium is a great project. It is obvious that a ton of work has been put into both developing the platform and creating the language resources. Apertium is a nice blend between rule based and corpus based machine translation. It also allows me to easily work with my native language, as well as with other closely related languages from the Slavic language group.

Why should Google and Apertium sponsor it?

Lexical selection is the task of deciding which word to use in a given context. A good lexical selection module can significantly increase translation quality, and give machine translation a more human-like feel.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in the task Improving the lexical selection module.

I intend to improve the existing scripts and programs, and merge them into a release-ready package. I also intend to extend the existing functionality of the module.

Work already done

  • Generated lexical selection rules from a parallel corpus for the sh-mk language pair (committed to SVN)
  • Generated additional bidix entries from a parallel corpus for the sh-mk language pair (committed to SVN)
  • Participated in last year's GSoC (corpus-based feature transfer)

Work to do

Community bonding period:

  • go through the training process for monolingual rule extraction
  • go through the training process for MaxEnt rule extraction (monolingual/parallel)
  • document the results

Week 1:

  • Update the instructions on the wiki
  • Remove unused and redundant scripts
  • Write a script for finding possibly missing bidix entries in an aligned parallel corpus
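The missing-bidix-entry step could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the input format (a list of aligned source/target lemma pairs), the helper name `find_missing_entries` and the frequency threshold are all assumptions made for the example.

```python
from collections import Counter

def find_missing_entries(alignments, bidix_pairs, min_count=5):
    """Count aligned (source, target) lemma pairs and report frequent
    pairs that are absent from the existing bilingual dictionary."""
    counts = Counter(alignments)
    return sorted(
        (pair for pair, n in counts.items()
         if n >= min_count and pair not in bidix_pairs),
        key=lambda pair: -counts[pair],
    )

# Toy data: each alignment is one (source lemma, target lemma) pair.
alignments = [("kuća", "куќа")] * 6 + [("pas", "куче")] * 3
bidix = {("pas", "куче")}
print(find_missing_entries(alignments, bidix))  # [('kuća', 'куќа')]
```

A real script would read word alignments (e.g. GIZA++ output) and the compiled bidix, but the core idea of thresholded alignment counts is the same.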

Week 2:

  • Process tags properly in all scripts
  • Make sure that capitalisation, "any tag" and "any character" work as expected
  • Ensure that all scripts process escaped characters correctly, e.g. ^ \ / $ < >
  • Make sure that <match lemma="*" tags="*"/> works the same as <match/>
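The escaping step above could be handled along these lines; this is a sketch assuming only the characters the plan lists (the real stream format reserves more), and the function names are illustrative.

```python
import re

def escape_stream(text):
    """Backslash-escape the reserved characters ^ \\ / $ < >."""
    return re.sub(r"([\^\\/$<>])", r"\\\1", text)

def unescape_stream(text):
    """Undo escape_stream: drop the backslash before any character."""
    return re.sub(r"\\(.)", r"\1", text)

s = "a^b/c$d"
print(escape_stream(s))  # a\^b\/c\$d
assert unescape_stream(escape_stream(s)) == s
```

The important corner case is that escaping and unescaping must round-trip, including when the input already contains backslashes.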

Week 3:

  • Move lex-learner to lex-tools

Week 4-5:

  • Rewrite the LRXProcessor::processME and LRXProcessor::process methods so that they share more code and are better modularised. A 650-line method is a clear sign that refactoring is needed.

Week 6-7:

  • Merge the four different implementations of irstlm_ranker into a single implementation

Weeks 8-10:

  • Work on a way to trim non-significant features from the maximum-entropy models.
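One simple way to approach this trimming, sketched under the assumption that the model is available as a feature-to-weight mapping and that features with near-zero weights contribute little; the function name and threshold are illustrative, not part of the existing module.

```python
def trim_features(weights, threshold=0.01):
    """Drop features whose absolute weight falls below a threshold.

    weights: dict mapping feature name -> learned MaxEnt weight.
    Returns a new dict containing only the retained features.
    """
    return {f: w for f, w in weights.items() if abs(w) >= threshold}

model = {"lemma=dog": 1.3, "tag=n": -0.004, "pair=(pas,kuche)": 0.6}
trimmed = trim_features(model)
print(sorted(trimmed))  # ['lemma=dog', 'pair=(pas,kuche)']
```

More principled alternatives (L1 regularisation during training, frequency cut-offs) would also be worth evaluating, since a hard weight threshold can discard rare but decisive features.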

Week 11:

  • Apply the model to different language pairs and generate lexical selection rules and bidix entries.

Week 12:

  • Wrap up / documentation

Skills, qualifications and field of study

I am a graduate student in Computer Science, holding a Bachelor's degree in Computing. I have an excellent knowledge of Java and C#, and I am fairly comfortable with C/C++ and scripting languages.

Machine learning is one of my strongest skills. I have worked on quite a few ML projects, involving named-entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimising models, and with feature selection and feature extraction for classification. I did my Bachelor's thesis in the field of computer vision, and my Master's thesis is in the field of natural language processing.

I also took part in last year's GSoC, and have additionally worked on the sh-mk and sh-en language pairs.

Non-GSoC activities

My Master's thesis is due on 28.6, but I intend to focus on it intensively before the coding period starts (27.5).

I might also be moving to the United States for a few months as part of a work-and-travel programme, so I could be offline for a couple of days around 10.6.