User:Fpetkovski/GSOC 2013 Application - Improving the lexical selection module


The lexical selection module in Apertium is currently a prototype. There are many optimisations that could be made to make it faster and more efficient. There are a number of scripts which can be used for learning lexical-selection rules, but the scripts are not particularly well written. Part of the task will be to rewrite the scripts taking into account all possible corner cases.

The project idea is located here.

Personal Info

First name: Filip
Last name: Petkovski
email: filip.petkovsky@gmail.com
IRC: fpetkovski on #apertium

Why are you interested in machine translation?

Machine translation can be thought of as one of the greatest challenges in natural language processing. It is the single most useful application of NLP, and building a good MT system requires a blend of numerous techniques from both computer science and linguistics.

Why is it that you are interested in the Apertium project?

Apertium is a great project. It is obvious that a great deal of work has been put into both developing the platform and creating the language resources. Apertium is a nice blend of rule-based and corpus-based machine translation. It also allows me to easily work with my native language, as well as with other closely related languages from the Slavic language group.

Why should Google and Apertium sponsor it?

Lexical selection is the task of deciding which word to use in a given context. A good lexical selection module can significantly increase translation quality, and give machine translation a more human-like feel.
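
To make this concrete, here is a toy sketch (the word pair is real, but the trigger sets are invented for illustration, and actual Apertium rules are written in the lrx XML format and learned from corpora, not hand-coded in Python). The Serbo-Croatian word "kosa" can translate into English as "hair" or "scythe", and the surrounding words decide which:

  # Toy sketch of context-based lexical selection. The trigger sets are
  # illustrative only; real rules are stored as lrx XML.
  RULES = [
      ({"duga", "plava"}, "kosa", "hair"),      # "long"/"blonde" context
      ({"trava", "kositi"}, "kosa", "scythe"),  # "grass"/"to mow" context
  ]

  def select(src_word, context, default):
      """Pick a translation for src_word given its sentence context."""
      for triggers, word, translation in RULES:
          if word == src_word and triggers & set(context):
              return translation
      return default

  print(select("kosa", ["kositi", "trava"], "hair"))  # -> "scythe"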

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in the "Improving the lexical selection module" task.

I intend to improve the existing scripts and programs, and merge them into a release-ready package. I also intend to extend the existing functionality of the module.

Work already done

  • Generated lexical selection rules from a parallel corpus for the sh-mk language pair (committed to SVN)
  • Generated additional bidix entries from a parallel corpus for the sh-mk language pair (committed to SVN)
  • Participated in last year's GSoC (corpus-based feature transfer)

Work to do

Community bonding period:

  • go through the training process for monolingual rule extraction
  • go through the training process for MaxEnt rule extraction (monolingual/parallel)
  • document the results

Week 1:

  • Update the instructions on the wiki
  • Remove unused and redundant scripts (prefixed with unused.).
  • Do proper processing of tags in all scripts. (fixed with FSTProcessor::biltransWithoutQueue)
  • Fix tokenization (fixed in scripts/common.py with tokenize_biltrans_line; see the sketch after this list).
  • Make sure that capitalisation, "any tag" and "any character" work as expected (fixed in tokenization).
  • Ensure that all scripts process escaped characters correctly, e.g. ^ \ / $ < > (fixed with tokenization)
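
As a rough illustration of what the tokenization has to handle (this is a simplified sketch, not the actual tokenize_biltrans_line from scripts/common.py): a biltrans line is a sequence of ^...$ lexical units, translations inside a unit are separated by unescaped slashes, and the characters ^ \ / $ < > may appear backslash-escaped:

  import re

  # Match one lexical unit: '^', then escaped characters or anything
  # that is not an unescaped ^, $ or backslash, then '$'.
  LU_RE = re.compile(r'\^(?:\\.|[^\\^$])*\$')

  def tokenize_biltrans_line(line):
      """Return the ^...$ lexical units of a biltrans output line."""
      return LU_RE.findall(line)

  def split_translations(lu):
      """Split one unit into its source side and candidate translations."""
      body = lu[1:-1]                      # drop the ^ and $ delimiters
      parts = re.split(r'(?<!\\)/', body)  # split on unescaped slashes only
      return parts[0], parts[1:]

  line = r'^dog<n><pl>/куче<n><pl>$ ^A\/B<n>/А\/Б<n>$'
  for lu in tokenize_biltrans_line(line):
      print(split_translations(lu))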

Week 2:

  • Script/program for finding possibly missing bidix entries from an aligned parallel corpus (see the sketch after this list).
  • Make sure that <match lemma="*" tags="*"/> works the same as <match/>
  • Fix the bug where <match/> doesn't match an LU when the lemma is a comma (,)
  • Fix bug10 in the testing dir.
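
A rough sketch of the idea behind the missing-entries script (the input formats below are simplified assumptions; the real script would read word alignments and the compiled bidix):

  from collections import Counter

  def missing_entries(aligned_pairs, bidix, min_count=5):
      """Flag frequent aligned lemma pairs absent from the bidix.

      aligned_pairs: iterable of (src_lemma, trg_lemma) tuples taken
      from a word-aligned parallel corpus; bidix: dict mapping a source
      lemma to the set of its known translations.
      """
      counts = Counter(aligned_pairs)
      return [(src, trg, n) for (src, trg), n in counts.most_common()
              if n >= min_count and trg not in bidix.get(src, set())]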

Week 3:

  • Merge the four different implementations of irstlm_ranker into a single implementation
  • Add an option to the ranker which marks translations that fall outside of xx% of the probability mass for a given sentence |@| |+| |-| (see the sketch after this list)
  • Move lex-learner to lex-tools
  • Run through and document new training process with a language pair (mk-en, br-fr, or en-es)
  • Demonstrate bidix extraction script with a language pair (e.g. es-pt)
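
A sketch of the probability-mass idea behind the new ranker option (it assumes the ranker has per-candidate log-probability scores for a sentence; the actual marker symbols and I/O belong to the existing irstlm_ranker):

  import math

  def outside_mass(log_scores, mass=0.8):
      """Return True for candidates outside the top `mass` share of
      probability, taking candidates in decreasing probability order."""
      probs = [math.exp(s) for s in log_scores]
      total = sum(probs)
      order = sorted(range(len(probs)), key=lambda i: -probs[i])
      kept, covered = set(), 0.0
      for i in order:
          if covered / total >= mass:
              break
          kept.add(i)
          covered += probs[i]
      return [i not in kept for i in range(len(probs))]

  print(outside_mass([-1.0, -2.0, -6.0]))  # -> [False, False, True]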

Weeks 4-6:

  • Rewrite the LRXProcessor::processME and LRXProcessor::process methods so that they share more code and are better modularised. A 650-line method is hard to maintain.
  • Work on a way to trim non-significant features from the maximum-entropy models.
    • probability mass: discard features which fall outside of xx% of the probability mass (e.g. 80%); the threshold should be configurable
    • outcome pruning: discard features that select a translation which can never win, i.e. one where the sum of the weights of all the contexts in which it appears never adds up to more than the sum of the weights of all the other translations
  • Implement poor man's alignment: instead of using GIZA++, use tagged corpora and look up whether the equivalent word appears (see the sketch after this list).
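
A minimal sketch of the poor man's alignment idea (the data structures are simplified; the real implementation would query the compiled bilingual dictionary rather than a Python dict):

  def poor_mans_align(src_lemmas, trg_lemmas, bidix):
      """Yield (src_index, trg_index) pairs whenever a bidix translation
      of the source lemma shows up in the target sentence."""
      for i, src in enumerate(src_lemmas):
          candidates = bidix.get(src, set())
          for j, trg in enumerate(trg_lemmas):
              if trg in candidates:
                  yield (i, j)

  bidix = {"dog": {"куче"}, "bark": {"лае"}}
  print(list(poor_mans_align(["dog", "bark"], ["куче", "лае"], bidix)))
  # -> [(0, 0), (1, 1)]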

Weeks 7-9:

  • ...

Weeks 9-10:

  • Apply the model to different language pairs and generate lexical selection rules and bidix entries.
    • eu-es, es-fr, es-pt, mk-en, br-fr, en-es

Weeks 11-12:

  • Wrap up / write paper

Skills, qualifications and field of study

I am a graduate student of Computer Science, holding a Bachelor's degree in Computing. I have an excellent knowledge of Java and C#, and I'm fairly comfortable with C/C++ and scripting languages.


Machine learning is one of my strongest skills. I have worked on quite a few ML projects involving named-entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimising models, and with feature selection and feature extraction for classification. I did my Bachelor's thesis in the field of computer vision, and my Master's thesis is in the field of natural language processing.


I have also taken part in last year's GSoC, and have in addition worked on the sh-mk and sh-en language pairs.

Non-GSoC activities

My Master's thesis is due 28.6, but I intend to focus on it intensively before the coding period starts (27.5).

I also might be moving to the United States for a few months as part of a work and travel programme, so I might be offline for a couple of days around 10.6.