User:Fpetkovski/GSOC 2013 Application - Improving the lexical selection module

From Apertium

Revision as of 09:26, 30 April 2013

The lexical selection module in Apertium is currently a prototype, and many optimisations could make it faster and more efficient. Several scripts exist for learning lexical-selection rules, but they are not particularly well written. Part of the task will be to rewrite these scripts, taking all possible corner cases into account.

The project idea is described here: http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Improvements_in_lexical-selection_module

TODO list:

  • Merge the four different implementations of irstlm_ranker into a single implementation
  • Move lex-learner to lex-tools
  • Write a script or program that finds possibly missing bidix entries in an aligned parallel corpus.
  • Do proper processing of tags in all scripts.
  • Remove unused and redundant scripts.
  • Work on a way to trim non-significant features from the maximum-entropy models.
  • Rewrite the LRXProcessor::processME and LRXProcessor::process methods so that they share more code and are more modular. A single 650-line method is too long to maintain.
  • Make sure that capitalisation, any-tag matching and any-character matching work as expected.
  • Ensure that all scripts process escaped characters correctly, e.g. ^ \ / $ < >
  • Make sure that <match lemma="*" tags="*"/> works the same as <match/>
  • Update the instructions on the wiki
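
The escaped-character item above can be illustrated with a minimal sketch. It assumes the scripts are written in Python and that reserved characters in the stream are escaped with a preceding backslash; the helper names `split_unescaped` and `unescape` are hypothetical and do not come from the existing scripts.

```python
def split_unescaped(text, delimiter="/"):
    """Split text on delimiter, skipping delimiters escaped with a backslash.

    Hypothetical helper: escape sequences are kept intact in the output so
    that later processing steps can still tell them apart from real markup.
    """
    parts = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Copy the backslash and the escaped character together.
            current.append(text[i:i + 2])
            i += 2
        elif ch == delimiter:
            # An unescaped delimiter ends the current part.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def unescape(text):
    """Remove one level of backslash escaping (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Splitting with a plain `str.split("/")` would cut a lemma such as `a\/b` in two, which is the kind of corner case the rewritten scripts need to avoid. Scanning character by character also handles an escaped backslash that is immediately followed by a real delimiter.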

Personal Info

First name: Filip
Last name: Petkovski
email: filpetkovski@gmail.com
IRC: fpetkovski on #apertium

Why are you interested in machine translation?

Why is it that you are interested in the Apertium project?

Why should Google and Apertium sponsor it?

Which of the published tasks are you interested in? What do you plan to do?

Work already done

Work to do

Skills, qualifications and field of study

Non-GSoC activities