User:Fpetkovski/GSOC 2013 Application - Improving the lexical selection module


The lexical selection module in Apertium is currently a prototype, and there are many optimisations that could make it faster and more efficient. There are also a number of scripts for learning lexical-selection rules, but they are not particularly well written. Part of the task will be to rewrite these scripts so that they handle all possible corner cases.

The project idea is located here:

TODO list:

  • Merge the four different implementations of irstlm_ranker into a single implementation
  • Move lex-learner to lex-tools
  • Script/program for finding possibly missing bidix entries from an aligned parallel corpus.
  • Do proper processing of tags in all scripts.
  • Remove unused and redundant scripts.
  • Work on a way to trim non-significant features from the maximum-entropy models.
  • Rewrite the LRXProcessor::processME and LRXProcessor::process methods so that they share more code and are more modularised. A 650-line method is too long to maintain.
  • Make sure that capitalisation, any-tag matching and any-character matching work as expected.
  • Ensure that all scripts process escaped characters correctly, e.g. ^ \ / $ < >
  • Make sure that <match lemma="*" tags="*"/> works the same as <match/>
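
As an illustration of the escaped-character item above, here is a minimal sketch of escape-aware splitting of Apertium stream-format readings. It assumes only that a backslash escapes the following character, that / separates readings and that tags are written as <tag>. The function names split_unescaped and split_lemma_tags are hypothetical and are not taken from the existing scripts.

```python
def split_unescaped(text, sep):
    """Split text on sep, skipping any sep that is escaped with a backslash."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Keep the backslash and the escaped character together.
            current.append(text[i:i + 2])
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def split_lemma_tags(reading):
    """Split 'lemma<tag1><tag2>' into (lemma, [tags]), honouring escapes.

    Escape sequences are left in the lemma unchanged. An unterminated
    tag raises ValueError (from str.index).
    """
    lemma, tags, i = [], [], 0
    while i < len(reading):
        ch = reading[i]
        if ch == "\\" and i + 1 < len(reading):
            lemma.append(reading[i:i + 2])
            i += 2
        elif ch == "<":
            end = reading.index(">", i)
            tags.append(reading[i + 1:end])
            i = end + 1
        else:
            lemma.append(ch)
            i += 1
    return "".join(lemma), tags


# Hypothetical input: two readings, the first with an escaped slash in the lemma.
readings = split_unescaped(r"km\/h<n><sg>/hour<n>", "/")
parsed = [split_lemma_tags(r) for r in readings]
```

A naive text.split("/") or a regular expression such as <[^>]+> would break on the escaped slash here, which is the kind of corner case the rewritten scripts need to cover.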