User:Fpetkovski/GSOC 2013 Application - Improving the lexical selection module
The lexical selection module in Apertium is currently a prototype. There are many optimisations that could make it faster and more efficient. There are a number of scripts which can be used for learning lexical-selection rules, but they are not particularly well written. Part of the task will be to rewrite the scripts, taking into account all possible corner cases.
The project idea is located here: http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Improvements_in_lexical-selection_module
- Merge the four different implementations of irstlm_ranker into a single implementation
- Move lex-learner to lex-tools
- Script/program for finding possibly missing bidix entries from an aligned parallel corpus.
- Do proper processing of tags in all scripts.
- Remove unused and redundant scripts.
- Work on a way to trim non-significant features from the maximum-entropy models.
- Rewrite the LRXProcessor::process methods so that they share more code and are more modularised. Having a 650-line method is not the right thing.
- Make sure that capitalisation, any tag and any character work as expected.
- Ensure that all scripts process escaped characters correctly, e.g. ^ \ / $ < >
- Make sure that <match lemma="*" tags="*"/> works the same as <match/>.
- Update the instructions on the wiki
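As an illustration of the escaped-character task above, here is a minimal sketch of escape-aware parsing of a lexical unit in the Apertium stream format. It is not the existing lex-tools code: the function names `split_unescaped`, `parse_lexical_unit` and `split_lemma_tags` are hypothetical, and the sketch assumes that a backslash escapes the character that follows it and that the closing `$` of the unit is not itself escaped.

```python
def split_unescaped(text, sep):
    """Split text on sep, skipping any sep that is escaped with a backslash."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Copy the escape sequence through untouched (backslash + next char).
            current.append(text[i:i + 2])
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def parse_lexical_unit(unit):
    """Parse '^surface/analysis1/analysis2$' into (surface, [analyses]).

    Assumption: the final '$' is the unescaped terminator of the unit.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("not a lexical unit: %r" % unit)
    fields = split_unescaped(unit[1:-1], "/")
    return fields[0], fields[1:]


def split_lemma_tags(analysis):
    """Split 'lemma<tag1><tag2>' into (lemma, [tags]), honouring escapes.

    Simplification: any text outside <...> is treated as part of the lemma.
    """
    lemma, tags, current, in_tag, i = [], [], [], False, 0
    while i < len(analysis):
        ch = analysis[i]
        if ch == "\\" and i + 1 < len(analysis):
            # An escaped '<' or '>' is ordinary text, not a tag delimiter.
            (current if in_tag else lemma).append(analysis[i:i + 2])
            i += 2
            continue
        if ch == "<" and not in_tag:
            in_tag, current = True, []
        elif ch == ">" and in_tag:
            tags.append("".join(current))
            in_tag = False
        elif in_tag:
            current.append(ch)
        else:
            lemma.append(ch)
        i += 1
    return "".join(lemma), tags
```

Escape sequences are deliberately left in place rather than decoded, so that output written back to the stream stays valid. A script that splits on every `/` or `<` with a plain string split would break on input such as `^a\/b/a\/b<n><sg>$`.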
- 1 Personal Info
- 2 Why are you interested in machine translation?
- 3 Why is it that you are interested in the Apertium project?
- 4 Why should Google and Apertium sponsor it?
- 5 Which of the published tasks are you interested in? What do you plan to do?
- 6 Work already done
- 7 Work to do
- 8 Skills, qualifications and field of study
- 9 Non-GSoC activities
Personal Info
First name: Filip
Last name: Petkovski
IRC: fpetkovski on #apertium
Why are you interested in machine translation?
Machine translation is one of the greatest challenges in natural language processing. It is arguably the single most useful application of NLP, and building a good MT system requires a blend of techniques from both computer science and linguistics.