User:Fpetkovski/GSOC 2013 Application - Improving the lexical selection module

The lexical selection module in Apertium is currently a prototype. There are many optimisations that could be made to make it faster and more efficient. There are a number of scripts which can be used for learning lexical-selection rules, but the scripts are not particularly well written. Part of the task will be to rewrite the scripts taking into account all possible corner cases.

The project idea is located here.

Personal Info

First name: Filip
Last name: Petkovski
IRC: fpetkovski on #apertium

Why are you interested in machine translation?

Machine translation can be thought of as one of the greatest challenges in natural language processing. It is the single most useful application of NLP and building a good MT system requires a blend of numerous different techniques from both computer science and linguistics.

Why is it that you are interested in the Apertium project?

Apertium is a great project. It is obvious that a ton of work has been put into both developing the platform and creating the language resources. Apertium is a nice blend between rule based and corpus based machine translation. It also allows me to easily work with my native language, as well as with other closely related languages from the Slavic language group.

Why should Google and Apertium sponsor it?

Lexical selection is the task of deciding which word to use in a given context. A good lexical selection module can significantly increase translation quality, and give machine translation a more human-like feel.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in Improving the lexical selection module.

I intend to improve the existing scripts and programs, and merge them into a release-ready package. I also intend to extend the existing functionality of the module.

Work already done

  • Generate lexical selection rules from a parallel corpus for the sh-mk language pair (submitted on svn)
  • Generate additional bidix entries from a parallel corpus for the sh-mk language pair (submitted on svn)
  • Participated in last year's GSoC (Corpus based feature transfer)

Work to do

After May 21

Week 1 - 4: During the creation of the sh-mk language pair, some assumptions were made about the grammar of Macedonian, and the transfer rules were built on those assumptions. As a result, we get translations like "Хрватската очекува...", meaning "The Croatia is expecting...". Since this project will try to deal with problems like definiteness, the transfer rules need to be changed. Another issue is the coverage of Croatian. For context to be used properly, vocabulary coverage needs to be as high as possible, since the words themselves and their tags will be the predominant features.

First milestone.

Week 5 and 6: Deal with definiteness. Nouns in Serbo-Croatian are not marked for definiteness, so that feature has to be inferred from the context.
Example: Hrvatska vlada izjavila je ... -> Хрватската влада изјави...
Serbo-Croatian - English example: Hrvatska vlada izjavila je ... -> The Croatian government said...

Week 7 and 8: Preposition selection. Different languages use prepositions differently depending on the context, and writing rules for every preposition and every possible situation would be very demanding. The task: for every preposition in the source language, choose the appropriate preposition in the target language.
Example (from the existing sh-mk language pair): Kapetan je uvijek s tih devetoro mladih pilota -> Капетанот е секогаш од тие деветмина млади пилоти.
The biggest problem here is that the incorrect preposition completely changes the meaning of the sentence. The original sentence says that the Captain is always with the nine young pilots, and the translated one says that the Captain is always one of the nine young pilots.
English - Serbo-Croatian examples:
Predstava počinje u 3pm -> The show starts at 3pm.
Predstava počinje u utorak -> The show starts on Tuesday.
I took this from him. -> Ovo sam uzeo od njega.
He is from Macedonia. -> On je iz Makedonije.

Second milestone.

Week 9 and 10: Deal with aspect. In some languages, the aspect of the verb is not expressed through inflection, and consequently it cannot be determined from the verb itself. Some languages, such as English, use auxiliary verbs to express aspect, and some, such as the Slavic languages, use prefixes. The task: classify each verb as perfective or imperfective (or progressive).
Example (from the existing sh-mk pair):
Ako trema nestane... -> Ако тремата исчезне...
Trema nestane kada ... -> Тремата исчезнува кога...
English - Serbo-Croatian example:
Igrao sam nogomet 3 puta. -> I have played football three times.
Igrao sam nogomet jučer. -> I played football yesterday.

Week 11 and 12: Deal with the possessive / partitive genitive. Depending on the language, specific varieties of genitive-noun–main-noun relationships may include possession, composition, origin etc. A problem arises because Macedonian lacks case, so a genitive-noun–main-noun combination is translated differently depending on the relationship it describes.
Example:
Čaša vode. -> Чаша со вода.
Čaša moje sestre. -> Чашата на мојата сестра.
English - Serbo-Croatian example:
Čaša vode. -> A glass of water.
Čaša moje sestre. -> My sister's glass.
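
The tasks in this plan share one shape: for an ambiguous source word, pick a target translation based on features of the surrounding context. A minimal sketch of that idea is shown below, using plain frequency counts rather than the maximum-entropy models the module actually uses. The function names, the feature strings (`next=time`, `next=weekday`) and the toy training triples are hypothetical and only mirror the preposition examples above.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count target choices per (source word, context feature) pair.

    `examples` is an iterable of (source_word, context_feature, target_word)
    triples, for instance extracted from a word-aligned parallel corpus.
    """
    counts = defaultdict(Counter)
    for source, context, target in examples:
        counts[(source, context)][target] += 1
    return counts


def select(counts, source, context, default):
    """Return the most frequent target seen for this source word in this
    context, or `default` if the pair was never observed."""
    seen = counts.get((source, context))
    if not seen:
        return default
    return seen.most_common(1)[0][0]


# Hypothetical toy data: Serbo-Croatian "u" before a clock time
# versus before a day of the week.
examples = [
    ("u", "next=time", "at"),
    ("u", "next=time", "at"),
    ("u", "next=weekday", "on"),
]
model = train(examples)

print(select(model, "u", "next=time", "in"))
print(select(model, "u", "next=weekday", "in"))
```

A real selector would combine many features per word and weight them, which is where the maximum-entropy training comes in; the lookup-with-fallback structure stays the same.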

TODO list:

  • Merge the four different implementations of irstlm_ranker into a single implementation
  • Move lex-learner to lex-tools
  • Script/program for finding possibly missing bidix entries from an aligned parallel corpus.
  • Do proper processing of tags in all scripts.
  • Remove unused and redundant scripts.
  • Work on a way to trim non-significant features from the maximum-entropy models.
  • Rewrite the LRXProcessor::processME and LRXProcessor::process methods so that they share more code and are more modularised. A 650-line method is too long to maintain.
  • Make sure that capitalisation, any tag and any character work as expected.
  • Ensure that all scripts process escaped characters correctly, e.g. ^ \ / $ < >
  • Make sure that <match lemma="*" tags="*"/> works the same as <match/>
  • Update the instructions on the wiki
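
As an illustration of the escaped-character item, here is a minimal sketch of splitting the body of a lexical unit on unescaped separators only. It assumes the body has the form `surface/analysis1/analysis2` and that a backslash escapes the character after it; the function name `split_unescaped` is hypothetical and is not taken from the existing scripts.

```python
def split_unescaped(text, sep="/"):
    """Split `text` on `sep`, skipping separators escaped with a backslash."""
    parts = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Copy the backslash and the escaped character together,
            # so the escape is still visible to later processing steps.
            current.append(text[i:i + 2])
            i += 2
            continue
        if ch == sep:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


# An escaped slash stays inside the first field.
print(split_unescaped(r"a\/b/c<n>"))
```

A naive `text.split("/")` would cut the first field in two here, which is the kind of corner case the scripts need to handle consistently.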

Skills, qualifications and field of study

I am a graduate student of Computer Science, holding a Bachelor's degree in Computing. I have an excellent knowledge of Java and C#, and I'm fairly comfortable with C/C++ and scripting languages.

Machine learning is one of my strongest skills. I have worked on quite a few ML projects involving named entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimising models, and with feature selection and feature extraction for classification. I did my bachelor's thesis in the field of computer vision, and my master's thesis is in the field of natural language processing.

I also took part in last year's GSoC, and have in addition worked on the sh-mk and sh-en language pairs.

Non-GSoC activities

My Master's thesis is due 28.6 but I intend to focus on it intensively before the coding period starts (27.5).

I also might be moving to the United States for a few months as a part of a work and travel programme, so I might be offline for a couple of days around 10.6.