Difference between revisions of "User:Fpetkovski/GSoC-2012 Application"

From Apertium

Revision as of 11:27, 27 March 2012

Personal Info

First name: Filip
Last name: Petkovski
email: filip.petkovsky@gmail.com
fpetkovski on IRC: #apertium

Why are you interested in machine translation?

Machine translation is one of the greatest challenges in natural language processing. It is the single most useful application of NLP, and building a good MT system requires a blend of techniques from both computer science and linguistics.

Why is it that you are interested in the Apertium project?

Apertium is a great project. A great deal of work has clearly gone into both developing the platform and creating the language resources. However, there is always more to be done, and being part of this project is a perfect opportunity to make a big contribution to society.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in building a Corpus-based lexicalised feature transfer module which will set tags based on a corpus-generated model.

Work already done

Started the apertium-sh-en language pair in the incubator.

Created a stream processor for the output of apertium-transfer that reads the stream character by character (branches/gsoc2012/fpetkovski/stream-rocessor).

Created a stream processor for the output of apertium-transfer that removes stop words specified in a dictionary (branches/gsoc2012/fpetkovski/stopwords-filter).

Work to do (Incomplete)

Prior to May 21

Finish the stream processor so that it parses lemmas and tokens into a data structure. This will be useful later for extracting features from lexical units in the stream.
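
A minimal sketch of what such a data structure could look like is given here in Python for brevity. It assumes the usual Apertium stream conventions (a lexical unit is written as ^lemma<tag1><tag2>$) and ignores escaped characters and chunks; the names LexicalUnit and parse_stream are hypothetical and are not taken from the branch above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LexicalUnit:
    """Hypothetical container for one lexical unit: a lemma plus its tags."""
    lemma: str
    tags: list = field(default_factory=list)


# Assumption: units are flat (no chunks) and contain no escaped ^, $ or < characters.
UNIT_RE = re.compile(r"\^(.*?)\$")
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_stream(text):
    """Return the lexical units found in the stream, in order of appearance."""
    units = []
    for body in UNIT_RE.findall(text):
        lemma = body.split("<", 1)[0]  # everything before the first tag
        units.append(LexicalUnit(lemma, TAG_RE.findall(body)))
    return units


units = parse_stream("^Mary<np><ant><f><sg>$ ^see<vblex><pri><p3><sg>$")
```

Each unit then exposes its lemma and tag list directly, which is the form the later feature extraction steps would consume.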

Set baselines for definiteness, pronoun insertion, preposition selection, aspect, etc.
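
The kind of baseline is not fixed yet. One common choice, shown here purely as an assumption, is a majority-class baseline: always predict the label seen most often in training and measure its accuracy. The function names and the labels "def", "ind" and "none" are illustrative.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always answers with the most frequent training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda _item: most_common_label


def accuracy(predict, items, gold):
    """Fraction of items for which the prediction equals the gold label."""
    correct = sum(1 for item, label in zip(items, gold) if predict(item) == label)
    return correct / len(gold)


# Toy labels, made up for illustration only.
predict = majority_baseline(["def", "def", "ind", "none"])
score = accuracy(predict, ["a", "b", "c", "d"], ["def", "ind", "def", "def"])
```

Any learned model for a given task would then have to beat this score to justify its added complexity.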

Create a simple n-gram model and see how it performs.
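
As a rough illustration, a bigram model with add-one smoothing could look like the sketch below. The order of the model, the smoothing method and the class name BigramModel are all assumptions; the sequences could equally be lemmas or tags.

```python
from collections import Counter


class BigramModel:
    """Bigram model with add-one (Laplace) smoothing over token sequences."""

    def __init__(self):
        self.bigrams = Counter()   # counts of (previous, current) pairs
        self.contexts = Counter()  # counts of each token used as a context
        self.vocab = set()

    def train(self, sentences):
        for sentence in sentences:
            tokens = ["<s>"] + list(sentence) + ["</s>"]
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def prob(self, prev, cur):
        """Smoothed estimate of P(cur | prev)."""
        return (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + len(self.vocab))


# Toy training data, made up for illustration only.
model = BigramModel()
model.train([["the", "cat"], ["the", "dog"], ["a", "cat"]])
```

With this toy data, model.prob("<s>", "the") is higher than model.prob("<s>", "a"), because "the" opens two of the three sentences.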

After May 21

Week 1 and 2:
Deal with definiteness. The task: classify each noun as definite, indefinite or none.

Marica vidi mačku -> Mary sees the cat.

Choose features which describe whether a noun should be definite, indefinite or neither. These can include whether the noun was mentioned before, the tags of the surrounding/trailing words and similar context-based information. I would start with English and optimize the model there.
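
The features mentioned above could be extracted roughly as follows. This is only a sketch: it assumes each lexical unit is available as a (lemma, tags) pair, and the feature names and the "NONE" placeholder are illustrative choices, not a final feature set.

```python
def first_tag(units, i, default="NONE"):
    """First tag of the unit at position i, or a placeholder if there is none."""
    if 0 <= i < len(units) and units[i][1]:
        return units[i][1][0]
    return default


def definiteness_features(units, i):
    """Context features for the noun at position i in a list of (lemma, tags) pairs."""
    lemma, _tags = units[i]
    earlier_lemmas = {l for l, _ in units[:i]}
    return {
        "mentioned_before": lemma in earlier_lemmas,  # was the noun seen earlier?
        "prev_tag": first_tag(units, i - 1),          # tag of the preceding word
        "next_tag": first_tag(units, i + 1),          # tag of the following word
    }


# Toy input, made up for illustration only.
units = [("Mary", ["np"]), ("see", ["vblex"]), ("cat", ["n"]), ("cat", ["n"])]
first_mention = definiteness_features(units, 2)
second_mention = definiteness_features(units, 3)
```

A dictionary of this kind can be passed to most off-the-shelf classifiers after one-hot encoding the string-valued features.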

Week 2 and 3:
Deal with the pronoun + verb problem. In some languages (English for example) verbs do not mark gender and/or person, and a personal pronoun is often (but not always) inserted before the verb. This is a problem when the target language is of this type and the source language omits the pronoun. The task: classify each verb as first, second or third person (fp, sp or tp) and as singular or plural (sg or pl).

Marica traži Ivicu. Traži ga ali ... -> Mary is looking for James. She is looking for him ...

First milestone.

Week 4 and 5:
Test the models for different languages and make corrections where needed. The models that work for English will very likely have to be modified for different language groups. However, languages from the same group are expected to share similar properties.

Week 7:
Deal with preposition selection. Different languages use prepositions differently depending on the context, and writing rules for every preposition and every possible situation would be very demanding. The task: for every preposition in the source language, choose the appropriate preposition in the target language.

I came from school. -> Došao sam iz škole.
I took this from him. -> Ovo sam uzeo od njega.
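
One simple way to approach this, sketched under the assumption that aligned (source preposition, context word, target preposition) triples can be extracted from a corpus, is to pick the target preposition seen most often with the same context and to back off to the most frequent translation of the source preposition otherwise. The class name and the training triples below are illustrative only.

```python
from collections import Counter, defaultdict


class PrepositionSelector:
    """Frequency-based choice of a target preposition, with a context-free backoff."""

    def __init__(self):
        self.by_context = defaultdict(Counter)  # (source prep, context) -> target counts
        self.by_prep = defaultdict(Counter)     # source prep -> target counts

    def train(self, examples):
        for src_prep, context, tgt_prep in examples:
            self.by_context[(src_prep, context)][tgt_prep] += 1
            self.by_prep[src_prep][tgt_prep] += 1

    def choose(self, src_prep, context):
        """Most frequent target preposition for this context, else for the preposition alone."""
        for counts in (self.by_context.get((src_prep, context)), self.by_prep.get(src_prep)):
            if counts:
                return counts.most_common(1)[0][0]
        return None  # source preposition never seen in training


# Toy triples modelled on the two examples above; not real corpus data.
selector = PrepositionSelector()
selector.train([("from", "school", "iz"), ("from", "him", "od"), ("from", "house", "iz")])
```

With this toy data, selector.choose("from", "him") gives "od", while an unseen context such as "town" falls back to "iz", the most frequent translation of "from".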

Second milestone

Week 8 and 9: Deal with aspect.

Week 9 and 10: Deal with reflexive / impersonal verbs.

Week 11 and 12:
Implementation in Apertium

Week 12:

-- Yet to be finished --

Skills, qualifications and field of study

I am a graduate student of Computer Science, holding a Bachelor's degree in Computing. I have an excellent knowledge of Java and C#, and I'm fairly comfortable with C/C++ and scripting languages.

Machine learning is one of my strongest skills. I have worked on quite a few ML projects involving named entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimizing models, and with feature selection and feature extraction for classification.

I did my bachelor's thesis in the field of computer vision, and my master's thesis is in the field of natural language processing.

Non-GSoC activities

I have final exams at the beginning of June, but I will still be able to work more than 30 hours per week.