User:Fpetkovski/GSoC-2012 Application


Personal Info

First name: Filip
Last name: Petkovski
email: filip.petkovsky@gmail.com
fpetkovski on IRC: #apertium

Why are you interested in machine translation?

Machine translation can be thought of as one of the greatest challenges in natural language processing. It is arguably the single most useful application of NLP, and building a good MT system requires a blend of numerous different techniques from both computer science and linguistics.


Why is it that you are interested in the Apertium project?

Apertium is a great project. It is obvious that a ton of work has been put into both developing the platform and creating the language resources. However, there is always more work to be done and being a part of this project is a perfect opportunity to make a big contribution to society.


Which of the published tasks are you interested in? What do you plan to do?

I'm interested in building a Corpus-based lexicalised feature transfer module which will set tags based on a corpus-generated model.

Work already done

Started the apertium-sh-en language pair in the incubator.

Created a stream processor for the output of apertium-transfer that reads it character by character (branches/gsoc2012/fpetkovski/stream-processor).

Created a stream processor for the output of apertium-transfer that removes stop words specified in a dictionary (branches/gsoc2012/fpetkovski/stopwords-filter).
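
As a rough illustration of the stop-word filter, the sketch below drops every lexical unit whose lemma appears in a stop-word set. It assumes the stream consists of units of the form ^lemma<tag>...$ with no escaped characters; the function name filter_stopwords and the sample tags are hypothetical and are not taken from the code in the branch.

```python
import re

# Assumption: one lexical unit looks like ^lemma<tag1><tag2>$ and contains
# no escaped ^ or $ characters.
LEXICAL_UNIT = re.compile(r"\^([^<$]*)((?:<[^>]*>)*)\$")


def filter_stopwords(stream, stopwords):
    """Remove lexical units whose lemma is in `stopwords`; keep the rest."""
    def keep_or_drop(match):
        lemma = match.group(1).lower()
        return "" if lemma in stopwords else match.group(0)

    filtered = LEXICAL_UNIT.sub(keep_or_drop, stream)
    # Collapse the double spaces left behind by removed units.
    return re.sub(r" {2,}", " ", filtered).strip()


if __name__ == "__main__":
    # Hypothetical sample stream, for illustration only.
    print(filter_stopwords("^see<vblex><pri>$ ^the<det><def>$ ^cat<n><sg>$", {"the"}))
```

A real filter would read the stop words from the dictionary file and would also have to respect escaped characters and superblanks in the stream.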

Work to do (Incomplete)

The big picture

The idea is to construct separate modules so each of them will deal with setting one lexical feature. The modules can later be combined into a single one where each of them could be turned on using a flag.

The best place to insert the new module would probably be after disambiguation and before lexical transfer so the newly generated tags can be used in transfer. Some adjustments to the existing transfer rules will therefore have to be made.

It is worth noting that this module will not deal with any type of disambiguation or with anaphora resolution.

Week 1 and 2: Making adjustments in the existing resources.
Week 3 and 4: Definiteness.
First milestone.
Week 5 and 6: Preposition selection
Week 7 and 8:
Second milestone.
Week 9 and 10: Aspect
Week 11 and 12: Reflexive verbs

Prior to May 21

Finish the stream processor so that it parses lemmas and tags into a data structure. This will be useful later for extracting features from lexical units in the stream.
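
One possible shape for that data structure is sketched below. It assumes each lexical unit in the stream has the form ^lemma<tag>...$ with no escaped characters; the names LexicalUnit and parse_stream are hypothetical, not part of the existing stream processor.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumption: units look like ^lemma<tag1><tag2>$ with no escaped characters.
LEXICAL_UNIT = re.compile(r"\^([^<$]*)((?:<[^>]*>)*)\$")
TAG = re.compile(r"<([^>]*)>")


@dataclass
class LexicalUnit:
    lemma: str
    tags: List[str]


def parse_stream(stream: str) -> List[LexicalUnit]:
    """Turn a stream of lexical units into a list of LexicalUnit objects."""
    units = []
    for match in LEXICAL_UNIT.finditer(stream):
        lemma, tag_string = match.group(1), match.group(2)
        units.append(LexicalUnit(lemma, TAG.findall(tag_string)))
    return units


if __name__ == "__main__":
    # Hypothetical sample stream, for illustration only.
    for unit in parse_stream("^cat<n><sg>$ ^sleep<vblex><pri><p3><sg>$"):
        print(unit.lemma, unit.tags)
```

Keeping the tags as an ordered list preserves the part-of-speech tag in first position, which is convenient for the feature extraction steps planned later.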

Set baselines for definiteness, pronoun insertion, correct preposition, aspect etc.

Create a simple n-gram model and see how it performs.
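
A minimal sketch of such a model is given below, assuming a bigram model over tokens with unsmoothed maximum-likelihood estimates. The proposal does not fix the order of the model, the units it is built over, or the smoothing, so all of these are illustrative choices, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad with sentence-boundary markers so that the first and last
        # tokens also take part in a bigram.
        padded = ["<s>"] + list(sentence) + ["</s>"]
        for previous, current in zip(padded, padded[1:]):
            counts[previous][current] += 1
    return counts


def bigram_probability(counts, previous, current):
    """Maximum-likelihood estimate of P(current | previous); 0.0 if unseen."""
    total = sum(counts[previous].values()) if previous in counts else 0
    return counts[previous][current] / total if total else 0.0


if __name__ == "__main__":
    # Toy corpus, invented for illustration.
    corpus = [["see", "the", "cat"], ["see", "a", "cat"], ["see", "the", "dog"]]
    model = train_bigram_model(corpus)
    print(bigram_probability(model, "see", "the"))
```

Without smoothing, any unseen bigram gets probability zero, so a usable version would need some form of smoothing or back-off.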

After May 21

Week 1 and 2:
Deal with definiteness. The task: classify each noun as definite, indefinite or none.

Example:
Marica vidi mačku -> Mary sees the cat.
but also,
Marica vidi mačku -> Mary sees a cat.

Choose features which best describe whether a noun should be definite, indefinite or neither. These can include whether the noun was mentioned before, the tags of the surrounding/trailing words and similar context-based information. I would start with English first and later try to apply it to different languages / language groups.
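
The context features mentioned above could be extracted roughly as follows. This is only a sketch: it assumes each lexical unit is available as a (lemma, tags) pair with the part-of-speech tag first, and the function name, the feature names and the sample tags are all hypothetical.

```python
def definiteness_features(units, index):
    """Build a feature dictionary for the noun at units[index].

    `units` is a list of (lemma, tags) pairs, where `tags` is a list of
    strings whose first element is assumed to be the part-of-speech tag.
    """
    lemma, _tags = units[index]
    previous_lemmas = {earlier_lemma for earlier_lemma, _ in units[:index]}
    prev_tags = units[index - 1][1] if index > 0 else []
    next_tags = units[index + 1][1] if index + 1 < len(units) else []
    return {
        "lemma": lemma,
        # Was the same noun already mentioned earlier in the text?
        "mentioned_before": lemma in previous_lemmas,
        # Part of speech of the neighbouring words, with boundary markers.
        "prev_pos": prev_tags[0] if prev_tags else "<start>",
        "next_pos": next_tags[0] if next_tags else "<end>",
    }


if __name__ == "__main__":
    # Hypothetical analysis of "Marica vidi mačku", for illustration only.
    sentence = [
        ("Marica", ["np", "ant", "f", "sg"]),
        ("vidjeti", ["vblex", "pri", "p3", "sg"]),
        ("mačka", ["n", "f", "sg", "acc"]),
    ]
    print(definiteness_features(sentence, 2))
```

The resulting dictionaries could be fed to any standard classifier that predicts definite, indefinite or none for each noun.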

Week 2 and 3:
Deal with the pronoun + verb problem. In some languages (English, for example) verbs do not have gender and/or person, and a personal pronoun is often (but not always) inserted before the verb. This is a problem when the source language is of this type and the gender / person of the verb cannot be discovered by analysis. The task: classify each verb as first person (fp), second person (sp) or third person (tp), and as singular (sg) or plural (pl).

Example:
Marica traži Ivicu. Traži ga ali ... -> Mary is looking for James. She is looking for him but ...

First milestone.

Week 4 and 5:
Test the models for different language groups and make corrections where needed. The models that work for English will very likely have to be modified before they can be used with different language groups. However, it is expected that languages from the same group share similar properties.

Week 6:
Deal with preposition selection. Different languages use prepositions differently depending on the context, and writing rules for every preposition and every possible situation would be very demanding. The task: for every preposition in the source language, choose the appropriate preposition in the target language.

Example:
I came from school. -> Došao sam iz škole.
I took this from him. -> Ovo sam uzeo od njega.
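
One simple way to approach this task is a count-based model that picks the target preposition seen most often with a given source preposition and context word, backing off to the source preposition alone. The sketch below assumes training triples of (source preposition, context lemma, target preposition) have already been extracted from a corpus; the function names and the toy data are hypothetical, and this is only one of many possible models.

```python
from collections import Counter, defaultdict


def train_preposition_model(examples):
    """examples: iterable of (source_prep, context_lemma, target_prep) triples."""
    by_context = defaultdict(Counter)
    by_preposition = defaultdict(Counter)
    for source_prep, context_lemma, target_prep in examples:
        by_context[(source_prep, context_lemma)][target_prep] += 1
        by_preposition[source_prep][target_prep] += 1
    return by_context, by_preposition


def choose_preposition(model, source_prep, context_lemma):
    """Pick the most frequent target preposition for this context, backing
    off to context-free counts; return None for an unseen source preposition."""
    by_context, by_preposition = model
    for counts in (by_context.get((source_prep, context_lemma)),
                   by_preposition.get(source_prep)):
        if counts:
            return counts.most_common(1)[0][0]
    return None


if __name__ == "__main__":
    # Toy training triples, invented for illustration.
    model = train_preposition_model([
        ("from", "school", "iz"),
        ("from", "him", "od"),
        ("from", "house", "iz"),
    ])
    print(choose_preposition(model, "from", "him"))
```

A richer model would use more context than a single lemma, for example the tags of the governing verb and of the noun phrase that follows.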

Second milestone

Week 7 and 8:
Deal with aspect. In some languages, the aspect of the verb is not expressed through inflection, and consequently it cannot be determined from the verb itself. Some languages, such as English, use auxiliary verbs to express aspect, and some, such as the Slavic languages, use prefixes. The task: classify each verb as perfective or imperfective (or progressive).

Example:
Igrao je nogomet jučer. -> He played football yesterday.
Igrao je nogomet 3 puta. -> He has played football 3 times.

Week 8 and 9:
Deal with reflexive verbs. Verbs that can be reflexive in one language do not have to be in another. The task: classify each verb as reflexive or not reflexive.

Example:
Igrao sam se jučer -> I was playing yesterday.
Igrao sam nogomet jučer -> I played football yesterday.

Week 10 and 11:
Implementation in Apertium

Week 12:
Documentation.

-- Yet to be finished --

Skills, qualifications and field of study

I am a graduate student of Computer Science, holding a Bachelor's degree in Computing. I have an excellent knowledge of Java and C#, and I'm fairly comfortable with C/C++ and scripting languages.

Machine learning is one of my strongest skills. I have worked on quite a few ML projects involving named entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimizing models, and with feature selection and feature extraction for classification.

I did my bachelor's thesis in the field of computer vision, and my master's thesis is in the field of natural language processing.

Non-GSoC activities

I have final exams at the beginning of June, but I will be able to work more than 30 hours / week.