First name: Filip
Last name: Petkovski
Email: firstname.lastname@example.org
IRC: fpetkovski on #apertium
- 1 Why are you interested in machine translation?
- 2 Why is it that you are interested in the Apertium project?
- 3 Which of the published tasks are you interested in? What do you plan to do?
- 4 Work already done
- 5 Work to do
- 6 Skills, qualifications and field of study
- 7 Non-GSoC activities
Why are you interested in machine translation?
Machine translation is one of the greatest challenges in natural language processing. It is arguably the most useful application of NLP, and building a good MT system requires blending techniques from both computer science and linguistics.
Why is it that you are interested in the Apertium project?
Apertium is a great project. A great deal of work has clearly gone into both developing the platform and creating the language resources. However, there is always more work to be done, and being part of this project is a perfect opportunity to make a meaningful contribution to society.
Which of the published tasks are you interested in? What do you plan to do?
I'm interested in building a corpus-based lexicalised feature transfer module that sets tags based on a corpus-generated model.
Work already done
Started the apertium-sh-en language pair in the incubator.
Created a stream processor for the output of apertium-transfer that reads the stream character by character (branches/gsoc2012/fpetkovski/stream-rocessor).
Created a stream processor for the output of apertium-transfer that removes stop words specified in a dictionary (branches/gsoc2012/fpetkovski/stopwords-filter).
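As an illustration of the two stream processors listed above, here is a minimal Python sketch. It is not the code in the branches mentioned; it assumes the transfer output consists of plain lexical units of the form ^lemma<tag1><tag2>$ and ignores escaped characters, superblanks and chunks. The names LexicalUnit, read_units and remove_stop_words are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Set


@dataclass
class LexicalUnit:
    lemma: str
    tags: List[str] = field(default_factory=list)


def read_units(stream: Iterable[str]) -> Iterator[LexicalUnit]:
    """Read lexical units such as ^cat<n><sg>$ one character at a time.

    Assumption: no escaped characters, superblanks or chunks in the input.
    """
    inside = False      # True between '^' and '$'
    lemma: List[str] = []
    tags: List[str] = []
    tag = None          # characters of the tag being read, or None
    for ch in stream:
        if not inside:
            if ch == "^":
                inside, lemma, tags, tag = True, [], [], None
        elif tag is not None:
            if ch == ">":
                tags.append("".join(tag))
                tag = None
            else:
                tag.append(ch)
        elif ch == "<":
            tag = []
        elif ch == "$":
            inside = False
            yield LexicalUnit("".join(lemma), tags)
        else:
            lemma.append(ch)


def remove_stop_words(units: Iterable[LexicalUnit],
                      stop_words: Set[str]) -> Iterator[LexicalUnit]:
    """Drop every unit whose lower-cased lemma is in the stop-word set."""
    for unit in units:
        if unit.lemma.lower() not in stop_words:
            yield unit


if __name__ == "__main__":
    sample = "^the<det><def><sp>$ ^cat<n><sg>$"
    for unit in remove_stop_words(read_units(sample), {"the"}):
        print(unit)
```

Keeping the reader as a generator means the filter (and any later feature extractor) can consume units as they arrive rather than waiting for the whole stream.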
Work to do
Prior to May 21
Finish the stream processor so that it parses lemmas and tokens into a data structure. This will be useful later for extracting features from lexical units in the stream.
Set baselines for definiteness, pronoun insertion, preposition choice, aspect, etc.
Create a simple n-gram model and see how it performs.
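The baseline and the simple n-gram model mentioned above could look roughly like the following sketch. It is only one possible reading of the plan: the label of a word (for example its definiteness) is predicted from the preceding lemma, and the majority label serves as the baseline and as the fallback for unseen contexts. The class name BigramLabelModel and the toy training pairs are hypothetical and not taken from any corpus.

```python
from collections import Counter, defaultdict


class BigramLabelModel:
    """Predict a label from the previous lemma; fall back to the majority label."""

    def __init__(self):
        self.by_context = defaultdict(Counter)  # previous lemma -> label counts
        self.overall = Counter()                # label counts over all examples

    def train(self, examples):
        # Each example is a (previous_lemma, label) pair.
        for context, label in examples:
            self.by_context[context][label] += 1
            self.overall[label] += 1

    def baseline(self):
        # Majority-class baseline: the most frequent label overall.
        return self.overall.most_common(1)[0][0]

    def predict(self, context):
        counts = self.by_context.get(context)
        if counts:
            return counts.most_common(1)[0][0]
        return self.baseline()


if __name__ == "__main__":
    # Toy, made-up training data for illustration only.
    model = BigramLabelModel()
    model.train([
        ("see", "def"), ("see", "def"), ("see", "ind"),
        ("buy", "ind"), ("buy", "ind"), ("have", "ind"),
    ])
    print(model.baseline(), model.predict("see"), model.predict("unseen"))
```

Comparing predict against baseline on held-out data shows how much the one-word context actually contributes before more elaborate features are added.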
After May 21
Week 1 and 2:
Deal with definiteness. The task: classify each noun as definite, indefinite or none.
Marica vidi mačku -> Mary sees the cat.
Choose features that indicate whether a noun should be definite or not. These can include whether the noun was mentioned before, the tags of the surrounding/trailing words and similar context-based information. I would start with English and optimize the model there first.
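The kind of context features described above can be sketched as follows. This is an illustrative assumption about how the features might be encoded, not a fixed design: each lexical unit is represented as a (lemma, tags) pair, the first tag is treated as the part of speech, and the function name definiteness_features and the feature names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Unit = Tuple[str, Sequence[str]]  # (lemma, tags), e.g. ("cat", ("n", "sg"))


def definiteness_features(units: List[Unit], i: int) -> Dict[str, object]:
    """Build context features for the noun at position i."""
    lemma, tags = units[i]
    prev_tags = units[i - 1][1] if i > 0 else ()
    next_tags = units[i + 1][1] if i + 1 < len(units) else ()
    return {
        "lemma": lemma,
        # Was the same lemma seen earlier in the text?
        "mentioned_before": any(prev == lemma for prev, _ in units[:i]),
        # Assumption: the first tag of a unit is its part of speech.
        "prev_pos": prev_tags[0] if prev_tags else "<s>",
        "next_pos": next_tags[0] if next_tags else "</s>",
        "is_plural": "pl" in tags,
    }


if __name__ == "__main__":
    text = [
        ("Mary", ("np", "ant", "f", "sg")),
        ("see", ("vblex", "pri", "p3", "sg")),
        ("cat", ("n", "sg")),
        ("cat", ("n", "sg")),
    ]
    print(definiteness_features(text, 2))
    print(definiteness_features(text, 3))
```

A dictionary of named features like this can be passed to most off-the-shelf classifiers after one-hot encoding, which keeps the choice of learning algorithm open.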
Week 2 and 3:
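Deal with the pronoun + verb problem. In some languages (English, for example) verbs do not carry gender and/or person marking, and a personal pronoun is often (but not always) inserted before the verb. This is a problem when translating from a language that omits the pronoun, as in the example below. The task: classify each verb as first, second or third person (fp, sp or tp) and as singular or plural (sg or pl).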
Marica traži Ivicu. Traži ga ali ... -> Mary is looking for James. She is looking for him ...
Week 4 and 5:
Test the models for different languages and make corrections where needed. The models that work for English will probably have to be modified for other language groups. However, languages from the same group are expected to share similar properties.
Week 7 and 8:
Deal with case. Nouns in languages like English and Macedonian do not have case, and their grammatical function is determined from the context. The task: for these types of languages, train a model that determines the case of each noun and pronoun.
I saw James. -> Vidio sam Ivicu.
Week 9: Deal with aspect.
... Week 11 and 12:
Implementation in Apertium
Week 12: Documentation
Skills, qualifications and field of study
I am a graduate student of Computer Science and hold a Bachelor's degree in Computing. I have excellent knowledge of Java and C#, and I am fairly comfortable with C/C++ and scripting languages.
Machine learning is one of my strongest skills. I have worked on several ML projects involving named entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimizing models, and with feature selection and feature extraction for classification.
I did my bachelor's thesis in the field of computer vision, and my master's thesis is in the field of natural language processing.
I have final exams at the beginning of June, but I will be able to work more than 30 hours per week.