User:Fpetkovski/GSoC-2012 Application
Revision as of 23:27, 26 March 2012
First name: Filip
Last name: Petkovski
Email: filip.petkovsky@gmail.com
IRC: fpetkovski on #apertium
Why are you interested in machine translation?
Machine translation is one of the greatest challenges in natural language processing. It is the single most useful application of NLP, and building a good MT system requires a blend of techniques from both computer science and linguistics.
Why is it that you are interested in the Apertium project?
Apertium is a great project. It is obvious that a ton of work has been put into both developing the platform and creating the language resources. However, there is always more work to be done and being a part of this project is a perfect opportunity to make a big contribution to society.
Which of the published tasks are you interested in? What do you plan to do?
I'm interested in building a corpus-based lexicalised feature transfer module, which will set tags based on a corpus-generated model.
Work already done
Started the apertium-sh-en language pair in the incubator.
Created a stream processor for the output of apertium-transfer that reads the stream character by character (branches/gsoc2012/fpetkovski/stream-rocessor).
Created a stream processor for the output of apertium-transfer that removes the stop words listed in a dictionary (branches/gsoc2012/fpetkovski/stopwords-filter).
Work to do
Prior to May 21
Finish the stream processor so that it parses lemmas and tags into a data structure. This will be useful later for extracting features from the lexical units in the stream.
Set baselines for definiteness, pronoun insertion, preposition selection, aspect, etc.
Create a simple n-gram model and see how it performs.
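The first and last items above can be sketched together in Python. This is only an illustration, not the code in the branches: it assumes every lexical unit in the stream has the form ^lemma<tag1><tag2>$, and the names LexicalUnit, parse_stream and first_tag_bigrams are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Assumption: each lexical unit looks like ^lemma<tag1><tag2>$.
LU_PATTERN = re.compile(r"\^([^<$^]*)((?:<[^>]+>)*)\$")
TAG_PATTERN = re.compile(r"<([^>]+)>")


@dataclass
class LexicalUnit:
    lemma: str
    tags: list


def parse_stream(text):
    """Collect every ^lemma<tags>$ unit found in the text, in order."""
    units = []
    for match in LU_PATTERN.finditer(text):
        lemma, raw_tags = match.group(1), match.group(2)
        units.append(LexicalUnit(lemma, TAG_PATTERN.findall(raw_tags)))
    return units


def first_tag_bigrams(units):
    """Count bigrams over the first tag of each unit (a very crude n-gram model)."""
    first_tags = [u.tags[0] if u.tags else "UNK" for u in units]
    return Counter(zip(first_tags, first_tags[1:]))


# Hypothetical sample input, written by hand for this sketch.
units = parse_stream("^Mary<np><ant><f><sg>$ ^see<vblex><pri><p3><sg>$ ^cat<n><sg>$")
bigrams = first_tag_bigrams(units)
```

Counting bigrams over the first tag only keeps the example small; a real model would likely condition on more of the tag sequence and on lemmas.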
After May 21
Week 1 and 2
Deal with definiteness. The task: classify each noun as definite, indefinite or none.
Example: Marica vidi mačku -> Mary sees the cat.
Choose features which describe whether a noun should be definite or not. These can include whether the noun was mentioned before, the tags of the surrounding/trailing words and similar context-based information.
I would start with English and optimize the model for it first.
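A minimal sketch of the context features described above, assuming each token is a (lemma, tag) pair. The function name, the feature names and the sample sentence are hypothetical and only illustrate the idea; the real feature set would be chosen experimentally.

```python
def definiteness_features(tokens, index):
    """Build context features for the noun at position `index`.

    `tokens` is a list of (lemma, tag) pairs; this layout is an assumption
    made for the sketch.
    """
    lemma, _ = tokens[index]
    earlier_lemmas = {l for l, _ in tokens[:index]}
    return {
        # Was the same lemma already seen earlier in the text?
        "mentioned_before": lemma in earlier_lemmas,
        # Tags of the neighbouring words, with markers at the boundaries.
        "prev_tag": tokens[index - 1][1] if index > 0 else "BOS",
        "next_tag": tokens[index + 1][1] if index + 1 < len(tokens) else "EOS",
    }


# Hypothetical token sequence with a repeated noun.
tokens = [("Mary", "np"), ("see", "vblex"), ("cat", "n"),
          ("cat", "n"), ("sleep", "vblex")]
first_mention = definiteness_features(tokens, 2)
second_mention = definiteness_features(tokens, 3)
```

Feature dictionaries of this kind can be passed to most off-the-shelf classifiers after vectorisation.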
Week 2 and 3
Deal with the pronoun + verb problem. In some languages (English, for example) verbs are not marked for gender and/or person, and a personal pronoun is often (but not always) inserted before the verb. This is a problem when the source language omits the pronoun, as in the example below. The task: classify each verb as first, second or third person (fp, sp or tp) and as singular or plural (sg or pl).
Example: Marica traži Ivicu. Traži ga ali ... -> Mary is looking for James. She is looking for him ...
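Once a verb has been classified, choosing the pronoun to insert is a lookup. The sketch below assumes the fp/sp/tp and sg/pl labels from the task definition; the gender argument is an extra assumption, since third person singular in English also needs the gender of the antecedent. The function name is hypothetical.

```python
# Pronouns that depend only on person and number.
PRONOUNS = {
    ("fp", "sg"): "I",
    ("sp", "sg"): "you",
    ("fp", "pl"): "we",
    ("sp", "pl"): "you",
    ("tp", "pl"): "they",
}

# Third person singular also depends on gender (assumed labels: "m", "f").
THIRD_SINGULAR = {"m": "he", "f": "she"}


def pronoun_for(person, number, gender=None):
    """Return the English subject pronoun for a classified verb."""
    if (person, number) == ("tp", "sg"):
        # Fall back to "it" when the gender is unknown or neuter.
        return THIRD_SINGULAR.get(gender, "it")
    return PRONOUNS[(person, number)]


# The verb in the second sentence of the example is third person singular
# and refers back to a feminine antecedent.
subject = pronoun_for("tp", "sg", "f")
```

Deciding whether to insert a pronoun at all is the harder part and is left to the classifier; this lookup only covers the final step.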
First milestone.
Week 4 and 5:
Test the models for different languages and make corrections where needed. The models that work for English will very likely have to be modified for other language groups. However, languages from the same group are expected to share similar properties.
Week 7: Deal with reflexive
Skills, qualifications and field of study
I am a graduate student of Computer Science, holding a Bachelor's degree in Computing. I have an excellent knowledge of Java and C#, and I'm fairly comfortable with C/C++ and scripting languages.
Machine learning is one of my strongest skills. I have worked on quite a few ML projects involving named entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimizing models, and with feature selection and feature extraction for classification.
I did my bachelor's thesis in the field of computer vision, and my master's thesis is in the field of natural language processing.
Non-GSoC activities
I have final exams at the beginning of June, but I will be able to work more than 30 hours per week.