Difference between revisions of "User:Fpetkovski/GSoC-2012 Application"

Revision as of 22:42, 29 March 2012

Personal Info

First name: Filip
Last name: Petkovski
email: filip.petkovsky@gmail.com
IRC: fpetkovski on #apertium

Why are you interested in machine translation?

Machine translation is one of the greatest challenges in natural language processing. It is arguably among the most useful applications of NLP, and building a good MT system requires a blend of techniques from both computer science and linguistics.


Why is it that you are interested in the Apertium project?

Apertium is a great project. It is obvious that a ton of work has been put into both developing the platform and creating the language resources. However, there is always more work to be done and being a part of this project is a perfect opportunity to make a big contribution to society.


Which of the published tasks are you interested in? What do you plan to do?

I'm interested in building a Corpus-based lexicalised feature transfer module which will set tags based on a corpus-generated model.

Work already done

Started the apertium-sh-en language pair in the incubator.

Created a stream processor for the output of apertium-transfer that reads character by character (branches/gsoc2012/fpetkovski/stream-processor).

Created a stream processor for the output of apertium-transfer that removes stop words specified in a dictionary (branches/gsoc2012/fpetkovski/stopwords-filter).
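As an illustration of the stop-word filter idea, here is a minimal Python sketch. It is not the code in the branch above: it assumes lexical units appear in the stream as `^lemma<tag>...$`, that the stop words are available as a plain set of lemmas, and the name `remove_stop_words` is made up for this example. Escaped characters and multiword units are ignored for simplicity.

```python
import re

# One lexical unit: '^', a lemma, zero or more <tags>, then '$'.
UNIT = re.compile(r"\^([^<$]*)((?:<[^>]*>)*)\$")


def remove_stop_words(stream, stop_words):
    """Drop every lexical unit whose lemma is in stop_words."""
    def keep_or_drop(match):
        lemma = match.group(1)
        # Replace a stop word with nothing; keep any other unit unchanged.
        return "" if lemma.lower() in stop_words else match.group(0)
    return UNIT.sub(keep_or_drop, stream)


filtered = remove_stop_words("^the<det>$ ^cat<n><sg>$", {"the"})
```

The blank that followed a removed unit stays in the stream; a real filter would probably need to decide what to do with it.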

Work to do (Incomplete)

TO-DO

1. Put back the Serbo-Croatian - English examples.
2a. Better description of the classification process.
2b. More about modularity, user friendliness.
3. Something about training, testing and crossvalidation.
4. How will the baselines be made.
5. More on the n-gram model.

The big picture

The idea is to construct separate modules so each of them will deal with setting one lexical feature. The modules can later be combined into a single one where each of them could be turned on using a flag.
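A rough Python sketch of how such modules could be combined is given below. Everything in it is hypothetical: the module names, the `run_pipeline` function and the stub behaviour (tagging every noun as definite, every verb as perfective) are placeholders, and the tags `n`, `vblex`, `def` and `perf` are used only for illustration. Command-line flags would simply decide which names end up in `enabled`.

```python
from typing import Callable, Dict, List, Tuple

Unit = Tuple[str, List[str]]                 # (lemma, tags)
Module = Callable[[List[Unit]], List[Unit]]


def mark_nouns_definite(units: List[Unit]) -> List[Unit]:
    """Stub definiteness module: adds 'def' to every noun."""
    return [(lemma, tags + ["def"]) if "n" in tags else (lemma, tags)
            for lemma, tags in units]


def mark_verbs_perfective(units: List[Unit]) -> List[Unit]:
    """Stub aspect module: adds 'perf' to every verb."""
    return [(lemma, tags + ["perf"]) if "vblex" in tags else (lemma, tags)
            for lemma, tags in units]


# One entry per lexical feature; each module can be switched on by name.
MODULES: Dict[str, Module] = {
    "definiteness": mark_nouns_definite,
    "aspect": mark_verbs_perfective,
}


def run_pipeline(units: List[Unit], enabled: List[str]) -> List[Unit]:
    """Apply the enabled modules in the order given."""
    for name in enabled:
        units = MODULES[name](units)
    return units
```

Keeping each feature behind the same function signature means a module can be developed and evaluated on its own before it is combined with the others.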

The best place to insert the new module would probably be after disambiguation and before lexical transfer so the newly generated tags can be used in transfer. Some adjustments to the existing transfer rules will therefore have to be made.

It is worth noting that this module will not deal with: any type of disambiguation, anaphora resolution,

Week 1 - 4: Make adjustments to the existing resources. Correct some of the rules and add more entries in the dictionaries.
First milestone.
Week 5 and 6: Definiteness.
Week 7 and 8: Preposition selection.
Second milestone.
Week 9 and 10: Aspect.
Week 11 and 12: Possessive / partitive genitive.

Prior to May 21

Finish the stream processor so that it parses lemmas and tokens into a data structure. This will be useful later for extracting features from lexical units in the stream.
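A minimal Python sketch of such a parser follows, reading the stream one character at a time. It assumes units of the form `^lemma<tag1><tag2>$` with backslash-escaped special characters; the `LexicalUnit` class and `parse_lexical_units` function are names invented for this example, and ambiguous or multiword units are not handled.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class LexicalUnit:
    lemma: str
    tags: List[str] = field(default_factory=list)


def parse_lexical_units(stream: str) -> Iterator[LexicalUnit]:
    """Yield one LexicalUnit per ^...$ span, character by character."""
    inside_unit = False   # True between '^' and '$'
    inside_tag = False    # True between '<' and '>'
    escaped = False       # True right after a backslash
    lemma, tag, tags = [], [], []
    for ch in stream:
        if escaped:
            # An escaped character is always literal text.
            (tag if inside_tag else lemma).append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif not inside_unit:
            if ch == "^":
                inside_unit = True
                lemma, tag, tags = [], [], []
            # Anything else outside a unit (blanks etc.) is skipped.
        elif inside_tag:
            if ch == ">":
                tags.append("".join(tag))
                tag = []
                inside_tag = False
            else:
                tag.append(ch)
        elif ch == "<":
            inside_tag = True
        elif ch == "$":
            yield LexicalUnit("".join(lemma), tags)
            inside_unit = False
        else:
            lemma.append(ch)
```

For example, `list(parse_lexical_units("^vlada<n><f><sg>$"))` gives a single unit with lemma `vlada` and tags `n`, `f`, `sg`.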

Set baselines for definiteness, pronoun insertion, preposition selection, aspect, etc.
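How the baselines will be built is still open. One common choice, shown here only as an assumption, is a majority-class baseline: always predict the label seen most often in the training data and measure accuracy against the gold labels. The function names and the labels `def` / `ind` are illustrative.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most frequent label in the training data."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return most_common_label


def accuracy(predicted, gold):
    """Fraction of positions where the prediction equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Predict the majority label for every test item.
label = majority_baseline(["def", "ind", "def"])
score = accuracy([label, label], ["def", "ind"])
```

Any classifier built later should beat this score to justify the extra complexity.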

Create a simple n-gram model and see how it performs.
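As a starting point, a bigram model with add-one smoothing could look like the Python sketch below. This is an assumption about what "simple" means here, not a fixed design; the class name `BigramModel` and the `<s>` / `</s>` boundary symbols are choices made for this example.

```python
from collections import Counter


class BigramModel:
    """Bigram model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.history_counts = Counter()   # how often a token starts a bigram
        self.bigram_counts = Counter()
        self.vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + list(sentence) + ["</s>"]
            self.vocab.update(tokens)
            self.history_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(self, prev, word):
        """Smoothed estimate of P(word | prev)."""
        numerator = self.bigram_counts[(prev, word)] + 1
        denominator = self.history_counts[prev] + len(self.vocab)
        return numerator / denominator

    def score(self, sentence):
        """Product of the bigram probabilities over the whole sentence."""
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        result = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            result *= self.prob(prev, word)
        return result
```

Such a model could be trained on target-language text and used to compare candidate outputs, for instance the same sentence with two different prepositions, keeping the one with the higher score.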

After May 21

Week 1 - 4:
During the creation of the sh-mk language pair, some assumptions were made regarding the grammar of the Macedonian language, and the transfer rules were constructed under those assumptions. Because of that, we get translations like "Хрватската очекува...", meaning "The Croatia is expecting...". Since this project will try to deal with problems like definiteness, the transfer rules need to be changed.

Another issue is the coverage of Croatian. In order for context to be used properly, we need to have as much vocabulary coverage as possible, since the words themselves, and their tags, will be the predominant features.
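To make the idea of using words and tags as features concrete, here is a small Python sketch that collects the lemma and first tag of each unit in a window around a target position. The window size, the feature names and the `<pad>` filler are assumptions made for this example, and units are represented as plain `(lemma, tags)` pairs.

```python
def context_features(units, index, window=2):
    """Build a feature dict for units[index] from its neighbours.

    units is a list of (lemma, tags) pairs; positions outside the
    sentence are filled with the placeholder '<pad>'.
    """
    features = {}
    for offset in range(-window, window + 1):
        position = index + offset
        if 0 <= position < len(units):
            lemma, tags = units[position]
            features[f"lemma[{offset}]"] = lemma
            features[f"tag[{offset}]"] = tags[0] if tags else ""
        else:
            features[f"lemma[{offset}]"] = "<pad>"
            features[f"tag[{offset}]"] = "<pad>"
    return features
```

A word missing from the dictionaries has no reliable lemma or tags, so every feature drawn from that position is uninformative, which is why coverage matters for this approach.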

First milestone.

Week 5 and 6: Deal with definiteness.
Nouns in Serbo-Croatian are not marked for definiteness, so that feature has to be inferred from the context.

Example:
Hrvatska vlada izjavila je ... -> Хрватската влада изјави..

Week 7 and 8: Preposition selection.
Different languages use prepositions differently depending on the context, and writing rules for every preposition and every possible situation would be very demanding. The task: for every preposition in the source language, choose the appropriate preposition in the target language.

Example (from the existing sh-mk language pair):
Kapetan je uvijek s tih devetoro mladih pilota -> Капетанот е секогаш од тие деветмина млади пилоти.

The biggest problem here is that the incorrect preposition completely changes the meaning of the sentence. The original sentence says that the captain is always with the nine young pilots, and the translated one says that the captain is always one of the nine young pilots.
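One possible corpus-based approach, sketched in Python below, is to count which target preposition was seen with each source preposition and context word, and to back off to the source preposition alone when the context is unseen. The class name, the choice of a single context lemma as the feature, and the toy training triples are all invented for illustration; real training data would come from a corpus.

```python
from collections import Counter, defaultdict


class PrepositionSelector:
    """Pick the most frequent target preposition, with simple back-off."""

    def __init__(self):
        self.by_context = defaultdict(Counter)      # (source prep, lemma)
        self.by_preposition = defaultdict(Counter)  # source prep only

    def train(self, examples):
        """examples: iterable of (source_prep, context_lemma, target_prep)."""
        for source_prep, context_lemma, target_prep in examples:
            self.by_context[(source_prep, context_lemma)][target_prep] += 1
            self.by_preposition[source_prep][target_prep] += 1

    def select(self, source_prep, context_lemma, default=None):
        """Use the context counts if available, else the preposition counts."""
        candidates = (
            self.by_context.get((source_prep, context_lemma)),
            self.by_preposition.get(source_prep),
        )
        for counts in candidates:
            if counts:
                return counts.most_common(1)[0][0]
        return default


selector = PrepositionSelector()
# Toy, made-up training triples.
selector.train([("s", "pilot", "со"), ("s", "stol", "од"), ("s", "stol", "од")])
choice = selector.select("s", "pilot")
```

A single context lemma is a very weak feature; a classifier over a wider window of words and tags would be the natural next step.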

Second milestone

Week 9 and 10:
Deal with aspect. In some languages, the aspect of the verb is not expressed through inflection, and consequently it cannot be determined from the verb itself. Some languages, such as English, use auxiliary verbs to express aspect, and some, such as the Slavic languages, use prefixes. The task: classify each verb as perfective or imperfective (or progressive).

Example (from the existing sh-mk pair):
Ako trema nestane... -> Ако тремата исчезне...
Trema nestane kada ... -> Тремата исчезнува кога...

Week 11 and 12: Deal with possessive / partitive genitive.
Depending on the language, genitive-noun–main-noun relationships may express possession, composition, origin, etc. This is a problem because Macedonian has no case marking, so a genitive-noun–main-noun combination is translated differently depending on the relationship it describes.

Example:
Čaša vode. -> Чаша со вода.
Čaša moje sestre. -> Чашата на мојата сестра.

Skills, qualifications and field of study

I am a graduate student of Computer Science and hold a Bachelor's degree in Computing. I have excellent knowledge of Java and C#, and I'm fairly comfortable with C/C++ and scripting languages.

Machine learning is one of my strongest skills. I have worked on quite a few ML projects involving named entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimizing models, and with feature selection and feature extraction for classification.

I did my bachelor's thesis in the field of computer vision, and my master's thesis is in the field of natural language processing.

Non-GSoC activities

I have final exams at the beginning of June, but I will be able to work more than 30 hours / week.