User:Fpetkovski/GSoC-2012 Application

Personal Info

First name: Filip
Last name: Petkovski
email: filpetkovski@gmail.com
IRC: fpetkovski on #apertium

Why are you interested in machine translation?

Machine translation can be thought of as one of the greatest challenges in natural language processing. It is the single most useful application of NLP, and building a good MT system requires a blend of numerous techniques from both computer science and linguistics.


Why is it that you are interested in the Apertium project?

Apertium is a great project. It is obvious that a ton of work has been put into both developing the platform and creating the language resources. However, there is always more work to be done and being a part of this project is a perfect opportunity to make a big contribution to society.

Why should Google and Apertium sponsor it?

Capturing context is an important part of machine translation. It enables fluency and translations which are much closer to human-generated ones. However, it is quite difficult to achieve this by relying solely on rules, and a statistical/corpus-based approach yields better results for this problem. This project is intended to be an extension to the Apertium platform and will not only improve the existing Serbo-Croatian - Macedonian pair, but also serve as a prototype for implementing similar techniques in other language pairs.


Which of the published tasks are you interested in? What do you plan to do?

I'm interested in building a corpus-based lexicalised feature transfer module which will set tags based on a corpus-generated model. The existing sh-mk language pair will be used.

Work already done

  • started the apertium-sh-en language pair in incubator.
  • created a stream processor for the output of apertium-transfer that reads character by character (/branches/gsoc2012/fpetkovski/stream-processor); a parsing sketch follows this list.
  • created a stream processor for the output of apertium-transfer that removes stop words specified in a dictionary (/branches/gsoc2012/fpetkovski/stopwords-filter).
  • started a corpus-based classifier (/branches/gsoc2012/fpetkovski/corpus-based-classifier).
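
Both stream processors work on the Apertium stream format. As a rough illustration of the parsing involved, here is a minimal Python sketch of a stop-word filter, assuming lexical units of the usual ^lemma<tag1><tag2>...$ form; the file handling and helper names are mine, not the actual code in the branch.

  # Minimal sketch of an Apertium-stream stop-word filter (illustrative only;
  # not the code in /branches/gsoc2012/fpetkovski/stopwords-filter).
  import re
  import sys

  LU = re.compile(r'\^([^$]*)\$')      # one lexical unit: ^...$
  TAGS = re.compile(r'<([^>]+)>')      # the tags inside a unit

  def load_stopwords(path):
      with open(path, encoding='utf-8') as f:
          return {line.strip() for line in f if line.strip()}

  def parse_unit(body):
      """Split the contents of '^lemma<n><def>$' into (lemma, [tags])."""
      lemma = body.split('<', 1)[0]
      return lemma, TAGS.findall(body)

  def filter_stream(text, stopwords):
      def keep(match):
          lemma, _tags = parse_unit(match.group(1))
          return '' if lemma.lower() in stopwords else match.group(0)
      return LU.sub(keep, text)

  if __name__ == '__main__':
      stops = load_stopwords(sys.argv[1])   # a stop-word dictionary, one entry per line
      sys.stdout.write(filter_stream(sys.stdin.read(), stops))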

Work to do (Incomplete)

TO-DO

1. Put back the Serbo-Croatian - English examples.
2a. Better description of the classification process.
2b. More about modularity and user-friendliness.
3. Something about training, testing and cross-validation.
4. How will the baselines be made? (addressed below)
5. More on the n-gram model. (addressed below)

The big picture

The idea is to construct separate modules so each of them will deal with setting one lexical feature. The modules can later be combined into a single one where each of them could be turned on using a flag.

The best place to insert the new module would probably be after disambiguation and before lexical transfer so the newly generated tags can be used in transfer. Some adjustments to the existing transfer rules will therefore have to be made.

It is worth noting that this module will not deal with any type of disambiguation or anaphora resolution.
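
As a sketch of the intended modularity only: the combined module could be just another filter in the pipeline, reading the stream on stdin, writing it to stdout, and enabling each feature with a flag. The flag names and per-feature functions below are hypothetical placeholders, not an existing Apertium tool.

  # Illustrative skeleton of the combined module as a pipeline filter.
  # Flag names and the per-feature functions are hypothetical placeholders.
  import argparse
  import sys

  def set_definiteness(text):
      return text          # placeholder for the real definiteness classifier

  def set_prepositions(text):
      return text          # placeholder for the real preposition classifier

  def main():
      parser = argparse.ArgumentParser()
      parser.add_argument('--definiteness', action='store_true')
      parser.add_argument('--prepositions', action='store_true')
      args = parser.parse_args()

      text = sys.stdin.read()
      if args.definiteness:
          text = set_definiteness(text)
      if args.prepositions:
          text = set_prepositions(text)
      sys.stdout.write(text)

  if __name__ == '__main__':
      main()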

Week 1 - 4: Make adjustments to the existing resources. Correct some of the rules and add more entries in the dictionaries.
First milestone.
Week 5 and 6: Definiteness.
Week 7 and 8: Preposition selection.
Second milestone.
Week 9 and 10: Aspect.
Week 11 and 12: Possessive / partitive genitive.

Prior to May 21

  • Finish the stream processor so it will parse lemmas and tokens into a data structure. This will be useful later for extracting features from lexical units in the stream (see the parsing sketch under "Work already done").
  • Set baselines for definiteness, correct preposition, aspect and genitive type. A simple, and often used, way of setting a baseline is to take the class with the highest prior probability and assign that label to every example. For the definiteness problem we would take the form of the noun / pronoun that is the most common (say definite), label every noun / pronoun as definite and measure performance on a testing set. The same method can be used for setting the other three baselines; a small sketch follows this list.
  • Create a simple n-gram model and see how it performs. The idea is to use only the target side to determine whether a noun is definite. A simple way we can try is to count n-grams (bigrams, trigrams, 4-grams etc.) of the POS tags of the words before a noun/pronoun. Then, a back-off or interpolation technique can be used to determine the definiteness based on the counts in the model. For example, if we have a sequence of <pronoun> <verb> <adjective> (and then <noun>), we would check in the model and take the definiteness tag that a noun which follows that trigram most often has. If that trigram has never appeared in the training set, we would back off to a bigram model and, in this case, disregard the <pronoun> tag. A back-off sketch also follows this list.
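
As a concrete illustration of the majority-class baseline described above, here is a small sketch with toy definiteness labels; the data is invented for the example.

  # Majority-class baseline: predict the most frequent training label for
  # everything and measure accuracy on held-out labels. Toy data only.
  from collections import Counter

  def majority_baseline(train_labels, test_labels):
      majority = Counter(train_labels).most_common(1)[0][0]
      accuracy = sum(1 for y in test_labels if y == majority) / len(test_labels)
      return majority, accuracy

  train = ['def', 'def', 'indef', 'def']   # definiteness labels seen in training
  test = ['def', 'indef', 'def']           # labels of held-out nouns
  print(majority_baseline(train, test))    # ('def', 0.666...)

And a sketch of the back-off counting idea for the n-gram model, over invented POS-tag histories; the tag names and data are illustrative only.

  # Back-off n-gram sketch: count which definiteness tag follows each POS-tag
  # history, then fall back to shorter histories when the full one is unseen.
  from collections import Counter, defaultdict

  class BackoffModel:
      def __init__(self, order=3):
          self.order = order
          self.counts = defaultdict(Counter)   # history tuple -> label counts

      def train(self, examples):
          # examples: (POS tags before the noun, definiteness of that noun)
          for history, label in examples:
              history = tuple(history[-self.order:])
              for n in range(len(history), -1, -1):        # every suffix, incl. empty
                  self.counts[history[len(history) - n:]][label] += 1

      def predict(self, history):
          history = tuple(history[-self.order:])
          for n in range(len(history), -1, -1):            # back off to shorter histories
              suffix = history[len(history) - n:]
              if self.counts[suffix]:
                  return self.counts[suffix].most_common(1)[0][0]
          return None

  model = BackoffModel()
  model.train([(['prn', 'vb', 'adj'], 'def'),
               (['vb', 'adj'], 'def'),
               (['adj'], 'indef')])
  print(model.predict(['prn', 'vb', 'adj']))   # 'def', from the full trigram
  print(model.predict(['np', 'vb', 'adj']))    # unseen trigram, backs off to ('vb', 'adj')
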

After May 21

Week 1 - 4:
During the creation of the sh-mk language pair, some assumptions were made regarding the grammar of the Macedonian language, and the transfer rules were constructed under those assumptions. Because of that, we get translations like "Хрватската очекува...", meaning "The Croatia is expecting...". Since this project will try to deal with problems like definiteness, the transfer rules need to be changed.

Another issue is the coverage of Croatian. In order for context to be used properly, we need to have as much vocabulary coverage as possible, since the words themselves, and their tags, will be the predominant features.

First milestone.

Week 5 and 6: Deal with definiteness.
Nouns in Serbo-Croatian do not have definiteness, and that feature comes from the context.

Example:
Hrvatska vlada izjavila je ... -> Хрватската влада изјави..

Serbo-Croatian - English example:
Hrvatska vlada izjavila je ... -> The Croatian government said...

Week 7 and 8: Preposition selection.
Different languages use prepositions differently depending on the context, and writing rules for every preposition and every possible situation would be very demanding. The task: for every preposition in the source language, choose the appropriate preposition in the target language.

Example (from the existing sh-mk language pair):
Kapetan je uvijek s tih devetoro mladih pilota -> Капетанот е секогаш од тие деветмина млади пилоти.

The biggest problem here is that the incorrect preposition completely changes the meaning of the sentence. The original sentence says that the captain is always with the nine young pilots, while the translated one says that the captain is always one of the nine young pilots.

Serbo-Croatian - English examples:
Predstava počinje u 3pm. -> The show starts at 3pm.
Predstava počinje u utorak. -> The show starts on Tuesday.

Ovo sam uzeo od njega. -> I took this from him.
On je iz Makedonije. -> He is from Macedonia.

Second milestone.

Week 9 and 10:
Deal with aspect. In some languages, the aspect of the verb is not expressed through inflection and consequently it cannot be determined from the verb itself. Some languages, such as English, use auxiliary verbs to express aspect, and some, such as the Slavic languages, use prefixes. The task: classify each verb as perfective or imperfective (or progressive).

Example (from the existing sh-mk pair):
Ako trema nestane... -> Ако тремата исчезне...
Trema nestane kada ... -> Тремата исчезнува кога...

English - Serbo-Croatian example:
Igrao sam nogomet 3 puta. -> I have played football three times.
Igrao sam nogomet jučer. -> I played football yesterday.


Week 11 and 12: Deal with possessive / partitive genitive.
Depending on the language, specific varieties of genitive-noun / main-noun relationships may include possession, composition, origin etc. A problem arises because of the lack of case in Macedonian: genitive-noun / main-noun combinations are translated differently depending on the relationship they describe.

Example:
Čaša vode. -> Чаша со вода.
Čaša moje sestre. -> Чашата на мојата сестра.

Serbo-Croatian - English example:
Čaša vode. -> A glass of water
Čaša moje sestre. -> My sister's glass


The classification process

The first step is to extract all the possible features one can think of. These include all the tokens surrounding the target word, all of the target word's tags, all of the tags of the surrounding words, whether a word is capitalized, how many commas there are in a specified window left and right of the target word, etc. Once these features are extracted, a tool like RapidMiner or Weka will be used to test the performance of different classifiers on the whole feature set. Feature selection will be done if needed. Once the optimal feature set has been determined, the classification process will be implemented in a separate module and integrated into Apertium.
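
As an illustration of the kind of feature extraction meant here, the sketch below builds a feature dictionary for one target word from a toy list of (surface form, tags) units; the window size and feature names are arbitrary choices, not a fixed design.

  # Window-based feature extraction around a target word (illustrative only).
  def extract_features(units, i, window=2):
      features = {}
      surface, tags = units[i]
      features['target_capitalized'] = surface[:1].isupper()
      for k, tag in enumerate(tags):
          features['target_tag_%d' % k] = tag
      for offset in range(-window, window + 1):
          j = i + offset
          if offset == 0 or not (0 <= j < len(units)):
              continue
          w_surface, w_tags = units[j]
          features['token_%+d' % offset] = w_surface.lower()
          features['tags_%+d' % offset] = ','.join(w_tags)
      neighbours = units[max(0, i - window):i] + units[i + 1:i + 1 + window]
      features['commas_in_window'] = sum(1 for s, _ in neighbours if s == ',')
      return features

  units = [('Hrvatska', ['np']), ('vlada', ['n', 'f', 'sg']), ('je', ['vbser'])]
  print(extract_features(units, 1))        # features for 'vlada'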

Training, testing and evaluation

Training and testing will be done using k-fold cross-validation. The whole data set will be split into two parts, an optimization set and a validation set. The optimization set will be used for optimizing classifier parameters, by making a training and testing set out of it, training the classifier on the training set using one set of parameters and testing it on the testing set. Once the optimal parameter values have been determined, the performance will be tested on a set that has not been seen by the classifier yet (the validation set).
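
A sketch of that procedure: k-fold cross-validation on the optimization set picks a parameter setting, which is then scored once on the held-out validation set. The train_and_score() stand-in (here just the majority baseline), the parameter grid and the data are invented for the example.

  # Parameter selection via k-fold cross-validation on the optimization set,
  # followed by one evaluation on the held-out validation set.
  import random
  from collections import Counter

  def train_and_score(train, test, params):
      # Stand-in for a real classifier: ignores `params`, uses the majority label.
      majority = Counter(label for _, label in train).most_common(1)[0][0]
      return sum(1 for _, label in test if label == majority) / len(test)

  def k_fold_indices(n, k):
      idx = list(range(n))
      random.shuffle(idx)
      return [idx[i::k] for i in range(k)]             # k disjoint folds

  def cross_validate(data, params, k=5):
      scores = []
      for fold in k_fold_indices(len(data), k):
          held_out = set(fold)
          test = [data[i] for i in fold]
          train = [data[i] for i in range(len(data)) if i not in held_out]
          scores.append(train_and_score(train, test, params))
      return sum(scores) / len(scores)

  def select_parameters(optimization_set, validation_set, grid):
      best = max(grid, key=lambda p: cross_validate(optimization_set, p))
      return best, train_and_score(optimization_set, validation_set, best)

  data = [({'pos': 'n'}, 'def')] * 25 + [({'pos': 'n'}, 'indef')] * 15
  grid = [{'k': 1}, {'k': 5}]                          # invented parameter grid
  print(select_parameters(data[:30], data[30:], grid))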


Skills, qualifications and field of study

I am a graduate student of Computer Science, holding a bachelor's degree in Computing. I have an excellent knowledge of Java and C#, and I'm fairly comfortable with C/C++ and scripting languages.

Machine learning is one of my strongest skills. I have worked on quite a few ML projects involving named entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimizing models, and with feature selection and feature extraction for classification.

I did my bachelor's thesis in the field of computer vision, and my master's thesis is in the field of natural language processing.

Non-GSoC activities

I have final exams at the beginning of June, but I will still be able to work more than 30 hours per week.