User:Milos/Application

From Apertium

Hello,

My name is Milos Stanojevic and I am interested in working on Apertium during GSoC (and maybe after it). I am currently a first-year student in the European Masters in Language and Communication Technology at the University of Malta (and next year at Charles University in Prague). Before that I finished my Bachelor studies in Computer Science at the University of Nis. I have been interested in Machine Translation since last semester, when I took a course in Statistical Machine Translation taught by Professor Stephan Vogel from Carnegie Mellon University. Under his supervision I worked on an English-Maltese machine translation system that translates by combining two translation systems. The first is a "normal" SMT system, which we trained on the relatively small JRC corpus. The second is a transliteration system that uses Italian as a pivot language, because Italian and Maltese share 40% of their vocabulary. That work has now been submitted to the EMNLP conference. My first experience with Apertium came recently, when my lecturer in Computational Morphology, Prof. Aarne Ranta (creator of Grammatical Framework), gave me an assignment to write a converter from the Apertium lexicon to a Grammatical Framework lexicon.

Where is this converter? How come you didn't let us know you were working on it? (We would have loved to help in any way we could.) - Francis Tyers 11:43, 5 April 2011 (UTC)

I think that rule-based systems can improve if they include machine learning methods in some parts of the process and if they allow non-expert users to extend the rules or the lexicon. In my opinion, the list of GSoC ideas for Apertium shows that Apertium is going in the right direction.

Programming languages in which I am fluent:
-Scala - my Bachelor's thesis was on optimizing Scala programs
-Perl - I have done a lot of web programming and text processing in it
-Java - I have done most of my projects in it
-C++/C - I twice won second place in the C/C++ programming competition at Elektrijada
-I know many other languages, but not as well as the ones mentioned above.
As for databases, I have experience working with MySQL, PostgreSQL and HSQL.
I know many algorithms for NLP and machine learning (HMM, EM, Naive Bayes...).
I also know how to use NLP tools such as:
-giza++
-mgiza++
-srilm
-moses

I am a native speaker of Serbian (Serbo-Croatian, Croatian, Montenegrin...), a fluent speaker of English (my studies are in English), and I know just a little Russian. I live in the south of Serbia, where the population speaks a dialect of Serbian that is very similar to Macedonian, so you can say that I know a little Macedonian too (I can understand almost everything).


I am interested in a few GSoC project ideas that are offered by Apertium.

1) Accent and diacritic restoration. As a native speaker of Serbian, which has a few characters with diacritics, I understand the importance of solving this problem. I have read the papers that you linked on the wiki and I think I know in general what should be done. I have two ideas to propose (I don't know if they are good). For the LL2 method (described in Statistical Unicodification of African Languages), it may be better to use skip-bigrams and to increase the window in which we compute the probability of a sequence. Basically, I suggest using a Naive Bayes classifier for supervised Word Sense Disambiguation, as described in the Statistical NLP book by Manning, because each possible correct form of the word can be treated as one of the possible senses. By using skip-bigrams instead of normal bigrams we get lower discrimination but greater reliability. My second suggestion is to use transliteration techniques, such as SMT training in which each letter is treated as a word and each word as a sentence, to repair diacritics that are lost when languages with non-Latin alphabets are written in Latin script.
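To make the first idea more concrete, here is a minimal Python sketch of how a Naive Bayes restorer could look. It is only an illustration under my own assumptions, not the LL2 method from the paper: every word within a fixed window around the target is treated as an independent feature (one possible reading of the skip-bigram suggestion), each diacritized form plays the role of a "sense", and add-alpha smoothing is used. The class name `NaiveBayesRestorer`, the parameters `window` and `alpha`, and the toy training sentences are all hypothetical.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Remove combining marks, e.g. 'kuća' -> 'kuca'.

    Letters without a Unicode decomposition (such as 'đ') are left
    unchanged and would need an explicit mapping in a real system.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class NaiveBayesRestorer:
    """Chooses a diacritized form for each plain word from its context."""

    def __init__(self, window=3, alpha=1.0):
        self.window = window      # context words taken on each side
        self.alpha = alpha        # add-alpha smoothing constant
        self.candidates = defaultdict(Counter)  # plain form -> diacritized forms
        self.context = defaultdict(Counter)     # diacritized form -> context words
        self.vocab = set()                      # all plain forms seen in training

    def _window(self, tokens, i):
        lo = max(0, i - self.window)
        return tokens[lo:i] + tokens[i + 1:i + 1 + self.window]

    def train(self, sentences):
        """sentences: lists of correctly diacritized tokens."""
        for sentence in sentences:
            # Context is stored without diacritics, because at restoration
            # time the neighbouring words are not yet restored either.
            plain = [strip_diacritics(w) for w in sentence]
            self.vocab.update(plain)
            for i, word in enumerate(sentence):
                self.candidates[plain[i]][word] += 1
                self.context[word].update(self._window(plain, i))

    def _score(self, form, prior, neighbours):
        counts = self.context[form]
        total = sum(counts.values())
        size = len(self.vocab)
        score = math.log(prior)
        for word in neighbours:
            score += math.log(
                (counts[word] + self.alpha) / (total + self.alpha * size)
            )
        return score

    def restore(self, tokens):
        plain = [strip_diacritics(w) for w in tokens]
        restored = []
        for i, word in enumerate(plain):
            options = self.candidates.get(word)
            if not options:
                restored.append(word)  # unseen word: leave it as it is
                continue
            seen = sum(options.values())
            neighbours = self._window(plain, i)
            best = max(
                options,
                key=lambda f: self._score(f, options[f] / seen, neighbours),
            )
            restored.append(best)
        return restored


# Toy data: "kuća" (house) and "kuca" (knocks) both become "kuca".
TRAINING = [
    ["velika", "kuća", "na", "brdu"],
    ["stara", "kuća", "ima", "krov"],
    ["on", "kuca", "na", "vrata"],
    ["neko", "kuca", "na", "prozor"],
]

if __name__ == "__main__":
    model = NaiveBayesRestorer(window=3)
    model.train(TRAINING)
    print(model.restore(["nova", "kuca", "ima", "krov"]))
    print(model.restore(["on", "kuca", "na", "prozor"]))
```

With this toy data the neighbours "ima" and "krov" favour "kuća", while "on" and "prozor" favour "kuca". A wider window gives each candidate more evidence at the cost of less position-specific information, which is the trade-off between discrimination and reliability mentioned above.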


2) Active learning to choose among paradigms which share superficial forms. As already explained in the paper linked in the description of this project (Enlarging Monolingual Dictionaries for Machine Translation with Active Learning and Non-Expert Users), non-expert users can be a big help in constructing dictionaries for Apertium. One idea for handling different paradigms that share the same lexical forms is to use a tagged corpus, if one is available, and to take from it examples of each occurrence of the same lexical form under the different paradigms.
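A rough Python sketch of the tagged-corpus idea: for one ambiguous surface form, collect one example sentence per analysis, so that a non-expert user can be shown each reading in context. The function name `examples_by_analysis`, the corpus format (lists of `(form, tag)` pairs) and the tag labels are assumptions made for illustration, not an existing Apertium format.

```python
def examples_by_analysis(tagged_sentences, surface_form):
    """Return the first example sentence found for each tag of surface_form.

    tagged_sentences: iterable of sentences, each a list of (form, tag) pairs.
    """
    examples = {}
    for sentence in tagged_sentences:
        for form, tag in sentence:
            if form == surface_form and tag not in examples:
                # Keep the whole sentence as plain text for display.
                examples[tag] = " ".join(f for f, _ in sentence)
    return examples


# Hypothetical tagged corpus with illustrative tag labels.
CORPUS = [
    [("she", "PRON"), ("books", "VERB"), ("flights", "NOUN")],
    [("the", "DET"), ("books", "NOUN"), ("arrived", "VERB")],
    [("he", "PRON"), ("books", "VERB"), ("hotels", "NOUN")],
]

if __name__ == "__main__":
    for tag, sentence in examples_by_analysis(CORPUS, "books").items():
        print(tag, "->", sentence)
```

Each analysis found in the corpus then comes with a real sentence that can be shown to the user when asking which paradigm a new word belongs to.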

3) Dictionary induction from wikis. While working on the converter from the Apertium lexicon to a Grammatical Framework lexicon, I saw that many existing linguistic resources can be reused. I think I can make a contribution to this project.

4) I am also interested in Hybrid MT, but I do not yet understand Marclator and Apertium well enough to estimate how much time I would need or exactly how to implement a solution. Still, I am very interested in applying my skills to this project and in learning how different kinds of MT systems can be combined.

5) Adding support for Serbian (Serbo-Croatian, Croatian, Montenegrin...) and its translation, possibly to Macedonian. I know that someone has already started work on this, and I might continue that work.

I am not sure if this is too ambitious, but I think that three months will probably be enough time to finish one of the first four projects and then make a contribution to the Serbo-Croatian - Macedonian translation system.

You should choose just one project and make a really nice app. Unfortunately you don't have much time left. - Francis Tyers 11:43, 5 April 2011 (UTC)

Of all the dates on which I am supposed to work on the GSoC project, I will be unavailable from 22 to 26 May, because I need to travel to the University of Nancy 2 for a meeting that all students who receive an Erasmus Mundus scholarship have to attend. There are also 4 more days on which I cannot work, but I don't know the exact dates right now. I plan to make up for these days by working before or after the period reserved for GSoC.

This shouldn't be a problem. I recommend that you come on IRC asap. - Francis Tyers 11:43, 5 April 2011 (UTC)

Regards, Milos Stanojevic