Difference between revisions of "User:Zfe/Application"

Revision as of 08:21, 5 April 2011

Who!?

Name: Gianluca Grossi
email: me@ggrossi.com
irc: zfe @ freenode
other contacts: skype: giagrossi

Why is it you are interested in machine translation?

I first came across the Apertium project on freenode, where I am a regular on #linguistics. Even though I'm a law student, I have always been interested in programming and linguistics. I had a solid education in linguistics, especially at high school, where I was taught both Latin and Ancient Greek, and I have kept a strong interest in linguistics-related matters ever since. I'm interested in machine translation because I am a huge fan of automation, especially when it comes to typically human tasks like translation. Writing rules, morphological analyzers and transfer rules is a really challenging process for my mind, and it keeps me from getting bored.

Why is it that you are interested in the Apertium project?

As I said before, I'm really interested in programming and linguistics. Apertium is free software, which gives me the chance to develop something that can be used, rewritten and modified by anybody for any purpose. That makes me even more enthusiastic about the possibility of creating a new language pair, since my work will probably be reused by somebody else for quite different purposes. Apertium lacks a Turkic-to-Turkic language pair and, being a huge fan of Turkic languages, I think it would be worth trying. Given the similarity of the languages I'd like to work on, Apertium is the right environment to create such a pair. In addition, I have had the chance to interact with Apertium community members, and it is an environment I really like: they seem both knowledgeable and willing to help new members like me.

Which of the published tasks are you interested in? What do you plan to do?

Apertium-tr-az: machine translation between Turkish and Azerbaijani (a savage journey to the heart of the Turanist dream).

Why should Google and Apertium sponsor it? How and who will it benefit in society?

Apertium doesn't have any Turkic language pair at release quality. Turkish is the most widely spoken Turkic language, with 80M speakers. Azerbaijani has some 20M speakers: 8M if we consider just the Northern variant, which is the official language of Azerbaijan, and 12M living mostly in Iran, where their language is not recognized as an official language. For this reason there are few resources available in Azerbaijani compared with Turkish, especially when it comes to educational tools. With a good enough result, I think it would be possible (and useful) to give native speakers of Azerbaijani access to Turkish resources automatically translated into their native language. In addition, my project requires developing a morphological analyzer for Azerbaijani, which could be reused in the future for other language pairs involving Azerbaijani.


Work Plan

What needs to be done

  1. We need a morphological analyzer for Azerbaijani: I'm already working on it. It is called azmorph and can be downloaded from the Apertium SVN. I'm basing it on TRmorph, because this lets me develop it much faster. At the moment it has working vowel harmony; it can conjugate the present aorist (which is the only present tense Azerbaijani has), affirmative and negative; and it handles cases and other noun inflections such as plural suffixes and the comitative/instrumental. You can easily check what azmorph can already do by running the script in the azmorph directory.
  2. We need a bilingual dictionary (bidix): for closed categories I will proceed manually. For other categories we already have a CSV file with ~3000 words. For many words the changes between tr and az are systematic[1], so we can run a script over the file to produce every possible az candidate for any given Turkish word.
  3. We need translation rules, which I will write myself.
  4. We need to disambiguate many idiomatic expressions which have a different (or opposite) meaning in the two languages (for example hoş geldiniz, used as a welcome in Turkish and as a goodbye in Azerbaijani); I have some resources to work on this.
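
The candidate-generation script mentioned in item 2 could look roughly like the sketch below. It is only an illustration: the four correspondences (word-initial k to q before a back vowel, word-initial t to d, word-initial b to m, and e to ə) are commonly cited examples picked here as assumptions, not the rule set from the cited reference, and the names az_candidates and filter_known are hypothetical. Every rule is treated as optional, so each Turkish word yields all combinations, which would then be checked against an Azerbaijani word list.

```python
from itertools import product

BACK_VOWELS = set("aıou")


def initial_k_to_q(word):
    # Assumed rule: word-initial k before a back vowel becomes q (kara -> qara).
    if len(word) > 1 and word[0] == "k" and word[1] in BACK_VOWELS:
        return "q" + word[1:]
    return word


def initial_t_to_d(word):
    # Assumed rule: word-initial t becomes d (taş -> daş).
    return "d" + word[1:] if word.startswith("t") else word


def initial_b_to_m(word):
    # Assumed rule: word-initial b becomes m (ben -> mən, together with e -> ə).
    return "m" + word[1:] if word.startswith("b") else word


def e_to_schwa(word):
    # Assumed rule: e becomes ə everywhere in the word.
    return word.replace("e", "ə")


# Illustrative rule list; a real script would load the full set of correspondences.
RULES = [initial_k_to_q, initial_t_to_d, initial_b_to_m, e_to_schwa]


def az_candidates(tr_word):
    """Return every form obtained by applying any subset of RULES to tr_word."""
    candidates = set()
    for mask in product([False, True], repeat=len(RULES)):
        form = tr_word
        for use_rule, rule in zip(mask, RULES):
            if use_rule:
                form = rule(form)
        candidates.add(form)
    return sorted(candidates)


def filter_known(candidates, az_wordlist):
    """Keep only the candidates attested in an Azerbaijani word list."""
    return [c for c in candidates if c in az_wordlist]


if __name__ == "__main__":
    for tr_word in ["kara", "ben", "taş"]:
        print(tr_word, "->", az_candidates(tr_word))
```

Generating every combination overproduces on purpose: the unchanged Turkish form is always among the candidates, and the filtering step against attested Azerbaijani words is what decides which one goes into the bidix.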

Detailed work plan

Weeks 1 to 6: I'd like to work on the dictionary and azmorph together. With scripts I think it is possible to add about 500 words per day, considering that at the same time I'll need to work on azmorph. I expect to have a bidix with 20,000 words by the end of the sixth week. At the same time I expect to have a fully functioning azmorph, able to handle every peculiarity of Azerbaijani and at least at beta-release level.

Weeks 7-8: Translation and transfer rules. Even though the languages are really similar, there are many differences, especially in style. Even the verbal moods do not coincide 100%, so some work will be needed, especially on the Turkish side, to handle the differences properly and produce a good translation rather than just a similar one.

Weeks 9-11: Heavy testing and bug fixing. By this time I expect to have a ~20,000-word dictionary, a beta-quality azmorph and a full set of translation and transfer rules. With a complete environment in place, it is time to move on to testing and bug fixing. I'm planning to use different corpora (METU, setimes, Wikipedia) to find words missing from the bidix, manually fix rules, and fix the bugs I'll surely find in azmorph, the rules and the bidix.

Week 12: Final clean-up; everything should be fixed by this time. This week is reserved for minor fixes and for whatever was left as "TODO if you have time".
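
Checking a corpus for words missing from the bidix can be sketched as a simple coverage count. This is a minimal sketch under stated assumptions: the function names tr_lower and coverage_report are hypothetical, the dictionary is represented as a plain set of word forms, and tokens are compared as surface forms. In practice the corpus would first go through the morphological analyzer so that lemmas, not inflected forms, are looked up.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")


def tr_lower(text):
    """Lowercase with Turkish/Azerbaijani casing: I -> ı and İ -> i."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def coverage_report(corpus_text, known_words):
    """Return (Counter of unknown tokens, share of tokens found in known_words)."""
    tokens = TOKEN_RE.findall(tr_lower(corpus_text))
    if not tokens:
        return Counter(), 0.0
    missing = Counter(t for t in tokens if t not in known_words)
    covered = len(tokens) - sum(missing.values())
    return missing, covered / len(tokens)


if __name__ == "__main__":
    # Toy data standing in for a real corpus and for the bidix entries.
    sample = "Kitap masada. Kitap ISLAK."
    known = {"kitap", "masada"}
    missing, coverage = coverage_report(sample, known)
    print(f"coverage: {coverage:.0%}")
    for word, count in missing.most_common(10):
        print(count, word)
```

Sorting the unknown tokens by frequency gives a natural work queue: the most frequent missing words are the ones whose addition improves coverage the most.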

Tools and references I will be using