User:Zfe/Application

Who!?

Name: Gianluca Grossi

email: me@ggrossi.com

irc: zfe @ freenode

other contacts: skype: giagrossi

Why is it you are interested in machine translation?

I first came across the Apertium project on freenode, where I am a regular on #linguistics. Although I'm a Law student, I have always been interested in programming and linguistics. I had a solid education in linguistics, especially at high school, where I was taught both Latin and Ancient Greek, and I have kept a strong interest in linguistics-related matters ever since. I'm interested in machine translation because I am a huge fan of automation, especially when it comes to typically human tasks like translation. Writing morphological analyzers and transfer rules is a really challenging process for my mind, and it keeps me from getting bored.

Why is it that you are interested in the Apertium project?

As I said before, I'm really interested in programming and linguistics. Apertium is free software: it gives me the chance to develop something that anybody can use, rewrite or modify for any purpose. That makes me even more enthusiastic about creating a new language pair, since my work will probably be reused by somebody else for quite different purposes. Apertium lacks a Turkic-to-Turkic language pair and, being a huge fan of Turkic languages, I think it would be worth trying. Given the similarity of the languages I'd like to work with, Apertium is the right environment for creating such a pair. In addition, I have had the chance to interact with members of the Apertium community, and it is an environment I really like: they are both knowledgeable and willing to help new members like me.

Which of the published tasks are you interested in? What do you plan to do?

Apertium-tr-az: machine translation between Turkish and Azerbaijani.

Why should Google and Apertium sponsor it? How and who will it benefit in society?

Apertium doesn't have any Turkic language pair at release quality. Turkish is the most widely spoken Turkic language, with 80M speakers. Azerbaijani has some 20M speakers: about 8M speak the Northern variant, which is the official language of Azerbaijan, while about 12M live mostly in Iran, where their language is not recognized as an official language. For this reason, far fewer resources are available in Azerbaijani than in Turkish, especially educational tools. If the result is good, I think it would be possible (and useful) to give Azerbaijani native speakers the chance to have Turkish resources automatically translated into their native language. In addition, my project requires developing a morphological analyzer for Azerbaijani, which could be reused in the future for other language pairs involving Azerbaijani.


Work Plan

What needs to be done

  1. We need a morphological analyzer for Azerbaijani: I'm already working on it. It is called azmorph and can be downloaded from the Apertium SVN. I'm basing it on TRmorph, because this lets me develop it much faster. At the moment it has working vowel harmony; it can conjugate the present aorist (the only present tense Azerbaijani has) in the affirmative and negative; and it handles cases and other noun inflections such as plural suffixes and the comitative/instrumental. You can easily check what azmorph can already do by running the script in the azmorph directory.
  2. We need a bilingual dictionary (bidix): for closed categories I will proceed manually. For other categories we already have a CSV file with ~3000 words. For many words the changes between tr and az are regular[1], so we can run a script over the file to produce every possible az candidate for a given Turkish word.
  3. We need translation rules, which I will write myself.
  4. We need to disambiguate many idiomatic expressions that have a different (or opposite) meaning in the two languages (for example hoş geldiniz, used as "welcome" in Turkish and as "goodbye" in Azerbaijani). I have some resources to work from[2].
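
The candidate generation in item 2 could be sketched roughly as follows. This is only a minimal illustration: the correspondence table is a hypothetical placeholder (it is not taken from the article in note 1), and the function names are made up for this example. Each Turkish letter is expanded into its possible Azerbaijani counterparts, and every combination is emitted as a candidate that can then be checked against an Azerbaijani word list.

```python
from itertools import product

# Hypothetical, illustrative correspondences (Turkish letter -> possible
# Azerbaijani letters). A real table would come from a comparative grammar.
CORRESPONDENCES = {
    "e": ("e", "ə"),
    "k": ("k", "q"),
    "t": ("t", "d"),
}


def az_candidates(tr_word, table=CORRESPONDENCES):
    """Return every Azerbaijani candidate spelling for a Turkish word."""
    # Letters without an entry in the table are kept unchanged.
    options = [table.get(ch, (ch,)) for ch in tr_word]
    return sorted({"".join(combo) for combo in product(*options)})


def attested(candidates, az_wordlist):
    """Keep only the candidates that occur in a known Azerbaijani word list."""
    known = set(az_wordlist)
    return [c for c in candidates if c in known]


print(az_candidates("kara"))                        # ['kara', 'qara']
print(attested(az_candidates("gel"), ["gəl"]))      # ['gəl']
```

The number of candidates grows quickly for long words, so filtering against a word list (or a manual check) is what would make the output usable as bidix entries.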

Detailed work plan

Community bonding period: During this period I'd like to get to know more members of the Apertium community. Over the past weeks I have had the chance to talk and work with Fran Tyers and to get to know Unhammer. I'd like to establish connections with other members as well. In the meantime I want to get to know HFST more deeply and to write a short Turkish-Azerbaijani comparative grammar.


Weeks 1-6: I'd like to work on the dictionary and azmorph together. With scripts I think it is possible to add about 500 words per day, considering that I'll need to work on azmorph at the same time. I expect to have a bidix with 10,000+ words by the end of the sixth week. By then I also expect to have a fully functioning azmorph that can handle the peculiarities of Azerbaijani and is at least at beta quality. Deliverable: Updated bidix with 10,000+ words - Beta version of azmorph

Weeks 7-8: Translation and transfer rules. Even though the two languages are very similar, there are many differences, especially in style. Even the verbal moods do not coincide completely, so some work will be needed, especially on the Turkish side, to handle the differences properly and produce a good translation rather than a merely similar one. Deliverable: Complete transfer system

Weeks 9-11: Heavy testing and bug fixing. By this time I expect to have a dictionary of ~20,000 words, a beta-quality azmorph and a full set of translation and transfer rules. With a complete environment in place, it is time to move to testing and bug fixing. I'm planning to use different corpora (METU, SETimes, Wikipedia) to find words missing from the bidix, to fix rules manually, and to fix the bugs I'll surely find in azmorph, the rules and the bidix. Deliverable: Minor fixes on azmorph, bidix and transfer rules
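
The check for missing words could look roughly like the sketch below. It is a naive approximation under stated assumptions: it compares surface forms only, so inflected forms of known words are reported as missing, whereas a real check would compare lemmas produced by the morphological analyzer. The function names and the idea of passing the bidix as a plain word list are assumptions made for this example.

```python
import re
from collections import Counter


def tr_lower(text):
    """Lowercase with Turkish casing rules (I -> ı, İ -> i)."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", tr_lower(text))


def missing_words(corpus_text, bidix_words):
    """Count corpus tokens that have no entry in the bidix word list."""
    known = {tr_lower(w) for w in bidix_words}
    return Counter(t for t in tokenize(corpus_text) if t not in known)


def coverage(corpus_text, bidix_words):
    """Fraction of corpus tokens found in the bidix word list."""
    tokens = tokenize(corpus_text)
    if not tokens:
        return 0.0
    missing = sum(missing_words(corpus_text, bidix_words).values())
    return (len(tokens) - missing) / len(tokens)


sample = "Bira içerim. Bira!"
print(missing_words(sample, ["bira"]).most_common())  # [('içerim', 1)]
```

Sorting the missing words by frequency shows which entries would improve coverage the most if added first.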

Week 12: Final clean-up. Everything should be fixed by this time, so this week is reserved for minor fixes and for whatever was left as "TODO if you have time". Deliverable: Final version, hopefully ready for release

Tools and references I will be using

  1. Householder, F. W. Basic Course in Azerbaijani (Uralic & Altaic)
  2. Oztopcu, K. Colloquial Azerbaijani and Elementary Azerbaijani (Turk Dilleri Arastirmalari Dizisi)
  3. Swift, L. A Reference Grammar Of Modern Turkish
  4. Kornfilt, J. Turkish, Routledge
  5. Göksel, A., Kerslake, C. Turkish: a comprehensive grammar, Routledge
  6. Sultanzade, V. Turkish - Azerbaijani Dictionary of Interlingual Homonyms and Paronyms
  7. During GSoC I'll be at METU in Ankara, Turkey, where I can get help from friends who study Turkish linguistics and from the local Azeri community

What will you be doing this summer?

If I don't make it into GSoC, I will have to find a job and work in some boring place, translating (at best), while waiting for my university to open again. My summer is completely free.

Ok sure, that's a lot of blabla, but does it translate anything yet?

Yes, apertium-tr-az is already able to translate some simple sentences:

Tenses

Present
  • (tr) Içerim. → Içirəm. :: I drink
  • (tr) Içersin. → Içirsən. :: You drink
  • (tr) Içer. → Içir. :: He/she/it drinks
  • (tr) Içersiniz. → Içirsiniz. :: You (pl.) drink
  • (tr) Içeriz. → Içirik. :: We drink
  • (tr) Içerler. → Içirlər. :: They drink

Noun Inflection

  • nom (tr) Bira. → Pivə. :: Beer
  • (tr) Biram. → Pivəm. :: My beer
  • (tr) Biran. → Pivən. :: Your beer
  • (tr) Birası. → Pivəsi. :: His/Her/Its beer
  • (tr) Biramız. → Pivəmiz. :: Our beer
  • (tr) Biranız. → Pivəniz. :: Your (pl.) beer
  • (tr) Biraları. → Pivənləri. :: Their beer

Sentences

  • (tr) Hastaneye gittim. → xəstəxanaya getdim. :: I went to the hospital
  • (tr) Kovayla bira içerim, ama sen bilmezsin. → Vedrəyle pivə içirəm, amma sen bilmirsən. :: I drink beer with a bucket, but you don't know it.
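
Examples like the ones above could be kept as a small regression test and re-run after every change to azmorph, the bidix or the rules. The following is a sketch under assumptions: it supposes the pair is compiled in the current directory and exposes a mode named tr-az to the apertium command, and the helper names are invented for this example. The translator is passed in as a function, so the checking logic can also be tried without Apertium installed.

```python
import subprocess

# A few source/expected pairs taken from the examples above.
EXAMPLES = [
    ("Bira.", "Pivə."),
    ("Hastaneye gittim.", "xəstəxanaya getdim."),
]


def apertium_translate(text, mode="tr-az", data_dir="."):
    """Translate text by piping it through the apertium command.

    Assumes the language pair is compiled in data_dir and that the mode
    is called "tr-az"; both are assumptions, adjust them as needed.
    """
    result = subprocess.run(
        ["apertium", "-d", data_dir, mode],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def failures(examples, translate):
    """Return (source, expected, actual) for every example that fails."""
    failed = []
    for source, expected in examples:
        actual = translate(source)
        if actual != expected:
            failed.append((source, expected, actual))
    return failed


if __name__ == "__main__":
    for source, expected, actual in failures(EXAMPLES, apertium_translate):
        print(f"FAIL {source!r}: expected {expected!r}, got {actual!r}")
```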


NOTES

  1. http://azer.com/aiweb/categories/magazine/13_folder/13_articles/kurtulush_azeri_turkish_13.pdf
  2. Turkish - Azerbaijani Dictionary of Interlingual Homonyms and Paronyms.