Difference between revisions of "User:Mfoat/GSoC 2012 Application"

Revision as of 07:52, 5 April 2012

Contact Information

Name : Foat Musin

E-mail address : mfoat@mail.ru

Other contact information would be provided privately to the mentor.

Why is it that you are interested in machine translation?

Machine translation is one of the most research-intensive and, at the same time, most interesting and in-demand problems of the present day, an era of exponential growth in the volume of information and of rapid development of the Internet and information technology. Moreover, machine translation can be regarded as an applied platform for computational linguistics, one that tests and develops all kinds of methods, tools and technologies for natural language processing (NLP). It can therefore be viewed as an indicator of progress in applied computational linguistics as a whole, a field that interests me greatly, as I already have programming experience in developing NLP systems.

Why is it that you are interested in the Apertium project?

There are now a great many machine translation systems on the market. As a rule, however, these systems mainly support European languages and focus primarily on language pairs that include English. Moreover, these systems are not open, which makes it impossible to use them to create other language pairs. At the same time, research on other groups of languages, including minority and endangered languages, is a very important task. Introducing these languages into information technology and using them there is the only way to promote not only their preservation but also their development.

In this regard Apertium is a unique open-source platform. It relies on efficient mathematical machinery from the theory of finite automata, by means of which the lexical units of a language are compiled into finite-state transducers. This makes the algorithms independent of the language and makes it possible to build bidirectional systems with fairly high speed of analysis and generation.
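The bidirectionality mentioned above can be illustrated with a small sketch. It is a toy, not lttoolbox's actual implementation: the arcs, the state numbers and the tags `<n>`, `<sg>` and `<pl>` are assumptions chosen for illustration, and the only linguistic data used is the Turkish noun "kitap" (book) with its plural "kitaplar". The same arc table is walked in one direction for analysis and in the other for generation.

```python
from typing import List, Set, Tuple

# One arc: (source state, surface string, lexical string, target state).
# The arcs and tags below are illustrative assumptions, not Apertium data.
Arc = Tuple[int, str, str, int]

ARCS: List[Arc] = [
    (0, "kitap", "kitap<n>", 1),  # noun stem
    (1, "", "<sg>", 2),           # singular: no surface suffix
    (1, "lar", "<pl>", 2),        # plural suffix
]
FINAL_STATES: Set[int] = {2}


def transduce(text: str, arcs: List[Arc], finals: Set[int], analyse: bool) -> List[str]:
    """Walk the arcs over `text`.

    With analyse=True the surface side is read and the lexical side is
    written; with analyse=False the same arcs are used the other way round.
    """
    results: List[str] = []

    def walk(state: int, pos: int, out: str) -> None:
        # Accept only when the whole input is consumed in a final state.
        if pos == len(text) and state in finals:
            results.append(out)
        for src, surface, lexical, dst in arcs:
            if src != state:
                continue
            read, write = (surface, lexical) if analyse else (lexical, surface)
            if text.startswith(read, pos):
                walk(dst, pos + len(read), out + write)

    walk(0, 0, "")
    return results


analysis = transduce("kitaplar", ARCS, FINAL_STATES, analyse=True)
generation = transduce("kitap<n><pl>", ARCS, FINAL_STATES, analyse=False)
```

Because both directions share one table, adding a stem or a suffix arc extends analysis and generation at the same time, which is the property the paragraph above attributes to finite-state transducers.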

Which of the published tasks are you interested in? What do you plan to do?

Title: APERTIUM-TT-TR: MACHINE TRANSLATION BETWEEN TATAR AND TURKISH LANGUAGES

Why should Google and Apertium sponsor it?

Currently, language pairs from the Turkic group, including such widely spoken languages as Tatar and Turkish, have poor coverage in Apertium. At the same time, developing a pair of languages that are sufficiently close at all linguistic levels may in the future make other translation directions available without substantial loss of translation quality, for example Tatar-Russian translation by way of Google's online Turkish-Russian translator.

Moreover, as D.Sh. Suleymanov and R.A. Gilmullin point out in [6], multilingual data-processing systems are nowadays developed according to the scheme: NL1  NL2, NL3, …, NLn. At the same time, the following scheme is considered very promising: {Group of related languages1}  NL1  NL2  {Group of related languages2}. Such an approach can prove very effective for groups of languages such as Turkic, which show strong similarity and regularity at all linguistic levels. Problems connected with the structure of the text (syntax) are thus minimized, and the main attention is given to lexical disambiguation. Because the text structure of the two languages is almost identical, translation between them does not require deep analysis of the semantics of the text.

Apertium can therefore serve most effectively as a practically oriented system for building multilingual data-processing systems within one language group, translating by means of morphological models implemented as finite-state transducers (FSTs). This is especially effective for the Tatar-Turkish pair, since both languages have very rich and therefore almost automatic (highly regular) morphology.

How and who will it benefit in society?

According to Wikipedia and the encyclopedia "Round the world", there are about 8 million Tatars in the world, about 5,310.6 thousand of whom live in Russia. Tatar is one of the official languages of the Republic of Tatarstan, which is part of the Russian Federation; therefore all official documents have a Tatar-language version. In addition, in the run-up to the World Universiade in Kazan (the capital of the Republic of Tatarstan), the world community is showing special interest in the Republic and in the Tatar language. Another important fact is that the Republic of Tatarstan has close and friendly relations with Turkey; as a result, there are a great many documents of various kinds and content, mainly historical ones, that are valuable to speakers of both Tatar and Turkish. All these facts show that translation between these languages is in high demand.

Work plan

To create a Tatar-Turkish translator in Apertium, it is first of all necessary to develop the linguistic resources that support translation, namely morphological and bilingual dictionaries written in the required formats. As for the resources already developed in Apertium as part of other Turkic language pairs, my analysis showed that the morphological dictionaries for Tatar are incomplete both in the number of stems they contain (about 2,000 for Tatar) and in the affix classes needed to describe the grammar of the language fully. The aim of this application is to develop morphological dictionaries of Tatar and Turkish that represent a full model of the morphology of each language (about 20,000 stems), together with a bilingual dictionary (about 20,000 equivalents) based on the Tatar-Turkish dictionary edited by F.A. Ganiyev. In addition, a program will be created for examining the output of morphological analysis according to different data-selection criteria, taking into account all types of contextual environment: LC_ (left context), _RC (right context) and LC_RC (left and right context). Linguists will then use these resources to write disambiguation rules.
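The context-based selection described above could look roughly like the following sketch. It is only one possible reading of the plan: the `Token` class, the `select_by_context` function, the placeholder words `w1` to `w4` and their tags are all hypothetical and do not come from Apertium or from the proposal. A token counts as ambiguous when it has at least two analyses; the `left` and `right` predicates correspond to the LC_, _RC and LC_RC selection modes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Token:
    surface: str
    analyses: List[str]  # e.g. the output of a morphological analyser


Predicate = Callable[[Token], bool]


def select_by_context(
    tokens: List[Token],
    left: Optional[Predicate] = None,
    right: Optional[Predicate] = None,
    min_analyses: int = 2,
) -> List[int]:
    """Return indices of ambiguous tokens whose neighbours match the context.

    left only  -> LC_ selection
    right only -> _RC selection
    both       -> LC_RC selection
    """
    hits: List[int] = []
    for i, tok in enumerate(tokens):
        if len(tok.analyses) < min_analyses:
            continue  # unambiguous tokens are not of interest here
        if left is not None and (i == 0 or not left(tokens[i - 1])):
            continue
        if right is not None and (i == len(tokens) - 1 or not right(tokens[i + 1])):
            continue
        hits.append(i)
    return hits


def has_tag(tag: str) -> Predicate:
    """Build a predicate that is true if any analysis contains `tag`."""
    return lambda tok: any(tag in a for a in tok.analyses)


# Placeholder data; the words and tags are invented for illustration.
sample = [
    Token("w1", ["w1<adj>"]),
    Token("w2", ["w2<n>", "w2<v>"]),
    Token("w3", ["w3<v>"]),
    Token("w4", ["w4<n>", "w4<adj>"]),
]

lc_hits = select_by_context(sample, left=has_tag("<adj>"))
rc_hits = select_by_context(sample, right=has_tag("<v>"))
lc_rc_hits = select_by_context(sample, left=has_tag("<adj>"), right=has_tag("<v>"))
```

Returning indices rather than copies of the tokens keeps the surrounding sentence available, so a linguist can inspect the full context of each ambiguous form before writing a disambiguation rule.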

Weekly schedule of work

Weeks 1-2:

1. Installing the Apertium system

1.1 Ubuntu

1.2 lttoolbox

1.3 libxml utils (xmllint etc.)

1.4 apertium