User:Mfoat/GSoC 2012 Application

Contact Information

Name : Foat Musin

E-mail address : mfoat@mail.ru

Other contact information will be provided privately to the mentor.

Why is it that you are interested in machine translation?

Machine translation is one of the most research-intensive and, at the same time, most interesting and in-demand problems of our time, an era of exponential growth in the volume of information and of rapid development of the Internet and information technology. Machine translation can also be seen as an applied testbed for computational linguistics, one in which every available method, tool and technology for natural language processing (NLP) is tried out and developed. It therefore serves as an indicator of progress across applied computational linguistics as a whole, a field that interests me greatly, since I already have programming experience in developing NLP systems.

Why is it that you are interested in the Apertium project?

There are now a great many machine translation systems on the market. As a rule, however, they mainly support European languages and focus first of all on language pairs that include English. Moreover, these systems are not open, which makes it impossible to use them to create other language pairs. At the same time, research on other groups of languages, including minority and endangered languages, is a very important task. Bringing these languages into information technology is the only way to promote not just their preservation but also their development.

In this regard Apertium is a unique open-source platform. It relies on efficient mathematical machinery from finite automata theory: the lexical units of a language are compiled into finite-state transducers. This makes the algorithms independent of the language and makes it possible to build bidirectional systems with fairly fast analysis and generation.
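To illustrate why a transducer gives both directions at once, here is a minimal sketch in Python. It is not how lttoolbox is implemented: real Apertium transducers work letter by letter, while this toy one works on whole morphemes. The class `Transducer`, its methods, and the two-arc lexicon are all assumptions made up for the example.

```python
from collections import defaultdict


class Transducer:
    """Toy non-deterministic finite-state transducer over morpheme symbols."""

    def __init__(self, start, finals):
        self.start = start
        self.finals = set(finals)
        # (state, input symbol) -> list of (output symbol, next state)
        self.arcs = defaultdict(list)

    def add_arc(self, src, inp, out, dst):
        self.arcs[(src, inp)].append((out, dst))

    def inverted(self):
        """Swap input and output on every arc (analyser <-> generator)."""
        inv = Transducer(self.start, self.finals)
        for (src, inp), targets in self.arcs.items():
            for out, dst in targets:
                inv.add_arc(src, out, inp, dst)
        return inv

    def apply(self, symbols):
        """Return the output sequences of all accepting paths for `symbols`."""
        results = []
        stack = [(self.start, 0, ())]
        while stack:
            state, pos, out = stack.pop()
            if pos == len(symbols):
                if state in self.finals:
                    results.append(out)
                continue
            for arc_out, nxt in self.arcs.get((state, symbols[pos]), []):
                stack.append((nxt, pos + 1, out + (arc_out,)))
        return sorted(results)


# Hypothetical two-arc lexicon: Turkish "kitap" (book) plus the plural suffix.
analyser = Transducer(start=0, finals={1, 2})
analyser.add_arc(0, "kitap", "kitap<n>", 1)
analyser.add_arc(1, "lar", "<pl>", 2)

# The generator is the same machine with every arc reversed.
generator = analyser.inverted()

print(analyser.apply(["kitap", "lar"]))
print(generator.apply(["kitap<n>", "<pl>"]))
```

The point of the sketch is that `inverted()` only relabels arcs: the search in `apply` is the same for analysis and generation, and nothing in it depends on the language.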

Which of the published tasks are you interested in? What do you plan to do?

Title: APERTIUM-TT-TR: MACHINE TRANSLATION BETWEEN TATAR AND TURKISH LANGUAGES

Why should Google and Apertium sponsor it?

Language pairs from the Turkic group, including such widely spoken languages as Tatar and Turkish, currently have poor coverage in Apertium. Because Tatar and Turkish are close at every linguistic level, developing this pair may in the future also open up other translation directions without a substantial loss of quality; for example, Tatar-Russian translation by way of Turkish-Russian translation in Google's online translator.

Besides, as stated in the paper [6] by D.Sh. Suleymanov and R.A. Gilmullin, multilingual data processing systems are nowadays developed according to the scheme NL1 <=> NL2, NL3, …, NLn. At the same time, the following scheme is considered very promising: {Group of related languages1} <=> NL1 <=> NL2 <=> {Group of related languages2}. Such an approach can prove very effective for groups of languages such as Turkic, which show strong similarity and regularity at every linguistic level. Problems connected with the structure of the text (syntax) are thus minimized, and the main attention goes to lexical disambiguation. Because the text structure of the two languages is almost identical, translation between them does not require deep semantic analysis of the text.

Therefore, Apertium can serve most effectively as a practically oriented platform for building multilingual data processing systems within one language group, translating by means of morphological models implemented as finite-state transducers (FSTs). This is especially effective for the Tatar-Turkish pair, since both languages have a very rich yet highly regular, almost automatic, morphology.

How and who will it benefit in society?

According to Wikipedia and the encyclopedia "Round the world", there are about 8 million Tatars in the world, about 5,310.6 thousand of whom live in Russia. Tatar is one of the official languages of the Republic of Tatarstan, which is part of the Russian Federation, so every official document has a Tatar-language version. In addition, in the run-up to the World Universiade in Kazan (the capital of the Republic of Tatarstan), the world community is showing particular interest in the Republic and in the Tatar language. Another important fact is that the Republic of Tatarstan has close and friendly relations with Turkey; as a result there is a large number of documents of various kinds, chiefly historical documents, that are valuable to speakers of both Tatar and Turkish. All of this shows that translation between these languages is in high demand.

Work plan

To create a Tatar-Turkish translator in Apertium, it is first necessary to develop the linguistic resources that support translation: morphological and bilingual dictionaries in the required formats. An analysis of the resources already developed in Apertium for other Turkic language pairs showed that the morphological dictionaries for Tatar are incomplete, both in the stems they contain (about 2000 stems for Tatar) and in the affix classes needed to describe the grammar of the language fully. The aim of this application is to develop morphological dictionaries of Tatar and Turkish that represent a full model of the morphology of each language (about 20000 stems), together with a bilingual dictionary (about 20000 equivalents) based on the Tatar-Turkish dictionary edited by F.A. Ganiyev. In addition, a programme will be written for examining the output of morphological analysis under different data-selection criteria that take into account every type of contextual environment (LC_: left context; _RC: right context; LC_RC: left and right context). Linguists will then use these tools to write disambiguation rules.
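The three selection criteria can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: the input is taken to be analyser output in the Apertium stream format (`^surface/analysis1/analysis2$`) with escaping ignored, the names `Unit`, `parse_stream`, `has_tag` and `select_by_context` are hypothetical, and `SAMPLE` is a hand-written line with illustrative tags, not real analyser output.

```python
import re
from typing import List, NamedTuple, Optional


class Unit(NamedTuple):
    surface: str
    analyses: List[str]


# One lexical unit: ^surface/analysis1/analysis2$ (escaped characters not handled).
UNIT_RE = re.compile(r"\^([^/^$]+)((?:/[^/^$]+)+)\$")


def parse_stream(text: str) -> List[Unit]:
    """Split analyser output into units with their candidate analyses."""
    return [Unit(m.group(1), m.group(2)[1:].split("/"))
            for m in UNIT_RE.finditer(text)]


def has_tag(unit: Unit, tag: str) -> bool:
    """True if any candidate analysis of the unit carries the tag."""
    return any(tag in analysis for analysis in unit.analyses)


def select_by_context(units: List[Unit],
                      left_tag: Optional[str] = None,
                      right_tag: Optional[str] = None):
    """Collect ambiguous words whose neighbours match the requested context.

    left_tag only  -> LC_   (left context)
    right_tag only -> _RC   (right context)
    both           -> LC_RC (left and right context)
    """
    hits = []
    for i, unit in enumerate(units):
        if len(unit.analyses) < 2:
            continue  # unambiguous, nothing to disambiguate
        left = units[i - 1] if i > 0 else None
        right = units[i + 1] if i + 1 < len(units) else None
        if left_tag is not None and (left is None or not has_tag(left, left_tag)):
            continue
        if right_tag is not None and (right is None or not has_tag(right, right_tag)):
            continue
        hits.append((left.surface if left else None,
                     unit.surface,
                     right.surface if right else None))
    return hits


# Hand-written sample, not real analyser output; tag names are illustrative.
SAMPLE = "^bu/bu<det>$ ^yüz/yüz<n>/yüz<num>/yüz<vblex>$ ^güzel/güzel<adj>$"

units = parse_stream(SAMPLE)
print(select_by_context(units, left_tag="<det>"))                     # LC_
print(select_by_context(units, left_tag="<det>", right_tag="<adj>"))  # LC_RC
```

A linguist could run such queries over a corpus to see in which contexts a given ambiguity occurs before writing a disambiguation rule for it.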

Weekly work schedule

Week 1-2:

1. Installing the Apertium system

1.1 Ubuntu

1.2 lttoolbox

1.3 libxml utils (xmllint etc.)

1.4 apertium

Week 3-7:

2. Creating morphological dictionaries for the Tatar and Turkish languages (about 20 000 lexical units each)

2.1. Creating dictionaries of stems and paradigms from the existing two-level automaton model of Tatar and Turkish morphology built with the PC-KIMMO tools [11].

2.2. Converting the dictionaries into the Apertium format.

Week 8-11:

3. Development of the bilingual (Tatar-Turkish) dictionary.

3.1. Scanning and text recognition of the Tatar-Turkish dictionary edited by F.A. Ganiyev.

3.2. Developing a programme that automatically extracts translation equivalents and converts them into an XML format.

3.3. Correcting mistakes, editing.

3.4. Creating a bilingual dictionary of normalized stems annotated with their grammatical characteristics, based on the Tatar-Turkish dictionary and the morphological dictionaries (about 20 000 stems).

Week 11-12:

4. Development of an application for examining the results of morphological analysis under different data-selection criteria, taking into account all types of contextual environment.

Week 11-12:

5. Testing and debugging of the Tatar-Turkish machine translation.
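Step 3.2 of the schedule above could look roughly like the following Python sketch. The element layout (`<e><p><l>…<s n="…"/></l><r>…</r></p></e>`) follows Apertium's bilingual dictionary format, but the input format is purely an assumption: each recognized entry is taken to have already been reduced to a tab-separated line `Tatar lemma, part-of-speech tag, Turkish lemma`. The function names and the sample words are hypothetical, and multi-word lemmas are not handled.

```python
import xml.etree.ElementTree as ET


def parse_line(line):
    """Split one 'tatar<TAB>pos<TAB>turkish' line; return None if malformed."""
    fields = [field.strip() for field in line.split("\t")]
    if len(fields) != 3 or not all(fields):
        return None
    return tuple(fields)


def make_entry(src, pos, trg):
    """Build one bilingual entry: <e><p><l>src<s n=pos/></l><r>trg<s n=pos/></r></p></e>."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for tag, lemma in (("l", src), ("r", trg)):
        side = ET.SubElement(pair, tag)
        side.text = lemma
        ET.SubElement(side, "s", {"n": pos})
    return entry


def build_section(lines):
    """Convert extracted lines into a <section>; keep rejects for manual editing."""
    section = ET.Element("section", {"id": "main", "type": "standard"})
    skipped = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            skipped.append(line)  # goes to step 3.3 (correcting mistakes)
            continue
        section.append(make_entry(*parsed))
    return section, skipped


# Hand-picked illustrative lines, not taken from the Ganiyev dictionary.
lines = ["китап\tn\tkitap", "су\tn\tsu", "unrecognized entry"]
section, skipped = build_section(lines)
print(ET.tostring(section, encoding="unicode"))
print(skipped)
```

Lines that do not parse are collected rather than dropped, so that they can be fixed by hand during the editing stage.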

List your skills and give evidence of your qualifications

I am currently a fifth-year (final-year) student at the Kazan (Volga) Federal University, Institute of Computational Mathematics and Information Technologies, specializing in “Applied mathematics and informatics”. During my studies I have successfully completed courses in Discrete Mathematics, Programming, Logic, Information Theory, Formal Grammars and Languages, Automata Theory and Mathematical Linguistics, among others.

I work as a Java developer in a company that builds large web applications for healthcare.

I prefer working with higher-level languages such as Java, C++ and JavaScript. I use Windows and Linux alternately as my main OS. I like participating in open-source projects.

List any non-Summer-of-Code plans you have for the Summer

Google Summer of Code is my one and only plan for this summer.

References

1. http://www.apertium.org

2. http://wiki.apertium.org/wiki/%D0%A0%D1%83%D0%BA%D0%BE%D0%B2%D0%BE%D0%B4%D1%81%D1%82%D0%B2%D0%BE_%D0%BF%D0%BE_%D1%81%D0%BE%D0%B7%D0%B4%D0%B0%D0%BD%D0%B8%D1%8E_%D0%BD%D0%BE%D0%B2%D0%BE%D0%B9_%D1%8F%D0%B7%D1%8B%D0%BA%D0%BE%D0%B2%D0%BE%D0%B9_%D0%BF%D0%B0%D1%80%D1%8B

3. https://apertium.svn.sourceforge.net/svnroot/apertium

4. https://wiki.apertium.org/w/images/d/d0/Apertium2-documentation.pdf

5. http://ru.wikipedia.org/wiki/%D0%A2%D0%B0%D1%82%D0%B0%D1%80%D1%8B

6. http://www.dissercat.com/content/matematicheskoe-modelirovanie-v-mnogoyazykovykh-sistemakh-obrabotki-dannykh-na-osnove-avtoma

7. Suleymanov D.Sh., Gilmullin R.A. Realisation of the contextual correspondences А:а, А:ä in the file of phonological rules // Proceedings of the N.I. Lobachevsky Mathematical Centre. Vol. 4. Computational Linguistics. – Kazan: UNIPRESS, 1999. – P. 127-137.

8. Gilmullin R.A. Realisation of the contextual correspondences I:ı, I:е and I:0 in the file of phonological rules // Proceedings of the N.I. Lobachevsky Mathematical Centre. Vol. 4. Computational Linguistics. – Kazan: UNIPRESS, 1999. – P. 51-58.

9. Suleymanov D.Sh., Gilmullin R.A. Realisation of the contextual correspondences V:u, V:U, V:0, Y:I and Y:o in the file of phonological rules // DIALOG’2000 International Workshop Proceedings. Vol. 2. – P. 390.

10. Medin D.L., Schaffer M.M. Context theory of classification learning // Psychological Review, 85 (1978). – P. 207-238.

11. Antworth E.L. PC-KIMMO: A Two-level Processor for Morphological Analysis // Summer Institute of Linguistics, Occasional Publications in Academic Computing, Number 16. – P. 263.