Difference between revisions of "User:Gantu/Application"

Latest revision as of 11:06, 7 April 2011

Mirlan Ipasov

gantu on IRC: #apertium, #hfst


Bio

  • Bachelor's in Computer Science, IAAU, Bishkek, Kyrgyzstan (http://www.iaau.edu.kg), 2000-2004
  • MTech in Artificial Intelligence, HCU, Hyderabad, India, 2005-2007
  • PhD student at IAAU since 2010 (expected completion: 2014)
  • System Administrator at IAAU Bishkek/Kyrgyzstan 2004-2005
  • Software Developer BaraanSoft Hyderabad/India 2005-2007
  • Teaching Assistant at IAAU 2008
  • Fedora Project Ambassador in Kyrgyzstan

Languages:

  • Native: Kyrgyz, Russian, Turkish
  • Fluent: English
  • Some: Kazakh, Uzbek

Programming:

  • Java, Python, C

Why is it you are interested in machine translation?

Machine translation is a very important area in which not enough research has been done, especially for the native languages of nations with small populations. As a computer scientist, I would like to contribute to such an important and interesting field.

Which of the published tasks are you interested in? What do you plan to do?

Title: Apertium-tr-ky: New Turkish-Kyrgyz language pair.


Reasons why Google and Apertium should sponsor it

In almost all machine translation projects the Turkic languages are very poorly developed, and some are not covered at all. I will develop a new Turkic language pair, Turkish (tr)-Kyrgyz (ky), in which the Kyrgyz (ky) side will be developed completely from scratch. An interesting feature of Kyrgyz is that it is a Turkic language written in Cyrillic script. This would be the first language pair of its type in the Apertium project, and in other machine translation projects as well.

How and who it will benefit in society?

As part of this project, a new bilingual dictionary and a morphological analyzer/generator for Kyrgyz will be developed. Both could be used for many purposes. The project deliverables could be used as a tr-ky translation tool and as an educational tool. They could also serve as a tutorial for work on other Turkic languages.

Work and research done, resources

Language Information

Noun morphology

Kyrgyz has several cases:

  • absolute
  • definite accusative
  • dative
  • locative
  • ablative
  • genitive

Kyrgyz words are built by applying suffixes in the following order:

  • plural suffix
  • suffix of possession
  • personal suffix
  • case ending

китеп (kitep) = book (the stem)
китеп + plural = китептер (kitepter) = books
китептеримден (kitepterimden) = китеп +(pl)тер +(poss)им +(case)ден = from my books
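
The suffix order above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the HFST analyzer itself: the tag names (`pl`, `px1sg`, `abl`) are hypothetical, the suffix strings are fixed to the ones in the китеп example, and only three of the four suffix slots are modeled. Real Kyrgyz suffixes change with vowel harmony and the preceding consonant, which this sketch ignores.

```python
# Toy generator: stem + plural + possessive + case, in that fixed order.
# Suffix forms are taken from the китеп example only (an assumption:
# other stems would need other allomorphs).
SUFFIXES = {
    "pl": "тер",      # plural
    "px1sg": "им",    # first person singular possessive
    "abl": "ден",     # ablative case
}

# Order in which the suffix slots attach to the stem.
SLOT_ORDER = ["pl", "px1sg", "abl"]


def generate(stem, tags):
    """Attach the suffixes named in `tags` to `stem` in the fixed slot order."""
    unknown = set(tags) - set(SUFFIXES)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    form = stem
    for slot in SLOT_ORDER:
        if slot in tags:
            form += SUFFIXES[slot]
    return form


print(generate("китеп", ["pl"]))                  # китептер
print(generate("китеп", ["abl", "pl", "px1sg"]))  # китептеримден
```

The tags can be passed in any order; the slot order, not the caller, decides where each suffix lands.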

Besides the unmarked absolute form, a noun has 5 marked cases:

Case        ky                   Gloss            tr
absolute    китеп (kitep)        book             kitap
genitive    китептин (kiteptin)  of that book     o kitabın
dative      китепке (kitepke)    to that book     o kitaba
accusative  китепти (kitepti)    that book        o kitabı
locative    китепте (kitepte)    in that book     o kitapta
ablative    китептен (kitepten)  from that book   o kitaptan

Agglutination

Example:

verb = окуу (okuu) = to read, stem = оку (oku) = read

(ky) окуп жатам (okup žatam)
    оку+п жат+ам (present continuous, p1sg, Kyrgyz)
    I am reading

(In Kyrgyz, the present continuous is mostly formed with two verbs, e.g. оку+п жат+ам, where жатам is the auxiliary verb that marks the present continuous.)

(tr) okuyorum
    oku+yor+um (present continuous, p1sg, Turkish)
    I am reading

    gidiyorum = I am going
    git (lemma) -i -yor (continuous tense) -um (first person singular) (Turkish)

(ky) окуп жатам (okup žatam) = I am reading
    оку (lemma) +п (continuous tense) жат (second verb for continuous tense) +ам (first person singular) (present continuous, p1sg, Kyrgyz)

    окудум (okudum) = I read
    оку (lemma) +ду (past tense) +м (first person singular) (past tense, p1sg, Kyrgyz)

Vowel harmony

Kyrgyz generally has vowel harmony, but words borrowed from other languages, such as Russian, do not obey the vowel harmony restrictions.

китеп         (kitep)		book
китептер      (kitepter)	books
китептерим    (kitepterim)	my books

китептеримден           (kitepterimden) 
китеп.тер.им.ден	
китеп+Pl+Px1Sg+Abl	
`From my books.'

In Turkish the word "bira" (beer) is borrowed from Italian, and in Kyrgyz the word пиво (pivo) is borrowed from Russian.

пиво (pivo)			beer
пивалар (pivalar)		beers
пиваларым (pïvalarım)	my beers
пиваларымдан (pïvalarımdan)	from my beers
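
The contrast between the native and the borrowed stem can be checked mechanically. The following Python sketch is a simplified illustration: it only tests whether a word mixes front and back vowels, ignores rounding harmony, and ignores the letters я, ю and ё. The function names are made up for this example.

```python
# Front and back vowels of the Kyrgyz Cyrillic alphabet.
# Simplification: rounding harmony and я/ю/ё are not considered.
FRONT = set("еиөүэ")
BACK = set("аыоу")


def vowel_classes(word):
    """Return the set of harmony classes ('front', 'back') found in `word`."""
    classes = set()
    for ch in word.lower():
        if ch in FRONT:
            classes.add("front")
        elif ch in BACK:
            classes.add("back")
    return classes


def is_harmonic(word):
    """True if the word does not mix front and back vowels."""
    return len(vowel_classes(word)) <= 1


for w in ["китеп", "китептеримден", "пиво"]:
    print(w, is_harmonic(w))
```

Here китеп and китептеримден contain only front vowels, while the stem пиво mixes front и with back о, so it is flagged as non-harmonic. A check like this could help find stems in the dictionary that need special treatment.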

Noun and Verb comparisons

Noun:

Kyrgyz                          Turkish         Gloss
китеп (kitep)                   kitap           book
китептер (kitepter)             kitaplar        books
китебим (kitebim)               kitabım         my book
китептерим (kitepterim)         kitaplarım      my books
китептен (kitepten)             kitaptan        from the book
китептерден (kitepterden)       kitaplardan     from the books
китебимден (kitebimden)         kitabımdan      from my book
китептеримден (kitepterimden)   kitaplarımdan   from my books

Verb:

Kyrgyz                     Turkish      Gloss
ойнойм (ojnojm)            oynarım      I play
ойнойсуң (ojnojsuŋ)        oynarsın     You play
ойнойт (ojnojt)            oynar        He plays
ойнойт (ojnojt)            oynar        She plays
ойнойт (ojnojt)            oynar        It plays
ойнойсуңуз (ojnojsuŋuz)    oynarsınız   You (pl.) play
ойнойбуз (ojnojbuz)        oynarız      We play
ойношот (ojnošot)          oynarlar     They play

Dictionary Information

As mentioned above, one part of the project is building a bilingual dictionary. Fortunately, I do not have to build it from nothing. There is a StarDict version of a Turkish-Kyrgyz dictionary, without part-of-speech information. The StarDict tr-ky dictionary consists of 4,600 unique entries, with nouns and verbs mixed. The parts of speech could be extracted using trmorph (an open-source Turkish morphological analyzer).

Lexical selection example:

Verb (tr) çalışmak (en) to work

(tr) Ahmet fabrikada çalışıyor.  (en) Ahmet is working in a factory.  (ky) Ахмет фабрикада иштейт. (Axmet fabrikada ištejt)
(tr) Ahmet ders çalışıyor.       (en) Ahmet is studying.              (ky) Ахмет сабак окуп жатат. (Axmet sabak okup žatat)

Adjective (tr) hasta (en) ill

(tr) Ahmet hasta.  (en) Ahmet is ill.                  (ky) Ахмет ооруп жатат. (Axmet oorup žatat)
(tr) Hasta geldi.  (en) The ill (person) arrived.      (ky) Оорулуу келди. (Ooruluu keldi)
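
The çalışmak example can be read as a context rule: the verb is translated as оку when the word "ders" is present, and as иште otherwise. The Python sketch below illustrates that idea only. The rule format, the function name and the choice of иште as default are assumptions made for this example; it does not use or imitate an Apertium module.

```python
# Toy lexical selection for the Turkish verb stem "çalış" (çalışmak),
# following the two example sentences above.

# Each rule: (source lemma, context word that triggers it, target lemma).
RULES = [
    ("çalış", "ders", "оку"),   # "ders çalışmak" = to study
]

# Fallback translation when no rule matches (assumed default).
DEFAULTS = {
    "çalış": "иште",            # to work
}


def select(lemma, context_words):
    """Pick a Kyrgyz lemma for `lemma` given the other words in the sentence."""
    for source, trigger, target in RULES:
        if source == lemma and trigger in context_words:
            return target
    return DEFAULTS[lemma]


print(select("çalış", ["Ahmet", "fabrikada"]))  # иште
print(select("çalış", ["Ahmet", "ders"]))       # оку
```

The hasta example is a different kind of ambiguity (adjective versus noun use), so it would need part-of-speech information rather than a context word.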

What if a word is ambiguous in stem or part of speech? I am still working on a solution.

If I can obtain the parts of speech of the Turkish words in the dictionary, I can use them to build the apertium tr-ky bilingual dictionary. I plan to cover at least 3,000 words in 12 days.
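
A minimal sketch of the conversion step, in Python, is given below. It assumes the StarDict dictionary has already been exported to tab-separated lines of the form "Turkish word, tab, Kyrgyz word", which is an assumption about the export and not something stated above. `POS_LOOKUP` is a hypothetical stand-in for tags that would come from trmorph, and the sample entries are illustrative. The output uses the `<e><p><l>…</l><r>…</r></p></e>` entry layout of Apertium bilingual dictionaries.

```python
from xml.sax.saxutils import escape

# Hypothetical stand-in for part-of-speech tags obtained from a Turkish
# analyzer such as trmorph; a real run would query the analyzer instead.
POS_LOOKUP = {
    "kitap": "n",
    "oku": "vblex",
}


def bidix_entry(tr, ky, pos):
    """Format one Turkish-Kyrgyz pair as an Apertium bidix <e> element."""
    tag = f'<s n="{pos}"/>'
    return f"<e><p><l>{escape(tr)}{tag}</l><r>{escape(ky)}{tag}</r></p></e>"


def convert(lines, pos_lookup):
    """Convert 'tr<TAB>ky' lines; collect words whose POS is unknown."""
    entries, skipped = [], []
    for line in lines:
        line = line.strip()
        if "\t" not in line:
            continue  # blank or malformed line
        tr, ky = line.split("\t", 1)
        pos = pos_lookup.get(tr)
        if pos is None:
            skipped.append(tr)
        else:
            entries.append(bidix_entry(tr, ky, pos))
    return entries, skipped


sample = ["kitap\tкитеп", "oku\tоку", "bira\tпиво"]
entries, skipped = convert(sample, POS_LOOKUP)
print("\n".join(entries))
print("skipped:", skipped)
```

Words without a known part of speech are collected rather than guessed, so they can be reviewed by hand. This matches the plan to cover only part of the 4,600 entries at first.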

Kyrgyz language morphological analyzer/generator

I am developing it with HFST (an open-source toolkit for building morphological analyzers/generators). I am still learning HFST, but a simple Kyrgyz analyzer is already working, built by following Francis Tyers' tutorial (http://wiki.apertium.org/wiki/Starting_a_new_language_with_HFST).

What is done already

An apertium-tr-ky project has been created on SourceForge. It consists of a bilingual dictionary and a simple Kyrgyz morphological analyzer.

Work Plan

I have planned my work approximately according to the GSoC timeline.

Community Bonding Period

  • Set up the working environment (done).
  • Become familiar with the Apertium tools needed for development.
  • Gather grammar resources for Turkish and Kyrgyz.
  • Become familiar with HFST.
  • Figure out how to convert the StarDict tr-ky dictionary into the apertium-tr-ky bilingual dictionary.
  • Draw up a plan for the Kyrgyz morphological analyzer/generator.

Week 1: Build the apertium-tr-ky bilingual dictionary by extracting it from the StarDict dictionary.

Week 2: Build the apertium-tr-ky bilingual dictionary by extracting it from the StarDict dictionary.

Deliverable 1: Bilingual tr-ky dictionary

Week 3: Prepare the rules and theory for the morphological analyzer/generator.

Week 4: Work on the monolingual Kyrgyz dictionary for the morphological analyzer/generator.

Week 5: Work on the monolingual Kyrgyz dictionary for the morphological analyzer/generator. Build the morphological analyzer/generator for Kyrgyz with HFST.

Week 6: Build the morphological analyzer/generator for Kyrgyz with HFST. Test the morphological analyzer/generator.

Week 7: Test the morphological analyzer/generator. Prepare its documentation.

Deliverable 2: kymorph, a working morphological analyzer/generator for Kyrgyz

Week 8: Prepare Turkish-Kyrgyz transfer rules.

Week 9: Write a script for trimming lexica.

Week 10: Make all deliverables work together as the language pair apertium-tr-ky.

Week 11: Test the language pair apertium-tr-ky and fix errors.

Week 12: Test the language pair apertium-tr-ky and fix errors. Document the apertium-tr-ky language pair.

Deliverable 3: Working apertium-tr-ky language pair

Non-Summer of Code plans

GSoC 2011 is my main plan for this summer. I am employed at a university as a teaching assistant, but I am sure that I will have at least 35 free hours a week to develop for Apertium.

Resources

1. http://wiki.apertium.org/wiki/Apertium_New_Language_Pair_HOWTO
2. http://wiki.apertium.org/wiki/Turkish_to_Azerbaijani
3. http://wiki.apertium.org/wiki/Starting_a_new_language_with_HFST
4. http://www.let.rug.nl/~coltekin/trmorph/
5. http://wiki.apertium.org/wiki/Kyrgyz
6. http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/