Difference between revisions of "User:Anarsaikhan"

Latest revision as of 13:51, 28 March 2018

1 Contact info
2 Why is it that you are interested in Apertium?
3 Which of the published tasks are you interested in? What do you plan to do?
4 How and who will benefit in society? Why should Google and Apertium sponsor it?
5 Work plan
6 Skills and Qualifications
7 Non-Summer-of-Code plans you have for the Summer

Contact info

Name: Anarsaikhan Tuvshinjargal

Location: Swarthmore College (Pennsylvania, USA)

E-mail: atuvshi1@swarthmore.edu

Phone number: +1 484-474-7856 (US)

IRC: anarsaikhan

Github: Anarsaikhan

Timezone: UTC-4 (Philadelphia) / UTC+8 (Ulaanbaatar)

Why is it that you are interested in Apertium?

Apertium offers the ideal opportunity for me to contribute something meaningful to my Mongolian heritage. Founded on the principles of preserving culture and heritage through language, Apertium connects the realms of the ancient and modern through advances in machine translation platforms.

I am currently double majoring in computer science and cognitive science, as this is the perfect combination of my love of computer intelligence and my fascination with the human mind. I like to imagine every individual as a different galaxy with their own stars, planets and weird things. I have always been curious about studying new languages and their structures because, to me, it has always seemed that the secret “bridge” to human intelligence lies there. By working with Apertium’s machine translation platform, not only will I explore this passion of mine, but I will also pursue my life purpose: “to create and contribute something totally unique to this world”.

I am interested in Apertium because I believe that preserving dying languages could potentially help us understand how the human brain categorizes objects (ways of viewing) and how the human mind takes and stores information. Through the study of the world's languages, we can potentially discover what is and isn't possible in a human language. This, in turn, tells us important things about the human mind. The fewer languages there are to study, the less we will be able to learn about the human mind and the full range of complexity and structures it can produce.

Which of the published tasks are you interested in? What do you plan to do?

Adopt a new language pair: mvf-khk (Mongolian Script - Mongolian Cyrillic). I want to build a translator from the Mongolian script to Mongolian Cyrillic.

Mongolia officially uses two alphabets – the traditional Mongolian script for government documents and Cyrillic for everyday use.

One is Mongolian Cyrillic (khk), written horizontally like this:

Монгол Кирилл үсэг

The other one is the traditional Mongolian script (“Mongol bichig”, mvf), written vertically down the page like this:

[Image: sample of traditional Mongolian script]

Currently, there are no existing machine translation systems for this pair.
The pair is very closely related: these are simply two writing systems for one language.
Plenty of resources already exist for this pair, including materials written in both scripts.
There are mentors who can evaluate my work.

The script has stood the test of time and has served as a unifying factor for various Mongol-speaking ethnic groups. A hallmark of the Mongolian script is that its vertical columns run from left to right, whereas most other vertical writing systems run right to left (Wikipedia). It is an easy and speedy way of documenting speech and has, over the centuries, produced a great number of variants used for different purposes.

Mongolian Cyrillic and the Mongolian script share the same syntactic structure, so sentences are structured identically in both and the differences arise only at the word level.

There is a large amount of parallel (mvf-khk) text with ready-made vocabulary and rules (for example, http://www.cjvlang.com/mongol/index.html).
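Because the two writing systems differ only at the word level, the simplest possible baseline is a word-by-word lookup that leaves word order untouched. The following is a minimal sketch under that assumption, not the Apertium pipeline itself: the two-entry lexicon and the function name `translate_word_by_word` are illustrative, and unknown words are prefixed with `*`, mirroring the way Apertium marks words it cannot analyse.

```python
# Illustrative lexicon only: a real pair would draw on full mvf and khk dictionaries.
MVF_TO_KHK = {
    "ᠮᠣᠩᠭᠣᠯ": "монгол",
    "ᠪᠢᠴᠢᠭ": "бичиг",
}


def translate_word_by_word(text, lexicon):
    """Map each whitespace-separated token through the lexicon, keeping word order.

    Tokens missing from the lexicon are kept and prefixed with '*'.
    """
    output = []
    for token in text.split():
        output.append(lexicon.get(token, "*" + token))
    return " ".join(output)


if __name__ == "__main__":
    print(translate_word_by_word("ᠮᠣᠩᠭᠣᠯ ᠪᠢᠴᠢᠭ", MVF_TO_KHK))
```

A lookup like this ignores inflection and ambiguity, which is exactly what the morphological analysers, disambiguation, and transfer rules in the work plan are meant to handle.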

How and who will benefit in society? Why should Google and Apertium sponsor it?

The study of languages and literature, including one’s mother tongue, is part and parcel of what we mean by the full development of the human personality. Developing expertise in one’s mother tongue serves as “the passport to life” in the community in which one was born.

The traditional Mongolian script, introduced in the early 13th century by order of Chinggis Khan, is one of the most meaningful intellectual and cultural heritages of the Mongolian people. Most handwritten and printed books, administrative papers, family tree records, fairy tales, legends and Buddhist manuscripts were produced in the traditional Mongolian script until 1945, when Mongolia painfully and forcefully shifted to the Russian Cyrillic alphabet under the Kremlin’s order. After the peaceful democratic revolution of 1990 overthrew the socialist system, Mongolia moved towards the democracy it is today and wrote a new constitution. In September 1992, education in the Mongolian script began from the first year of primary school, but unfortunately most people felt it was too difficult to learn and hence not worthwhile.

Today, although the Cyrillic script is used nationwide, the government seals of the Head of State, the Prime Minister and the Cabinet ministers are all in the Mongolian script, which is a matter of pride. While visiting the Government Palace, I met with the Minister of Education and Science, who informed us that a law on the Mongolian script was passed in February 2015, under which the traditional Mongolian script would become the national script in 2025. News of this law made the whole nation rejoice and has been met by the public with tremendous enthusiasm to learn and master the national script, banned and purged since 1945.

There are over 50,000 registered manuscripts and historical records written in the traditional Mongolian script stored in the National Library of Mongolia. Many more manuscripts and books in the traditional script are stored in libraries in other countries such as China, Russia, and Germany. Although it is important to keep these millennium-old historical materials in good condition, storage conditions in Mongolia are not adequate to preserve historical records over a long period. I believe that the most efficient and effective way to protect these invaluable materials is to digitize them into a digital library, make them publicly available, and help the younger community engage with them actively.

Development of this language pair will contribute tremendously to the development, printing, and distribution of new textbooks and educational documents. Because the target materials were long inaccessible to many and neglected under communist rule, they are now of great interest and importance to scholars, historians, researchers, students, and the general public for unveiling the true history of the nation, covering not only the period when it was a unitary sovereign socialist state but the whole span from the early 13th century until today.

A complete machine translation platform for this pair will be “the passport to” Mongolia’s intellectual and cultural heritage, not only for the younger generation but also for the whole Mongolian and international communities. Millions of invaluable historical documents that exist solely in the Mongolian script will become readily available to a worldwide audience.

Work plan

Post application period

During the post-application period, the following plan will become more detailed as I work more closely on the task:

Dive into the Apertium documentation and manuals.
Finish the coding challenge with WER ~55%.
Learn Constraint Grammar and lexical selection rules.
Add to, edit, and expand both the mvf and khk dictionaries.
Analyze opportunities to improve the dictionaries (tag editing, dictionary expansion).
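The WER (word error rate) targets used throughout this plan can be checked with a short script. The sketch below assumes the usual definition, the word-level edit distance between the machine output and a post-edited reference divided by the number of reference words; the function name `word_error_rate` is illustrative and this is not Apertium's own evaluation tool.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, WER can exceed 100% when the output is much longer than the reference.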

Community bonding period

Get monolingual and bilingual aligned corpora for further analysis.
Learn to use dictionaries and tools in practice.

Work period

Part 1 (weeks 1-4):

Week 1:

Write test scripts
Add transfer rules for nouns and pronouns.
Start working on pronouns, adverbs, and adjectives.
Add appropriate rules and stems.
Achieve a WER < 20% on one basic text.

Week 2:

Add transfer rules for adjectives and adverbs.
Take another 500-word story.
Target: WER < 50%.
Post-edit the translated texts, look for common patterns, and add rules for them.

Week 3:

Finish with lexical selection rules and chunking.
Start working on disambiguation and its solutions
Refactoring and documentation.

Week 4:

Run corpus testing to analyze the improvement.
Improve morphological analyzer

Deliverable #1, June 11 - 15

Part 2 (weeks 5-8):

Week 5:

Find good parallel corpora and add words to apertium-mvf in order of decreasing frequency.
Coverage ~45%
In parallel, start work on the khk-mvf bilingual dictionary.
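Adding words in decreasing frequency and tracking coverage can both be driven from a simple token count. The sketch below assumes a naive notion of coverage, the share of corpus tokens that appear in a set of known word forms; a real measurement would run the corpus through the morphological analyser instead. The names `frequency_list` and `coverage` are illustrative.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, count) pairs sorted by decreasing frequency."""
    return Counter(tokens).most_common()


def coverage(tokens, known_words):
    """Fraction of corpus tokens found in the set of known word forms."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


if __name__ == "__main__":
    corpus = "a b a c a b".split()
    # The most frequent unknown words are the best candidates to add next.
    print(frequency_list(corpus))
    print(coverage(corpus, {"a"}))
```

Working down the frequency list raises coverage fastest, since each added word accounts for the largest remaining share of tokens.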

Week 6:

Work on a ~700-word story.
Calculate and document WER and PER.
Target WER <= 40%
Even up nouns, pronouns
Even up for verbs, adjectives, adverbs
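PER (position-independent error rate) ignores word order, so comparing it with WER shows how many errors come from word choice rather than ordering. The sketch below uses one common formulation, treating both sentences as bags of words; the function name is illustrative and other definitions of PER exist.

```python
from collections import Counter


def position_independent_error_rate(reference, hypothesis):
    """Bag-of-words error rate: unmatched words divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Multiset intersection counts words shared by both sentences,
    # regardless of where they occur.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    # Same words in a different order: no position-independent errors.
    print(position_independent_error_rate("a b c d", "d c b a"))
```

For a pair like mvf-khk, where word order is expected to be preserved, WER and PER should stay close to each other; a large gap would point to a reordering problem.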

Week 7:

Testvoc clean for all classes
Work on transfer grammar rules (t1x) using the common rules generated from post-edit analysis
WER <=30%
Bidix-coverage ~45%

Week 8:

Continue working on khk-mvf pair:
Add transfer rules for nouns, pronouns
Add transfer rules for verbs, adjectives, adverbs.
Start working on CG and disambiguation

Deliverable #2, July 9 - 13

Part 3 (weeks 9-12):

Week 9:

Continue working on disambiguation and its solutions.
Add required transfer/lexical selection rules to improve WER, PER.
Begin with chunking and t3x

Week 10:

Get another ~700-token story for mvf-khk and improve WER.
Target WER <= 25%
Regression testing for the mvf-khk pair.
Evaluate test results, make the required changes, and run the tests again.
User acceptance testing and trial evaluation.
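The regression cycle above (run tests, fix, run again) amounts to comparing translator output against a fixed list of expected translations. The sketch below is a generic harness under that assumption: `translate` stands for any callable that wraps the translation pipeline, and both it and `run_regression` are hypothetical names, not part of Apertium.

```python
def run_regression(translate, cases):
    """Run each (source, expected) case and collect the ones that fail.

    Returns a list of (source, expected, actual) tuples; an empty list
    means every case passed.
    """
    failures = []
    for source, expected in cases:
        actual = translate(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures


if __name__ == "__main__":
    # Stand-in translator for demonstration; a real run would call the pipeline.
    cases = [("a", "A"), ("b", "X")]
    for source, expected, actual in run_regression(str.upper, cases):
        print(f"FAIL {source!r}: expected {expected!r}, got {actual!r}")
```

Keeping the case list under version control means every rule change can be checked against earlier behaviour before it is committed.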

Week 11:

Regression testing for two pairs
Achieve WER < 10% on all previous advanced texts and 3 new advanced texts

Week 12:

Discuss with the mentor any final changes that must be made.
Detailed analysis of what further improvements could be made to the pairs.
Evaluation of results and documentation.

Final evaluation, August 6 - 14

Skills and Qualifications

Current field of study/major: I am currently double majoring in Computer Science and Cognitive Science at Swarthmore College (Pennsylvania, USA).

Relevant technical skills: Python, C, C++, Data Structures and Algorithms.

Relevant work experience: I have worked at “Gazar Ord” LLC (Ulaanbaatar, Mongolia) for 46 months as an assistant data analyst and a Russian language translator.

Languages:

Mongolian (Native Speaker)
Russian (Advanced, Three-time gold medalist of the National Russian Language Olympiads - 2014, 2016, 2017)
English (Advanced)
Turkish (Elementary)
Buryat (Intermediate)

Non-Summer-of-Code plans you have for the Summer

My last final exam is on May 15, so after that date I will spend ~45 hours per week on this project.

