User:Anarsaikhan

1 Contact info
2 Why is it that you are interested in Apertium?
3 Which of the published tasks are you interested in? What do you plan to do?
4 How and who will benefit in society? Why should Google and Apertium sponsor it?
5 Work plan
6 Skills and Qualifications
7 Non-Summer-of-Code plans you have for the Summer

Contact info

Name: Anarsaikhan Tuvshinjargal

Location: Swarthmore College (Pennsylvania, USA)

E-mail: atuvshi1@swarthmore.edu

Phone number: +1 484-474-7856 (US)

IRC: anarsaikhan

Github: Anarsaikhan

Timezone: UTC-4 (Philadelphia) / UTC +8 (Ulaanbaatar)

Why is it that you are interested in Apertium?

Apertium offers the ideal opportunity for me to contribute something meaningful to my Mongolian heritage. Founded on the principles of preserving culture and heritage through language, Apertium connects the realms of the ancient and modern through advances in machine translation platforms.

I am currently double majoring in computer science and cognitive science, as this is the perfect combination of my love of computer intelligence and my fascination with the human mind. I like to imagine every individual as a different galaxy with its own stars, planets and weird things. I have always been curious about studying new languages and their structures because, to me, it always seems that the secret “bridge” spanning human intelligence lies there. By working with Apertium’s machine translation platform, I will not only explore this passion of mine but also pursue my life purpose: “to create and contribute something totally unique to this world”.

I am interested in Apertium because I believe that preserving dying languages could potentially help us understand how the human brain categorizes objects (ways of viewing) and how the human mind takes and stores information. Through the study of the world's languages, we can potentially discover what is and isn't possible in a human language. This, in turn, tells us important things about the human mind. The fewer languages there are to study, the less we will be able to learn about the human mind and the full range of complexity and structures it can produce.

Which of the published tasks are you interested in? What do you plan to do?

Adopt a new language pair: mvf-khk (Mongolian Script - Mongolian Cyrillic). I want to do translation from Mongolian Script to Mongolian Cyrillic.

Mongolia officially uses two alphabets – the traditional Mongolian script for government documents and Cyrillic for everyday use.

One is Mongolian Cyrillic (khk), written horizontally like this:

Монгол Кирилл үсэг

The other one is the traditional Mongolian script (“Mongol bichig”, mvf) written vertically down the page like this:

Currently, there are no existing machine translation systems for this pair.
The two are very closely related: they are two writing systems for one language.
Plenty of resources already exist for this pair, including materials written in both scripts.
There are mentors who can evaluate my work.

The script has stood the test of time and has served as a unifying factor for various Mongol-speaking ethnic groups. A hallmark of the Mongolian script is that it is the only vertical script in human history written from left to right; all other vertical writing systems are written right to left (Wikipedia). It is an easy and speedy way of documenting what is spoken and has, over the centuries, produced a great number of variations used for different purposes.

Mongolian Cyrillic and Mongolian Script share the same syntactic structure: the two are identical at the sentence level and differ only at the word level.

There is a large amount of parallel (mvf-khk) text with a ready vocabulary and rules (for example [1]).

How and who will benefit in society? Why should Google and Apertium sponsor it?

The study of languages and literature, including one’s mother tongue, is part and parcel of what we mean by the full development of the human personality. Developing expertise in one’s mother tongue serves as “the passport to life” in the community in which one was born.

The traditional Mongolian script, adopted in the early 13th century by order of Chinggis Khan, is one of the most meaningful parts of the intellectual and cultural heritage of the Mongolian people. Most handwritten and printed books, administrative papers, family tree records, fairy tales, legends and Buddhist manuscripts were produced in the traditional Mongolian script until 1945, when Mongolia was painfully and forcibly shifted to the Russian Cyrillic alphabet under the Kremlin’s order. When the peaceful democratic revolution of 1990 overthrew the socialist system, Mongolia moved toward its present-day democracy and the writing of a new constitution. In September 1992, education in the Mongolian script began from the first year of primary school, but unfortunately most people felt it was too difficult to learn and hence not worthwhile.

Today, although the Cyrillic script is used nationwide, the government seals, starting with those of the Head of State, the Prime Minister and the Cabinet ministers, are all in the Mongolian script, which is a matter of pride. While visiting the Government Palace, I met with the Minister of Education and Science, who informed us that a law on the Mongolian script was passed in February 2015, under which the traditional Mongolian script would become the national script in 2025. News of this law made the whole nation rejoice, and the public has met it with tremendous enthusiasm to learn and master the national script that had been banned and purged since 1945.

There are over 50,000 registered manuscripts and historical records written in traditional Mongolian script stored in the National Library of Mongolia. Many more manuscripts and books in traditional Mongolian script are stored in libraries in other countries such as China, Russia, and Germany. Despite the importance of keeping millennium-old historical materials in good condition, storage conditions in Mongolia are not adequate to preserve historical records over a long period of time. I believe that the most efficient and effective way to protect these invaluable historical materials is to digitize them, create a digital library, make them publicly available, and help the younger community to actively engage with them.

Development of this language pair will contribute tremendously to the development, printing, and distribution of new textbooks and educational documents. The target materials were long neglected and inaccessible to many under communist rule, so they have lately become of great interest and importance to scholars, historians, researchers, students, and the general public for unveiling the true history of the nation: not only the period when it was a unitary sovereign socialist state, but its history from the early 13th century until today.

A complete machine translation platform for this pair will be “the passport to” Mongolia’s intellectual and cultural heritage, not only for the younger generation but also for the whole Mongolian and international communities. Millions of invaluable historical documents that exist solely in Mongolian Script will become readily available to a worldwide audience.

Work plan

Post application period

During the post-application period, the following plan will become more detailed as I work more closely on the task:

Dive into the Apertium documentation and manuals.
Finish the coding challenge with WER ~55%.
Learn Constraint Grammar and Lexical Selection rules.
Add to, edit, and expand both the mvf and khk dictionaries.
Analyse opportunities to improve the dictionaries (tag editing, dictionary expansion).

Community bonding period

Get monolingual and bilingual aligned corpora for further analysis.
Learn to use dictionaries and tools in practice.

Work period

Part 1 (weeks 1-4):

Week 1:

Write test scripts
Add transfer rules for nouns, pronouns.
Start working on pronouns, adverbs, and adjectives
Add appropriate rules/stems.
Achieve a WER < 20% for 1 basic text

Week 2:

Add transfer rules for adjectives, adverbs
Take another 500-word story.
Target: WER <50%
Post-edit translated texts. Analyze and look for common rules and add rules

Week 3:

Finish with lexical selection rules and chunking.
Start working on disambiguation and its solutions
Refactoring and documentation.

Week 4:

Run corpus testing to analyze the improvement.
Improve morphological analyzer

Deliverable #1, June 11 - 15

Part 2 (weeks 5-8):

Week 5:

Find good parallel corpora and add words to apertium-mvf in decreasing order of frequency.
Coverage ~45%
In parallel, start working on the khk-mvf bilingual dictionary
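The frequency-driven dictionary work and the coverage target above could be tracked with a small script along these lines. This is only a minimal sketch under stated assumptions: it expects text already run through a morphological analyser in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown word is marked with a leading `*`. The function name `coverage_report` and the tags in the sample are hypothetical, and escaped characters inside tokens are not handled.

```python
import re
from collections import Counter

# One lexical unit in the stream format: ^surface/analysis1/analysis2$
# (simplification: escaped characters such as \/ are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def coverage_report(analysed_text):
    """Return (coverage, unknown words sorted by decreasing frequency)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # The analyser marks a word it does not know with a leading '*'.
        if analyses.startswith("*"):
            unknown[surface.lower()] += 1
    known = total - sum(unknown.values())
    coverage = known / total if total else 0.0
    return coverage, unknown.most_common()


if __name__ == "__main__":
    # Hypothetical analyser output; the tags are illustrative only.
    sample = "^ном/ном<n><nom>$ ^уншив/*уншив$ ^уншив/*уншив$ ^гэр/*гэр$"
    coverage, todo = coverage_report(sample)
    print(f"coverage: {coverage:.0%}")
    for word, count in todo:
        print(count, word)
```

The list of unknown words, most frequent first, gives the order in which stems would be added to the dictionary, and rerunning the script after each batch shows how coverage moves toward the target.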

Week 6:

Work on a ~ 700-word story
Calculate WER, PER, and document
Target WER <=40%
Even up nouns, pronouns
Even up for verbs, adjectives, adverbs
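The WER and PER figures used as targets throughout this plan can be illustrated with a short sketch. It assumes whitespace tokenisation and a single post-edited reference per translation. WER is the word-level edit distance divided by the reference length. PER is computed here with one common bag-of-words definition, which ignores word order. The function names are my own, and an existing evaluation tool would likely be used in practice.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate (bag-of-words variant)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # Words shared by both sides, counted with multiplicity.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    reference = "Монгол Кирилл үсэг"   # post-edited translation
    hypothesis = "Монгол үсэг"         # raw system output
    print(f"WER: {wer(reference, hypothesis):.2f}")
    print(f"PER: {per(reference, hypothesis):.2f}")
```

Because PER ignores word order, it is never higher than WER for the same pair of texts, so a large gap between the two suggests reordering problems rather than missing vocabulary.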

Week 7:

Testvoc clean for all classes
Working on transfer grammar rules (t1x) using the common rules generated from post-edit analysis
WER <=30%
Bidix-coverage ~45%

Week 8:

Continue working on khk-mvf pair:
Add transfer rules for nouns, pronouns
Add transfer rules for verbs, adjectives, adverbs.
Start working on CG and disambiguation

Deliverable #2, July 9 - 13

Part 3 (weeks 9-12):

Week 9:

Continue working on disambiguation and its solutions.
Add required transfer/lexical selection rules to improve WER, PER.
Begin with chunking and t3x

Week 10:

Get another ~700 token story for mvf-khk and improve WER.
Target WER <=25%
Regression testing for mvf-khk pair
Evaluate test results, make the required changes, run tests again
User acceptance testing and trial evaluation.

Week 11:

Regression testing for two pairs
Achieve WER < 10% on all previous advanced texts and 3 new advanced texts

Week 12:

Discuss with the mentor about some final changes that must be made.
Detailed analysis on what further improvement could be made for the pairs
Evaluation of results and documentation.

Final evaluation, August 6 - 14

Skills and Qualifications

The current field of study/major: I am currently double majoring in Computer Science and Cognitive Science at Swarthmore College (Pennsylvania, USA).

Relevant technical skills: Python, C, C++, Data Structures and Algorithms.

Relevant work experience: I have worked at “Gazar Ord” LLC (Ulaanbaatar, Mongolia) for 46 months as an assistant data analyst and a Russian language translator.

Languages:

Mongolian (Native Speaker)
Russian (Advanced, Three-time gold medalist of the National Russian Language Olympiads - 2014, 2016, 2017)
English (Advanced)
Turkish (Elementary)
Buryat language (Intermediate)

Non-Summer-of-Code plans you have for the Summer

My last final exam is on May 15, so from then on I will spend ~45 hours per week on this project.
