Difference between revisions of "User:Anarsaikhan"

Revision as of 02:18, 27 March 2018

Contact info

Name: Anarsaikhan Tuvshinjargal

Location: Swarthmore College (Pennsylvania, USA)

E-mail: atuvshi1@swarthmore.edu

Phone number: +1 484-474-7856 (US)

IRC: anarsaikhan

Github: Anarsaikhan

Timezone: UTC-4 (Philadelphia) / UTC +8 (Ulaanbaatar)

Why is it that you are interested in Apertium?

Apertium is an ideal opportunity for me to contribute something meaningful to society while learning actively and exploring my interests.

I am currently double majoring in Computer Science and Cognitive Science because this is the perfect combination of my love for computer intelligence and my fascination with the human mind. I like to imagine every individual as a different galaxy with their own stars, planets and weird things. I have always been curious about studying new languages and their structures, because it has always seemed to me that the secret “bridge” to understanding human intelligence lies there. By working with Apertium’s machine translation platform I will not only explore this passion of mine but also pursue my life purpose: “to create and contribute something totally unique to this world”.

Another of the biggest reasons I am interested in Apertium is that preserving dying languages could help us understand how the human brain categorizes objects (its ways of viewing the world) and how the human mind takes in and stores information received from the outside world. By studying what all of the world's languages have in common, we can potentially discover what is and isn't possible in a human language. This, in turn, tells us important things about the human mind. The fewer languages there are to study, the less we will be able to learn about the human mind and the full range of complexity and structures it can produce.

Which of the published tasks are you interested in? What do you plan to do?

Adopt a new language pair: mvf-khk (Mongolian Script - Mongolian Cyrillic). I want to do translation from Mongolian Script to Mongolian Cyrillic.

Mongolia officially uses two alphabets – the traditional Mongolian script for government documents and Cyrillic for everyday use.

One is Mongolian Cyrillic (khk), written horizontally like this:

Монгол Кирилл үсэг

The other one is the traditional Mongolian script (“Mongol bichig”, mvf) written vertically down the page like this:

[Image: sample of traditional Mongolian script (1.jpg)]

  1. Currently, there are no existing machine translation systems for this pair.
  2. The pair is extremely closely related: these are just two writing systems for one language.
  3. Plenty of resources already exist for this pair, including materials written in both scripts.
  4. There are mentors who can evaluate my work.

The script has stood the test of time and has served as a unifying factor for various Mongol-speaking ethnic groups. Its characteristic features include being the only vertical script in human history that is written from left to right; all other vertical writing systems are written right to left (Wikipedia). It is an easy and speedy way of documenting what is spoken and has, over the centuries, produced a great number of variations used for different purposes. Mongolian Cyrillic and Mongolian Script share the same syntactic structure, so the two are identical at the sentence level and differ only at the word level. Also, once we have the mvf-khk pair, it should not take much time to do the khk-mvf pair.

A large amount of parallel translated (mvf-khk) text is available, along with ready-made vocabulary and rules (for example http://www.cjvlang.com/mongol/index.html).

[Images: mvf text, khk text, vocabulary, rule]

Who in society will benefit, and how? Why should Google and Apertium sponsor it?

The study of languages and literature, including one’s mother tongue, is part and parcel of what we mean by the full development of the human personality. Developing expertise in one’s mother tongue serves as “the passport to life” in the community in which one was born.

The traditional Mongolian script, introduced in the early 13th century by order of Chinggis Khan, is part of the intellectual and cultural heritage of the Mongolian people. Most handwritten and printed books, administrative papers, family tree records, fairy tales, legends and Buddhist manuscripts were produced in the traditional Mongolian script until 1945, when Mongolia was painfully and forcibly shifted to the Russian Cyrillic alphabet under the Kremlin’s order. After the peaceful democratic revolution of 1990 overthrew the socialist system, Mongolia moved towards its present-day democracy and wrote a new constitution. In September 1992, education in the Mongolian script began from the first year of primary school, but unfortunately most people felt it was too difficult to learn and hence not worthwhile.

Today, although the Cyrillic script is used nationwide, the government seals, from that of the Head of State to those of the Prime Minister and Cabinet ministers, are all in the Mongolian script, which is a matter of pride. While visiting the Government Palace, I met with the Minister of Education and Science, who informed us that in February 2015 the Mongolian Parliament passed a law under which the centuries-old traditional Mongolian script would become the national script again in 2025. This decision made the whole nation rejoice, and the public has met it with tremendous enthusiasm to learn and master the national script, which had been banned and purged since 1945 under communist rule.

There are over 50,000 registered manuscripts and historical records written in traditional Mongolian script stored in the National Library of Mongolia. About 21,100 of them are handwritten documents. Many more manuscripts and books in traditional Mongolian script are stored in libraries in other countries such as China, Russia, and Germany. Although it is important to keep historical materials up to 1,000 years old in good condition, storage conditions in Mongolia are not adequate to preserve historical records for a long period of time. I believe that the most efficient and effective way to protect these invaluable historical materials is to digitize them into a digital library, make them publicly available, and help the younger generation actively engage with them.

Development of this language pair will contribute greatly to developing, printing, and distributing new textbooks and educational documents, and to providing the necessary training. Because the target materials were long inaccessible to many and neglected under communist rule, they have lately become of great interest and importance to scholars, historians, researchers, university and secondary school students, and the general public for unveiling the true history of the nation: not only the period when it was a unitary sovereign socialist state, between 1924 and 1992, but its history from the early 13th century until today. Currently, the materials are kept in inadequate conditions and are in danger of being permanently lost.

A complete machine translation platform for this pair will be “the passport to” the intellectual and cultural heritage and the millions of invaluable historical documents that exist only in Mongolian Script, not only for the younger generation but also for the whole Mongolian and international community.

Work plan

Post application period

During the post-application period, the following plan will become more detailed as I work more closely on the task:

  • Dive into the Apertium documentation and manuals
  • Finish the coding challenge with WER ~55%
  • Learn Constraint Grammar and lexical selection rules
  • Add to, edit, and expand both the mvf and khk dictionaries
  • Analyse opportunities to improve the dictionaries (tag editing, dictionary expansion)
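
The WER (word error rate) targets used throughout this plan can be checked with a small script. The following is a minimal sketch, not the project's official evaluation tool: it assumes WER is the word-level edit distance between a machine translation and its post-edited reference, divided by the number of reference words, with plain whitespace tokenization. The function name `word_error_rate` is hypothetical; in practice an existing tool such as apertium-eval-translator would probably be used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are separated by whitespace, and substitutions,
    insertions and deletions all cost 1.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder sentences, not real mvf-khk data.
    print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, a hypothesis much longer than the reference can score above 100%.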

Community bonding period

  • Get monolingual and bilingual aligned corpora for further analysis.
  • Learn to use dictionaries and tools in practice.
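
The work plan below adds words in decreasing order of frequency and tracks coverage, so the corpora gathered here would first be turned into a frequency list. The following is a rough sketch under stated assumptions: tokens are already split, and a plain Python set (`known_words`, a hypothetical stand-in) replaces the real morphological analyser. In an actual Apertium setup, coverage is normally measured by running the analyser over the corpus and counting the tokens it does not mark as unknown.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def coverage(tokens, known_words):
    """Fraction of running tokens that appear in known_words.

    Assumption: known_words stands in for the analyser's lexicon.
    """
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in known_words)
    return covered / len(tokens)


if __name__ == "__main__":
    # Placeholder tokens, not real corpus data.
    corpus = "a b a c a b".split()
    print(frequency_list(corpus))
    print(coverage(corpus, {"a"}))
```

Adding the most frequent unknown words first raises coverage fastest, which is why the frequency list drives the order of dictionary work.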

Work period

    Part 1 (weeks 1-4):

    Week 1:

    • Write test scripts
    • Add transfer rules for nouns, pronouns.
    • Start working on pronouns, adverbs, and adjectives
    • Add appropriate rules/stems.
    • Achieve a WER < 20% for 1 basic text

    Week 2:

    • Add transfer rules for adjectives, adverbs
    • Take another 500-word story.
    • Target: WER <50%
    • Post-edit translated texts, look for recurring patterns, and add rules for them

    Week 3:

    • Finish with lexical selection rules and chunking.
    • Start working on disambiguation and its solutions
    • Refactoring and documentation.

    Week 4:

    • Run corpus testing to analyze improvement.
    • Improve morphological analyzer
    Deliverable #1, June 11 - 15

    Part 2 (weeks 5-8):

    Week 5:

    • Find good parallel corpora and add words in decreasing frequency in apertium-mvf.
    • Coverage ~45%
    • In parallel, start working on the khk-mvf bilingual dictionary

    Week 6:

    • Work on a ~ 500-word story
    • Calculate WER, PER, and document
    • Target WER <=50%
    • Even up nouns, pronouns
    • Even up for verbs, adjectives, adverbs

    Week 7:

    • Testvoc clean for all classes
    • Working on transfer grammar rules (t1x) using the common rules generated from post-edit analysis
    • WER <=40%
    • Bidix-coverage ~45%

    Week 8:

    • Continue working on khk-mvf pair:
    • Add transfer rules for nouns, pronouns
    • Add transfer rules for verbs, adjectives, adverbs.
    • Start working on CG and disambiguation
    Deliverable #2, July 9 - 13

    Part 3 (weeks 9-12):

    Week 9:

    • Continue working on disambiguation and its solutions.
    • Add required transfer/lexical selection rules to improve WER, PER.
    • Begin with chunking and t3x

    Week 10:

    • Get another ~500 token story for mvf-khk and improve WER.
    • Target WER <=25%
    • Regression testing for mvf-khk pair
    • Evaluate test results, make the required changes, run tests again
    • User acceptance testing and trial evaluation.

    Week 11:

    • Regression testing for two pairs
    • Achieve WER < 10% on all previous advanced texts and 3 new advanced texts

    Week 12:

    • Discuss with the mentor any final changes that must be made.
    • Detailed analysis of what further improvements could be made for the pairs
    • Evaluation of results and documentation.
    Final evaluation, August 6 - 14
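
Week 6 of the plan also tracks PER (position-independent error rate), which ignores word order. The following is a minimal sketch assuming one common formulation: count the words shared by reference and hypothesis as multisets, subtract that from the longer of the two lengths, and divide by the reference length. The function name is hypothetical, and an evaluation tool may define PER slightly differently.

```python
from collections import Counter


def position_independent_error_rate(reference: str, hypothesis: str) -> float:
    """Error rate that compares bags of words and ignores word order.

    Assumption: PER = (max(len(ref), len(hyp)) - matches) / len(ref),
    where matches is the size of the multiset intersection.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Counter & Counter keeps the minimum count of each shared word.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    # Placeholder sentences: same words, different order.
    print(position_independent_error_rate("a b c d", "d c b a"))
```

Since reordering is never penalized, PER is at most equal to WER for the same sentence pair, and a large gap between the two suggests word-order problems rather than lexical ones.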

Skills and Qualifications

The current field of study/major: I am currently double majoring in Computer Science and Cognitive Science at Swarthmore College (Pennsylvania, USA).

Relevant technical skills: Python, C, C++, Data Structures and Algorithms.

Relevant work experience: I have worked at “Gazar Ord” LLC (Ulaanbaatar, Mongolia) for 46 months as an assistant data analyst and a Russian language translator.

Languages:

  • Mongolian (Native Speaker)
  • Russian (Advanced, Three-time gold medalist of the National Russian Language Olympiads - 2014, 2016, 2017)
  • English (Advanced)
  • Turkish (Elementary)
  • Buryat language (Intermediate)

Non-Summer-of-Code plans you have for the Summer

My last final exam is on May 15, so after that I will spend ~45 hours per week on this project.