Difference between revisions of "User:Anarsaikhan"
Latest revision as of 13:51, 28 March 2018
Contents
- 1 Contact info
- 2 Why is it that you are interested in Apertium?
- 3 Which of the published tasks are you interested in? What do you plan to do?
- 4 How and who will benefit in society? Why should Google and Apertium sponsor it?
- 5 Work plan
- 6 Skills and Qualifications
- 7 Non-Summer-of-Code plans you have for the Summer
Contact info
Name: Anarsaikhan Tuvshinjargal
Location: Swarthmore College (Pennsylvania, USA)
E-mail: atuvshi1@swarthmore.edu
Phone number: +1 484-474-7856 (US)
IRC: anarsaikhan
Github: Anarsaikhan
Timezone: UTC-4 (Philadelphia) / UTC+8 (Ulaanbaatar)
Why is it that you are interested in Apertium?
Apertium offers the ideal opportunity for me to contribute something meaningful to my Mongolian heritage. Founded on the principles of preserving culture and heritage through language, Apertium connects the realms of the ancient and modern through advances in machine translation platforms.
I am currently double majoring in computer science and cognitive science, as this is the perfect combination of my love of computer intelligence and my fascination with the human mind. I like to imagine every individual as a different galaxy with their own stars, planets and weird things. I have always been curious about studying new languages and their structures because, to me, it always seems that the secret “bridge” spanning human intelligence lies there. By working with Apertium’s machine translation platform, I will not only explore this passion of mine but also pursue my life purpose: “to create and contribute something totally unique to this world”.
I am interested in Apertium because I believe that preserving dying languages could help us understand how the human brain categorizes objects (ways of viewing) and how the human mind takes in and stores information. Through the study of the world's languages, we can potentially discover what is and isn't possible in a human language. This, in turn, tells us important things about the human mind. The fewer languages there are to study, the less we will be able to learn about the human mind and the full range of complexity and structures it can produce.
Which of the published tasks are you interested in? What do you plan to do?
Adopt a new language pair: mvf-khk (Mongolian Script - Mongolian Cyrillic). I want to do translation from Mongolian Script to Mongolian Cyrillic.
Mongolia officially uses two alphabets – the traditional Mongolian script for government documents and Cyrillic for everyday use.
One is Mongolian Cyrillic (khk), written horizontally like this:
Монгол Кирилл үсэг
The other one is the traditional Mongolian script (“Mongol bichig”, mvf), written vertically down the page.
I chose this pair for the following reasons:
- Currently, there are no existing machine translation systems for this pair
- The two are very closely related: they are two writing systems for one language
- There are plenty of existing resources for this pair, including materials written in both scripts
- There are mentors who can evaluate my work
The script has stood the test of time and has served as a unifying factor for various Mongol-speaking ethnic groups. A hallmark of the Mongolian script is that it is written vertically with columns running from left to right, whereas most other vertical writing systems run right to left (Wikipedia). It is an easy and speedy way of documenting spoken language and has, over the centuries, produced a great number of variants used for different purposes.
Mongolian Cyrillic and Mongolian Script share the same syntactic structure: sentences are structured identically, and the two differ only at the word level.
There is a huge amount of parallel translated (mvf-khk) text with ready-made vocabulary and rules (for example http://www.cjvlang.com/mongol/index.html).
How and who will benefit in society? Why should Google and Apertium sponsor it?
The study of languages and literature, including one’s mother tongue, is part and parcel of what we mean by the full development of the human personality. Developing expertise in one’s mother tongue serves as “the passport to life” in the community in which one was born.
The traditional Mongolian script, introduced in the early 13th century by order of Chinggis Khan, is one of the most meaningful parts of the intellectual and cultural heritage of the Mongolian people. Most handwritten and printed books, administrative papers, family tree records, fairy tales, legends and Buddhist manuscripts were produced in the traditional Mongolian script until 1945, when Mongolia painfully and forcibly shifted to the Russian Cyrillic alphabet under the Kremlin’s order. When the peaceful democratic revolution of 1990 overthrew the socialist system, Mongolia moved towards its present-day democracy and the writing of a new constitution. In September 1992, education in the Mongolian script began from the first year of primary school, but unfortunately most people felt it was too difficult to learn and hence not worthwhile.
Today, although the Cyrillic script is used nationwide, all government seals, starting with those of the Head of State, the Prime Minister and Cabinet ministers, are in the Mongolian script, which is a matter of pride. While visiting the Government Palace, I met with the Minister of Education and Science, who informed us that a law on the Mongolian script was passed in February 2015, under which the traditional Mongolian script would become the national script in 2025. News of this law made the whole nation rejoice and has been met by the public with tremendous enthusiasm to learn and master the national script, which had been banned and purged since 1945.
There are over 50,000 registered manuscripts and historical records written in traditional Mongolian script stored in the National Library of Mongolia. Many more manuscripts and books in traditional Mongolian script are stored in libraries of other countries such as China, Russia, and Germany. Despite the importance of keeping centuries-old historical materials in good condition, storage conditions in Mongolia are not adequate for preserving historical records over long periods of time. I believe that the most efficient and effective way to protect these invaluable historical materials is to digitize them into a digital library, make them publicly available, and help the younger community to actively engage with them.
Development of this language pair will contribute tremendously to the development, printing, and distribution of new textbooks and educational documents. As the target materials have long been inaccessible to many and were neglected under communist rule, they have lately become of great interest and importance to scholars, historians, researchers, students, and the general public for unveiling the true history of the nation, not only from the time when it was a unitary sovereign socialist state but from the early 13th century until today.
A complete machine translation platform for this pair will be “the passport” to Mongolia’s intellectual and cultural heritage, not only for the younger generation but also for the whole Mongolian and international communities. Millions of invaluable historical documents that exist solely in Mongolian Script will become readily available to a worldwide audience.
Work plan
Post application period
During the post-application period, the following plan will become more detailed, as I work closer to the task:
- Dive into Apertium documentation and manuals
- Finish the coding challenge with WER ~55%
- Learn Constraint Grammar and Lexical Selection rules
- Add to, edit, and expand both the mvf and khk dictionaries
- Analyse opportunities to improve the dictionaries (tag editing, dictionary expansion)
Community bonding period
- Get monolingual and bilingual aligned corpora for further analysis.
- Learn to use dictionaries and tools in practice.
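One simple form of corpus analysis that feeds the later steps of this plan (adding words in decreasing frequency and tracking coverage) can be sketched in Python as below. This is only an illustration under stated assumptions: the function names `naive_coverage` and `unknown_by_frequency` are hypothetical, tokens are assumed to be already split, and "known" simply means the surface form is in a plain set. In a real Apertium setup, coverage would be measured from the morphological analyser's output rather than from a word set.

```python
from collections import Counter


def naive_coverage(tokens, known_words):
    """Share of corpus tokens whose surface form is in known_words.

    Assumption: known_words is a plain set of surface forms standing in
    for a real morphological analyser.
    """
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


def unknown_by_frequency(tokens, known_words):
    """Unknown word forms, most frequent first.

    Adding dictionary entries in this order gives the largest coverage
    gain per entry.
    """
    counts = Counter(token for token in tokens if token not in known_words)
    return [word for word, _ in counts.most_common()]


# Toy corpus built from the Cyrillic sample phrase used earlier on this page.
corpus = "монгол үсэг монгол кирилл монгол үсэг".split()
dictionary = {"монгол"}  # hypothetical starting dictionary

print(naive_coverage(corpus, dictionary))        # 0.5
print(unknown_by_frequency(corpus, dictionary))  # ['үсэг', 'кирилл']
```

Working through the unknown list from the top is what "adding words in decreasing frequency" amounts to in practice: each new entry covers as many running tokens as possible.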
Work period
Part 1 (weeks 1-4):
Weeks 1-3:
- Write test scripts
- Add transfer rules for nouns, pronouns
- Start working on pronouns, adverbs, and adjectives
- Add appropriate rules/stems
- Achieve a WER < 20% for 1 basic text
- Add transfer rules for adjectives, adverbs
- Take another 500-word story
- Target: WER < 50%
- Post-edit translated texts; analyze them for common patterns and add rules
- Finish with lexical selection rules and chunking
- Start working on disambiguation and its solutions
- Refactoring and documentation
Week 4:
- Run corpus testing to analyze the improvement
- Improve morphological analyzer
- Deliverable #1, June 11 - 15
Part 2 (weeks 5-8):
Week 5:
- Find good parallel corpora and add words in decreasing frequency in apertium-mvf
- Coverage ~45%
- In parallel, start working on the khk-mvf bilingual dictionary
Week 6:
- Work on a ~700-word story
- Calculate WER, PER, and document
- Target WER <= 40%
- Even up nouns, pronouns
- Even up for verbs, adjectives, adverbs
Week 7:
- Testvoc clean for all classes
- Work on transfer grammar rules (t1x) using the common rules generated from post-edit analysis
- WER <= 30%
- Bidix-coverage ~45%
Week 8:
- Continue working on khk-mvf pair:
- Add transfer rules for nouns, pronouns
- Add transfer rules for verbs, adjectives, adverbs
- Start working on CG and disambiguation
- Deliverable #2, July 9 - 13
Part 3 (weeks 9-12):
Week 9:
- Continue working on disambiguation and its solutions
- Add required transfer/lexical selection rules to improve WER, PER
- Begin with chunking and t3x
Weeks 10-12:
- Get another ~700-token story for mvf-khk and improve WER
- Target WER <= 25%
- Regression testing for mvf-khk pair
- Evaluate test results, make the required changes, run tests again
- User acceptance testing, trying evaluation
- Regression testing for two pairs
- Achieve WER < 10% on all previous advanced texts and 3 new advanced texts
- Discuss with the mentor about some final changes that must be made
- Detailed analysis on what further improvement could be made for the pairs
- Evaluation of results and documentation
- Final evaluation, August 6 - 14
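The WER and PER targets in the work plan could be measured roughly as follows. This is a minimal sketch in Python, not Apertium's own evaluation tooling: the function names `wer` and `per` are illustrative, tokenization is plain whitespace splitting, and PER follows one common bag-of-words definition, which is an assumption since the plan does not fix a formula.

```python
from collections import Counter


def _tokens(reference, hypothesis):
    """Whitespace-tokenize both texts and reject an empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return ref, hyp


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = _tokens(reference, hypothesis)
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate (assumed bag-of-words definition)."""
    ref, hyp = _tokens(reference, hypothesis)
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Placeholder tokens: the post-edited text is the reference,
# the raw machine translation is the hypothesis.
print(wer("a b c d", "a c b d"))  # 0.5
print(per("a b c d", "a c b d"))  # 0.0
```

In the example the two middle words are swapped, so WER counts two errors out of four words while PER counts none. Reporting both numbers helps separate word-order problems from wrong or missing words.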
Skills and Qualifications
Current field of study/major: I am currently double majoring in Computer Science and Cognitive Science at Swarthmore College (Pennsylvania, USA).
Relevant technical skills: Python, C, C++, Data Structures and Algorithms.
Relevant work experience: I have worked at “Gazar Ord” LLC (Ulaanbaatar, Mongolia) for 46 months as an assistant data analyst and a Russian language translator.
Languages:
- Mongolian (Native Speaker)
- Russian (Advanced, Three-time gold medalist of the National Russian Language Olympiads - 2014, 2016, 2017)
- English (Advanced)
- Turkish (Elementary)
- Buryat language (Intermediate)
Non-Summer-of-Code plans you have for the Summer
My last final exam is on May 15, so from then on I will spend ~45 hours per week on this project.