User:Padth4i/GSoC 2020 Proposal: Improving upon Malayalam English language pair
Contact details
Name: Vaishnav Sivaprasad
E-mail: vaishnavs4201@gmail.com
IRC: padth4i
Github: https://github.com/padth4i
Location: Kollam, Kerala, India
Timezone: UTC +0530
Why is it that you are interested in Apertium?
One of the reasons I am interested in this project is the organisation's open-source nature. Languages are spoken by large communities and are therefore constantly evolving. An open-source environment means that anyone can contribute new words and grammar rules, which leads to better and more well-rounded translations. Malayalam is relatively widely spoken, but only a few reliable resources are available for it. This project addresses that gap by building a collected set of dictionaries and transfer rules that other projects can reuse for spell-checkers, localisation and so on.
Which of the published tasks are you interested in? What do you plan to do?
Title: Apertium translation pair for Malayalam and English
I plan to improve the Apertium translation pair for Malayalam and English, which is currently in the “Incubator” stage. Malayalam is a Dravidian language spoken mainly in the Indian state of Kerala and in the union territories of Lakshadweep and Puducherry.
Why should Google and Apertium sponsor it?
There is very little translation support on the internet for Malayalam or for the other Dravidian languages, such as Kannada, Tamil and Telugu. A large amount of crowdsourced data is readily available for Malayalam, including loanwords and a large number of synonyms (the Datuk corpus [0] and the Olam English-Malayalam dictionary [1], with over 200,000 crowd-contributed entries). If this data is integrated into Apertium, the translation pair could become more robust and flexible than other translation services. It would also provide a basis on which other language pairs involving Dravidian languages can be built.
How and whom will it benefit in society?
Malayalam, like most other Indian languages, is very diverse and has a large number of speakers. The translation services found online very often return incorrect translations, which is a serious shortcoming for a language with around 37 million speakers. The data collected for this project can also help the many other open-source projects based in Kerala.
Workplan
Community Bonding
Find and learn from Malayalam grammar resources (dictionaries, grammar rules).
Discuss the deliverables with the community and make any changes in the work plan if necessary
Week 1
Compile cleaned up frequency lists for each part of speech in Malayalam.
Add nouns and adjectives
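As an illustration of the frequency-list step, a minimal Python sketch is given below. It assumes the input is raw text from a corpus and that "cleaning up" means keeping only tokens written in Malayalam script; the function name `frequency_list` and this cleaning rule are hypothetical and not taken from any existing Apertium tool. Splitting the list by part of speech would additionally need a morphological analyser or manual tagging, which the sketch does not attempt.

```python
import re
from collections import Counter

# Runs of characters from the Malayalam Unicode block (U+0D00-U+0D7F),
# plus ZWNJ/ZWJ, which can occur inside Malayalam words.
MALAYALAM_TOKEN = re.compile(r"[\u0D00-\u0D7F\u200C\u200D]+")


def frequency_list(text):
    """Return (word, count) pairs, most frequent first, ties sorted by word."""
    tokens = []
    for token in MALAYALAM_TOKEN.findall(text):
        # Drop stray joiners at the edges; skip tokens that were only joiners.
        token = token.strip("\u200C\u200D")
        if token:
            tokens.append(token)
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical sample: Latin-script words and digits are filtered out.
    sample = "വീട് വലിയ വീട് the house 123"
    for word, count in frequency_list(sample):
        print(count, word)
```

Sorting by descending count makes it easy to add the most frequent words to the dictionaries first.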
Week 2
Write transfer rules for nouns and adjectives
Week 3
Add verbs, adverbs, more adjectives and other parts of speech to the bilingual dictionary
Write transfer rules for verbs
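To show what adding a word to the bilingual dictionary involves, here is a small Python sketch that builds one entry in the usual Apertium bidix shape: an `<e>` element containing a `<p>` pair with a left side `<l>` and a right side `<r>`, each carrying the lemma and a part-of-speech symbol `<s n="..."/>`. The helper name `bidix_entry` is hypothetical, and putting Malayalam on the left and English on the right is an assumption about the pair's direction. Multiword lemmas and additional tags are not handled.

```python
import xml.etree.ElementTree as ET


def bidix_entry(left_lemma, right_lemma, pos):
    """Build one bilingual dictionary entry as an XML string.

    Assumes a single-word lemma on each side and the same
    part-of-speech symbol (for example "n" for noun) on both.
    """
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side_tag, lemma in (("l", left_lemma), ("r", right_lemma)):
        side = ET.SubElement(pair, side_tag)
        side.text = lemma                      # lemma comes first
        ET.SubElement(side, "s", {"n": pos})   # then the POS symbol
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical example entry: Malayalam "വീട്" paired with English "house".
    print(bidix_entry("വീട്", "house", "n"))
```

Generating entries from a script like this is mainly useful for bulk imports from a word list; individual entries are usually just as easy to write by hand.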
Week 4
Continue writing transfer rules for verbs
Run post-translation tests and find the areas to improve
Update documentation
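One simple way to find areas to improve is to measure how many words the analyser recognises. The Python sketch below assumes the analyser output is in the Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word is marked with a leading `*` in its analysis. The function name `naive_coverage` is hypothetical, and escaped characters inside lexical units are not handled.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2...$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the fraction of lexical units with at least one known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(
        1 for _surface, analyses in units if not analyses.startswith("*")
    )
    return known / len(units)


if __name__ == "__main__":
    # Hypothetical analyser output: two known words, two unknown words.
    sample = "^house/house<n><sg>$ ^blorp/*blorp$ ^big/big<adj>$ ^zzz/*zzz$"
    print(f"coverage: {naive_coverage(sample):.2f}")
```

Listing the unknown surface forms by frequency then gives a direct to-do list for the next round of dictionary work.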
First Evaluation
Deliverables: Cover the most commonly used nouns, verbs and other parts of speech. The pair should be able to translate simple sentences taken from the James and Mary story from English to Malayalam and vice versa.
Week 5
Even up nouns and adjectives
Week 6
Even up verbs and other parts of speech
Week 7
Extend bilingual dictionary
Week 8
Run post-translation tests.
Update documentation
Second Evaluation
Deliverables: Cover most words used in an average conversation, as well as borrowed words. The pair should be able to translate more complex sentences taken from the James and Mary story and from two Wikipedia articles, from English to Malayalam and vice versa.
Week 9
Extend bilingual dictionary
Work on transfer rules
Week 10
Extend bilingual dictionary
Add multiwords
Work on transfer rules
Week 11
Run final tests
Fix any issues
Week 12
Brush up project and documentation
Prepare for final evaluation
Final Evaluation
Skills
Language skills: Malayalam (Native), English (Advanced), Hindi (Intermediate), Sanskrit (Beginner), Japanese (Beginner)
Programming skills: C/C++, Java, Dart, Flutter, XML. Experienced at writing Bash and Python scripts.
Non-Summer-of-Code plans for the Summer
I will have semester exams sometime in May and periodical exams in June.
References
[0]: The Datuk corpus
[1]: Olam English-Malayalam dictionary dataset