User:Padth4i/GSoC 2020 Proposal: Improving upon Malayalam English language pair

Contact details

Name: Vaishnav Sivaprasad

E-mail: vaishnavs4201@gmail.com

IRC: padth4i

Github: https://github.com/padth4i

Location: Kollam, Kerala, India

Timezone: UTC+05:30

Why is it that you are interested in Apertium?

One reason I am interested in this project is Apertium's open-source nature. Languages are spoken by millions of people and are therefore constantly evolving. An open-source environment means that anyone can contribute new words and grammar rules, which leads to better, more well-rounded translations. Malayalam is fairly widely spoken, but few reliable language resources are available for it. This project addresses that gap by serving as a collected data set of dictionaries and transfer rules that other projects can reuse for spell-checkers, localisation and so on.

Which of the published tasks are you interested in? What do you plan to do?

Title: Apertium translation pair for Malayalam and English

I plan to improve the Apertium translation pair for Malayalam and English, which is currently in the “Incubator” stage. Malayalam is a Dravidian language spoken mainly in Kerala and in the union territories of Lakshadweep and Puducherry, India.

Why should Google and Apertium sponsor it? There is very little translation support on the internet for Malayalam, or for the other Dravidian languages such as Kannada, Tamil and Telugu. A large amount of crowdsourced data is readily available for Malayalam, including loanwords and a large number of synonyms: the Datuk corpus [0] and the Olam English-Malayalam dictionary [1] with over 200,000 crowd-contributed entries. With this data integrated into Apertium, the translation pair could be more robust and flexible than other translation services. It would also provide a basis on which other language pairs involving Dravidian languages can be built.

Whom in society will it benefit, and how? Malayalam, like most other Indian languages, is very diverse and has a large number of speakers. The translation services found online often return incorrect translations, which is a real inconvenience for a language with around 37 million speakers. The data collected for this project can also help the many other open-source projects based in Kerala.

Workplan

Community Bonding

Find and learn from Malayalam grammar resources (dictionaries, grammar rules).

Discuss the deliverables with the community and adjust the work plan if necessary.


Week 1

Compile cleaned-up frequency lists for each part of speech in Malayalam.

Add nouns and adjectives
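
As a rough illustration of the Week 1 frequency lists, the sketch below counts Malayalam surface forms in plain text. It is only a starting point under stated assumptions: the sample sentences are made up for the example, the tokeniser simply matches runs of characters from the Malayalam Unicode block, and splitting the list by part of speech would still need a morphological analyser or manual tagging, which is not shown.

```python
import re
from collections import Counter

# Assumption: a token is a run of characters from the Malayalam Unicode
# block (U+0D00 to U+0D7F), plus ZWNJ/ZWJ, which can occur inside words.
TOKEN_RE = re.compile(r"[\u0D00-\u0D7F\u200C\u200D]+")


def frequency_list(lines):
    """Return (surface form, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line))
    return counts.most_common()


# Hypothetical sample input; a real run would read a corpus file instead.
sample = ["അവൻ വീട്ടിൽ പോയി.", "അവൻ പുസ്തകം വായിച്ചു."]
freq = frequency_list(sample)

for form, count in freq:
    print(count, form)
```

Sorting by frequency makes it easier to add the most common words to the dictionaries first.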

Week 2

Write transfer rules for nouns and adjectives

Week 3

Add verbs, adverbs, more adjectives and other parts of speech to the bilingual dictionary

Write transfer rules for verbs

Week 4

Continue writing transfer rules for verbs

Run post-translation tests and find the areas to improve

Update documentation


First Evaluation

Deliverables: Coverage of the most commonly used nouns, verbs and other parts of speech. The pair should be able to translate simple sentences taken from the James and Mary story from English to Malayalam and vice versa.


Week 5

Even up nouns and adjectives

Week 6

Even up verbs and other parts of speech

Week 7

Extend bilingual dictionary
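
Extending the bilingual dictionary from a word list could be partly scripted. The sketch below builds one entry in the usual Apertium bilingual dictionary shape (an `e` element containing a `p` pair with `l` and `r` sides and `s` tag symbols). The helper name `bidix_entry`, the choice of Malayalam on the left side, and the example words are assumptions made for illustration; multiword lemmas and direction restrictions are not handled here.

```python
import xml.etree.ElementTree as ET


def bidix_entry(left, right, tags):
    """Build one <e><p><l>..</l><r>..</r></p></e> entry as a string.

    `left` and `right` are lemmas; `tags` are symbol names such as "n"
    (noun) or "adj" (adjective), applied to both sides. This hypothetical
    helper does not handle spaces in multiword lemmas.
    """
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side_name, lemma in (("l", left), ("r", right)):
        side = ET.SubElement(pair, side_name)
        side.text = lemma
        for tag in tags:
            ET.SubElement(side, "s", {"n": tag})
    return ET.tostring(entry, encoding="unicode")


# Assumed example pairs: Malayalam lemma, English lemma, tags.
word_list = [("വീട്", "house", ["n"]), ("നല്ല", "good", ["adj"])]
entries = [bidix_entry(mal, eng, tags) for mal, eng, tags in word_list]

for line in entries:
    print(line)
```

Generated entries would still need manual review before being merged into the dictionary file.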

Week 8

Run post-translation tests.

Update documentation
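
One common way to score post-translation tests is the word error rate (WER) of the system output against a reference translation. The plan above does not say which metric will be used, so the following is only a minimal sketch of that idea, with a hypothetical function name and made-up example sentences.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # word missing from hypothesis
                cur[j - 1] + 1,                       # extra word in hypothesis
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# Assumed example: the hypothesis drops one of five reference words.
score = word_error_rate("James went to the market", "James went to market")
print(score)
```

Tracking this score on the same test sentences after each round of dictionary and rule changes shows whether the changes actually help.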


Second Evaluation

Deliverables: Coverage of most words used in an average conversation, as well as borrowed words. The pair should be able to translate more complex sentences taken from the James and Mary story and two Wikipedia articles from English to Malayalam and vice versa.


Week 9

Extend bilingual dictionary

Work on transfer rules

Week 10

Extend bilingual dictionary

Add multiwords

Work on transfer rules

Week 11

Run final tests

Fix any issues

Week 12

Brush up project and documentation

Prepare for final evaluation


Final Evaluation

Skills

Language skills: Malayalam (Native), English (Advanced), Hindi (Intermediate), Sanskrit (Beginner), Japanese (Beginner)

Programming skills: C/C++, Java, Dart, Flutter, XML. Experienced at writing Bash and Python scripts.

Non-Summer-of-Code plans for the Summer

I have semester exams sometime in May and periodic exams in June.

References

[0]: The Datuk corpus

[1]: Olam English-Malayalam dictionary dataset