User:Amanmehta/Application


Contact details

Name: Aman Mehta
E-mail: amanmehta1997@gmail.com
Svn: aman-mehta
IRC-nick: amanmehta
Mobile: +91 8329139961
Timezone: UTC+05:30
Github link: https://github.com/amanmehta-maniac
I stay online on IRC most of the time so that I am easily reachable.


Interest in machine translation

I am passionate about computers, and the automation of tasks such as translation fascinates me. What catches my interest is the core problem: translating a text from one language to another cannot be solved by simple word substitution. The idea of building a translation system and automating translation intrigues me. Because MT gives people the opportunity to access knowledge in multiple languages, it can play a pivotal role in the mission of education for all, not only in India but across the globe. It fits the idea that knowledge should be free and accessible to all, which I strongly believe in. Working on machine translation serves both my interest and my motivation.

Published tasks of interest and project goals

I plan to “Adopt an unreleased language pair”, or to be precise, three language pairs: mar-hin, guj-hin and mar-guj. The mar-hin and guj-hin pairs are in the incubator, and the mar-guj pair is still unreleased. My goal is to bring both the incubator pair mar-hin and the unreleased pair mar-guj to release quality. I also plan to expand the dictionaries for the guj-hin pair and to improve its coverage and WER as far as possible.


Interest in Apertium

Given my interest in machine translation, I decided to contribute to Apertium, and I enjoy doing so. I developed my interest in the Apertium project over the last couple of months, during which I spent my time resolving a few svn bugs and improving the mar-hin pair. Apertium is, at present, one of the best open-source machine translation platforms. Spending my summer working on this platform would give me the opportunity to contribute to an area that fascinates me.


Reasons for Google and Apertium to sponsor

The mar-hin and mar-guj pairs can be brought to production quality with comparatively little effort because the languages are lexically similar. I am well acquainted with Apertium as well as with the languages in the pairs I propose to work on. The odds of finding a polyglot who could add these pairs to Apertium in a single summer are probably low. If successful, the project would add a couple of language pairs to Apertium, which would triple the number of Indian language pairs. The release of these pairs could also help Apertium expand to pairs for many other Indian languages. I joined Apertium about two months ago and am now quite familiar with it. I have fixed quite a few bugs on svn. For about a month I have been working on the mar-hin pair, where I added coverage for adverbs and adjectives by scraping <avy> tags, so I have a good sense of what is needed to bring this pair to release quality. For details of the tasks I have completed, see the section “Tasks completed till date”.


Who it will benefit in society and how

  • Who?
    • Over 70 million Marathi speakers
    • Over 50 million Gujarati speakers
    • People living in a state where their native language is not the local language
      • E.g. a Gujarati speaker in Maharashtra (like myself)
  • How?
    • A translator available as an aid for learning languages
    • Access to Hindi information
      • Hindi media/newspapers
    • Wider availability of Hindi books in Marathi and Gujarati, and vice versa.

Eventually, this helps people with different native languages share a space and reduces the communication gap.


Workplan

I plan to begin with the mar-hin pair, on which I have already started working. My goal for roughly the first month is to develop mar-hin close to release quality. The pair's monolingual dictionaries and morphological analysers are already decent, so in this month I would focus on the bilingual dictionary, on building a good translator, and on adding transfer and lexical selection rules.

  • Target WER <= 20%
  • Target coverage ~70%
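
To make the WER target concrete, here is a minimal sketch of how word error rate (WER) and position-independent error rate (PER) can be computed for one sentence pair. It assumes whitespace tokenisation and one common definition of PER, PER = (max(N_ref, N_hyp) - matches) / N_ref; the function names are illustrative, and this is not necessarily how the figures in this proposal were computed.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: like WER, but word order is ignored."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # Multiset intersection counts words present in both, regardless of position.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Placeholder tokens; the reference would be a post-edited translation.
print(wer("a b c d", "a c b d"))  # two words out of place
print(per("a b c d", "a c b d"))  # same words, so no position-independent errors
```

Comparing the two numbers is informative: a large gap between WER and PER suggests that word order, and therefore the transfer rules, is the main source of error, rather than missing vocabulary.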

For the remaining two months, my main focus will be to develop the unreleased mar-guj pair close to release quality and release it, and in parallel to expand and improve the dictionaries for the guj-hin pair as far as time permits.
For the mar-guj pair:

  • Target WER <= 20%
  • Target coverage ~65%

Detailed week-wise workplan

week dates goals eval accomplishments notes
post-application period
22 March - 20 April


  • Reduce WER as much as possible. (target ~60%)
  • Run tests on other datasets/articles.
  • Read Apertium’s documentation
community bonding period
21 April - 19 May
  • Study the rules of the Hindi-English language pair thoroughly.
  • Prepare a corpus.
  • Document current coverage and state of the system.
  • Run testvoc and regression tests.
1 19 - 24 May
  • Increase in coverage of 1% over Wikipedia (>= 84.7%)
  • Decrease in error rate 2% over test corpora
  • Testvoc clean in all classes.
2 25 - 31 May
  • Continue cleanup for closed class words.
  • Select a corpus, clean it, divide it into segments (test and dev)
  • Start post-editing.
3 1 - 7 June
  • Finish post-editing.
  • Generate frequency table of the corpus
  • Calculate coverage of morph analyser and bilingual dictionary.
4 8 - 14 June
  • Expand dictionaries by adding unknown words.
  • Calculate increment in coverage.
  • Evaluate the output (WER/PER)
Deliverable #1
15 June

Dictionary, corpus, post-edits, coverage and evaluation results.

5 15 - 21 June
  • Expand dictionaries by adding unknown words.
  • Calculate increment in coverage.
  • Evaluate the output (WER/PER)
midterm eval
23 - 27 June
6 29 June - 5 July
  • Analysis of post-edits, generate common rules for most of the sentences.
  • Test for (@,#,*)-appended tokens (if any) and remove them.
  • Add lexical selection rules.
7 6 - 12 July
  • Calculate new WER/PER.
  • Reduce PER to below 60% and document the results.
  • Start working on transfer grammar rules (t1x) using the common rules generated from post-edit analysis.
8 13 - 19 July
  • Continue working on t1x.
  • Test the rules, new output and evaluate the whole experimentation.
  • Clean up the post-edits
  • Start working on chunking (t2x)
Deliverable #2
20 July

Report of post-edit analysis, test results of experimentation on rules (change in WER/PER), expanded dictionaries and coverage.

9 20 - 26 July
  • Finish working on t2x.
  • Work on t3x and t4x.
  • Read up on Constraint Grammar (for implementing context-based lexical selection)
10 27 July - 2 August

Start looking into Constraint Grammar and disambiguation rules.

11 3 - 10 August
  • Continue working on CG + disambiguation rules.
  • Test on post-edits, document the results, and compare system output before and after the changes were made.
12 11 - 18 August
  • Evaluate test results, make the required changes, run tests again.
  • Open system for beta-release and have a native speaker test it.
  • Documentation of results.
pencils-down week
final evaluation
18 August - 22 August
  • Documentation, Evaluation, Refactoring
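
The frequency-table and dictionary-expansion steps in the workplan can be supported by a small script that lists unknown words by frequency, so that the most frequent ones are added to the dictionaries first. The sketch below assumes the corpus has already been run through the morphological analyser (for instance with `lt-proc`) and is in the Apertium stream format, where an unknown word appears as `^surface/*surface$`. The function name and the sample tokens are illustrative only.

```python
import re
from collections import Counter

# Assumed input: Apertium stream format, ^surface/analysis1/analysis2$ per
# lexical unit, with unknown words written as ^surface/*surface$.
# This simple pattern ignores escaped characters such as \^ or \$ inside a unit.
UNKNOWN_UNIT = re.compile(r"\^([^/$]*)/\*[^$]*\$")


def unknown_frequencies(analysed_text):
    """Count how often each unknown surface form occurs in analysed text."""
    return Counter(UNKNOWN_UNIT.findall(analysed_text))


# Placeholder tokens standing in for analysed corpus text.
sample = "^foo/*foo$ ^house/house<n><sg>$ ^foo/*foo$ ^bar/*bar$"
for word, count in unknown_frequencies(sample).most_common():
    print(count, word)
```

Working down such a list from the top gives the largest coverage gain per dictionary entry added.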

Skills

I am a second-year student pursuing a BTech and MS by research in Computer Science at the International Institute of Information Technology. I am proficient in databases, computer programming, data structures, algorithms and artificial intelligence, and I am very comfortable with Python, XML, bash scripting and C++. I am a coding enthusiast and can stay focused for long hours when coding. I have gone through the Machine Translation course mentioned on the wiki and experimented with the mar-hin pair to get a sense of how machine translation works in Apertium. I have contributed to Mozilla in the past and can work with large files. On the linguistic side, I can read and write Hindi very well. I am a native Gujarati speaker and can converse, read and write in Marathi well.

Non-GSoC commitments

I don’t have many other commitments this summer and can spend ~40 hours per week on this project. My college term will start around the end of July, but I will still be able to dedicate at least 30 hours per week. Over the complete summer, I will maintain an average of at least 35 hours per week.


Tasks completed till date

  • Set up the Apertium environment and solved a few bugs on svn to get a good feel for the kinds of issues that occur
  • Found major gaps in the mar-hin pair, namely
    • No rules to handle transitive and intransitive verbs
    • Pronouns missing
    • Adjectives and adverbs tagged and mapped incorrectly
    • Many basic nouns missing in the Marathi monolingual dictionary and correspondingly in the bilingual dictionary
  • Scraped the <avy> tags down to the corresponding <adv> and <adj> tags in the bilingual dictionary, improving coverage for adjectives and adverbs (which improved coverage by ~5.5% on a Wikipedia corpus)
  • Added and corrected tags of some very common adjectives which had been wrongly tagged as adverbs
  • Coding challenge:
    • Analysis:
      • Only 27% known tokens
      • About 20% of unknown words are intransitive verbs
      • About 15% of unknown words are pronouns and their lexicals
    • Listing intransitive verbs and adding transfer rules to handle transitive/intransitive verbs. (ongoing)
    • Rules to handle pronouns. (few added, more to add)
    • Initial Status:
      • WER: ~87%
      • PER: ~84%
      • Coverage: ~27%
    • Current Status (ongoing):
      • WER: ~74%
      • PER: ~69%
      • Coverage: 74.5%
  • Improved Marathi bidix coverage on a Wikipedia corpus (~3.6 million tokens) from 34% to 49%, reducing the count of unknown tokens by ~53k
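
The coverage figures quoted above are of the "naive coverage" kind: the share of corpus tokens that receive at least one analysis. The sketch below shows one way to compute it, assuming the analyser output is in the Apertium stream format (`^surface/analysis$`, with unknown words marked by a leading `*` in the analysis). The function name and the sample tokens are illustrative, and this is not necessarily the script used for the numbers in this proposal.

```python
import re

# Assumed input: Apertium stream format, one ^surface/analyses$ unit per token.
# Escaped characters such as \^ or \$ inside a unit are not handled here.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the fraction of lexical units that have at least one analysis."""
    total = 0
    known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An unknown word is analysed as "*surface".
        if not analyses.startswith("*"):
            known += 1
    if total == 0:
        raise ValueError("no lexical units found in input")
    return known / total


# Placeholder tokens: two known and two unknown lexical units.
sample = "^house/house<n><sg>$ ^foo/*foo$ ^is/be<vbser><pres>$ ^bar/*bar$"
print(f"coverage: {naive_coverage(sample):.1%}")
```

Naive coverage only says that a token was recognised, not that its analysis or translation is correct, which is why it is reported alongside WER and PER.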