Difference between revisions of "User:Amanmehta/Application"

From Apertium
 
(75 intermediate revisions by the same user not shown)
Line 1: Line 1:
  +
[[Category:GSoC_2017_Student_Proposals]]
=='''Contact details'''==
 
  +
== Contact details ==
Name: Aman Mehta<br />
 
  +
'''Name''': Aman Mehta<br />
E-mail: amanmehta1997@gmail.com<br />
 
  +
'''E-mail''': amanmehta1997@gmail.com<br />
Svn: aman-mehta<br />
 
IRC-nick: amanmehta<br />
+
'''Svn''': aman-mehta<br />
Mobile: +91 8329139961<br />
+
'''IRC-nick''': amanmehta<br />
Timezone: UTC+05:30<br />
+
'''Mobile''': +91 8329139961<br />
  +
'''Timezone''': UTC+05:30<br />
Github link: https://github.com/amanmehta-maniac<br />
 
  +
'''Github link''': https://github.com/amanmehta-maniac<br />
 
I am online on IRC most of the time, so I am easy to reach.<br />
   
  +
==Interest in machine translation==
 
== '''Interest in machine translation''' ==
 
 
I am passionate about computers, and the automation of tasks such as translation fascinates me. What catches my interest is the core problem: translating a text from one language to another cannot be solved by simply substituting words. The idea of building a system that automates translation intrigues me. Because MT gives people the opportunity to access knowledge in multiple languages, it can play a pivotal role in the mission of education for all, not only in India but across the globe. It fits the idea that knowledge should be free and accessible to all, which I strongly believe in. Working on machine translation therefore serves both my interest and my motivation.<br />
   
== '''Interested published tasks and project goals''' ==
+
== Interested published tasks and project goals ==
  +
I plan to “Adopt an unreleased language pair”, or to be precise, three language pairs: mar-hin, guj-hin and mar-guj. The mar-hin and guj-hin pairs are in the incubator, and the mar-guj pair is still unreleased. My goal is to bring both the incubator pair mar-hin and the unreleased pair mar-guj to release quality, with WER <=20% and bilingual dictionary coverage >=70% for both pairs. I also plan to expand the dictionaries for the guj-hin pair and to improve its coverage and WER (targeting WER <40%) as far as possible.
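As a rough illustration of the WER figure used in these targets, the sketch below computes word error rate as the word-level edit distance between a reference translation and the system output, divided by the number of reference words. This is the usual textbook definition, not necessarily the exact evaluation tooling behind the numbers in this proposal; the function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a target of WER <=20% means at most one word-level edit for every five words of a post-edited reference.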
 
I plan to “Adopt an unreleased language pair”, or to be precise, three language pairs: mar-hin, guj-hin and mar-guj. The mar-hin and guj-hin pairs are in the incubator, and the mar-guj pair is still unreleased. My goal is to bring both the incubator pair mar-hin and the unreleased pair mar-guj to release quality. I also plan to expand the dictionaries for the guj-hin pair and to improve its coverage and WER as far as possible.
 
 
 
== '''Interest in Apertium''' ==
 
 
Given my interest in machine translation, I decided to contribute to Apertium, and I enjoy doing so. I developed my interest in the Apertium project over the last couple of months, during which I resolved a few svn bugs and worked on improving the mar-hin pair. It is, at present, one of the best open-source machine translation platforms. Spending my summer working on this platform would let me contribute to an area that fascinates me.
 
 
 
== '''Reasons for Google and Apertium to sponsor''' ==
 
 
The mar-hin and mar-guj pairs can be brought to production quality without much effort due to lexical similarities. I am very well acquainted with Apertium as well as with the language pairs I am proposing to work on. The odds of finding a polyglot who could add these pairs to Apertium in a single summer are probably low. If successful, this would add a couple more language pairs to Apertium, which would triple the number of Indian language pairs. The release of these pairs could also help Apertium expand to many other Indian languages.
 
I joined Apertium about two months ago and am now very familiar with it. I have fixed quite a few bugs on svn. I have been working on the mar-hin pair and have added coverage for adverbs and adjectives by converting <avy> tags. Having worked on it for around a month, I have a good understanding of what is needed to bring this pair to release quality. For detailed information about my completed tasks, see the section “Tasks completed till date”.
 
   
  +
==Interest in Apertium==
  +
Given my interest in machine translation, I decided to contribute to Apertium, and I enjoy doing so. I developed my interest in the Apertium project over the last couple of months, during which I resolved a few svn bugs and worked on improving the mar-hin pair. Going through the MT course and then improving the mar-hin pair drew me towards developing language pairs for Apertium. It is, at present, one of the best open-source machine translation platforms. Spending my summer working on this platform would let me contribute to an area that fascinates me.
   
  +
==Reasons for Google and Apertium to sponsor==
== '''Who it will benefit in society and how''' ==
 
  +
The mar-hin and mar-guj pairs can be brought to release quality without much effort due to lexical similarities. I am very well acquainted with Apertium as well as with the language pairs I am proposing to work on. The odds of finding a polyglot who could add these pairs to Apertium in a single summer are probably low. If successful, this would add a couple more language pairs to Apertium, which would triple the number of Indian language pairs. The release of these pairs could also help Apertium expand to many other Indian languages.
  +
I have fixed quite a few bugs on svn. I have been working on the mar-hin pair and have improved its coverage by 15 percentage points on a Wikipedia corpus, adding around 53k tokens. I have also added coverage for adverbs and adjectives by converting <avy> tags. Having worked on it for around a month, I have a good understanding of what is needed to bring this pair to release quality. For detailed information about my completed tasks, see the section “Tasks completed till date”.
   
  +
== Who it will benefit in society and how ==
 
*'''Who?''' <br />
 
*'''Who?''' <br />
 
**Over 70 million Marathi speakers <br />
 
**Over 70 million Marathi speakers <br />
Line 43: Line 36:
 
Ultimately, this helps people with different native languages share a common space and reduces the communication gap.<br />
   
  +
== Tasks completed till date ==
  +
*Set up the Apertium environment, fixed a few bugs on svn, and went through the MT course (on the wiki).
  +
*Found major gaps in the mar-hin pair, namely
  +
**No rules to handle transitive and intransitive verbs
  +
**Pronouns missing
  +
**Adjectives and adverbs tagged and mapped incorrectly
  +
**Many basic nouns missing in the Marathi monolingual dictionary and correspondingly in the bilingual dictionary
  +
*Converted the <avy> tags to the corresponding <adv> and <adj> tags in the bilingual dictionary, improving coverage for adjectives and adverbs (coverage rose by ~5.5% on a Wikipedia corpus)
  +
*Added and corrected tags of some very common adjectives which had been wrongly tagged as adverbs
  +
*Coding challenge:
  +
**Analysis:
  +
***Only 27% known tokens (improved to 74% now)
  +
***About 20% of unknown words are intransitive verbs
  +
***About 15% of unknown words are pronouns and their lexicals
  +
**Listing intransitive verbs and adding transfer rules to handle transitive/intransitive verbs. (ongoing)
  +
**Rules to handle pronouns. (few added, more to add)
  +
**Initial Status:
  +
***WER: ~87%
  +
***PER: ~84%
  +
**Current Status (ongoing):
  +
***WER: ~63% (~24 percentage points lower)
  +
***PER: ~54% (~30 percentage points lower)
   
  +
*Improved Marathi bidix coverage on a Wikipedia corpus (~3.6 million tokens) from ~34% to ~49%, reducing the count of unknown tokens by ~0.54m (~1.9m still unknown).
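The coverage figures in the list above can be sanity-checked with a small sketch. It assumes "naive coverage" means the share of tokens that are not marked as unknown, and that unknown words carry a leading `*` in the output (the same mark referred to as one of the (@,#,*) tokens in the workplan). The function names `naive_coverage` and `tokens_gained` are hypothetical helpers, not part of any Apertium tool.

```python
def naive_coverage(analysed_text: str) -> float:
    """Share of whitespace-separated tokens that are not marked unknown.

    Assumption: unknown words are prefixed with '*'.
    """
    tokens = analysed_text.split()
    if not tokens:
        raise ValueError("text contains no tokens")
    unknown = sum(1 for token in tokens if token.startswith("*"))
    return (len(tokens) - unknown) / len(tokens)


def tokens_gained(total_tokens: int, before_pct: int, after_pct: int) -> int:
    """Number of tokens that became known when coverage rose."""
    return total_tokens * (after_pct - before_pct) // 100


if __name__ == "__main__":
    # Three of four tokens are known in this toy output.
    print(naive_coverage("known *unknown known known"))
    # Rounded figures from the list: ~3.6 million tokens, ~34% to ~49%.
    print(tokens_gained(3_600_000, 34, 49))
```

With the rounded figures quoted in the list, a rise from 34% to 49% on 3.6 million tokens corresponds to 540,000 newly known tokens, which matches the ~0.54m stated.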
== '''Workplan''' ==
 
   
  +
== Workplan ==
 
I plan to start with the mar-hin pair, on which I have already begun working. My goal is to bring the mar-hin pair close to release quality in roughly the first month. Its monolingual dictionaries and morphological analysers are already in decent shape, so during this month I will focus on the bilingual dictionary and on adding transfer and lexical selection rules to build a good translator.<br />
 
*Target WER<=20%<br />
 
*Target WER<=20%<br />
Line 51: Line 67:
 
For the remaining two months, my main focus will be to develop the unreleased mar-guj pair to close to release quality and, in parallel, to expand and improve the dictionaries for the guj-hin pair as much as time permits.<br />
 
For mar-guj pair:<br />
 
For mar-guj pair:<br />
*Target WER <=20%<br />
+
*Target WER<=20%<br />
 
*Target coverage~65%<br />
 
*Target coverage~65%<br />
  +
For guj-hin pair:<br/>
  +
*Target WER <40%
  +
*apertium-guj dictionary coverage >70%
  +
   
 
=== Detailed week-wise workplan ===
 
=== Detailed week-wise workplan ===
  +
{|class="wikitable"
'''Contact details'''
 
  +
! week
 
  +
! dates
Name: Aman Mehta
 
  +
!style="width: 40%"| goals
 
  +
! eval
E-mail: [mailto:amanmehta1997@gmail.com amanmehta1997@gmail.com]
 
  +
!style="width: 23%"| accomplishments
 
  +
!style="width: 23%"| notes
Svn: aman-mehta
 
  +
|-
 
  +
!colspan="2" style="text-align: right"|
IRC-nick: amanmehta
 
  +
Post Application Period<br/>
 
Mobile: +91 8329139961
 
 
Timezone: UTC+05:30
 
 
Github link: https://github.com/amanmehta-maniac
 
 
I stay online on IRC for most of my time so as to be easily accessible
 
 
 
'''Interest in machine translation'''
 
 
I am passionate about computers. Automation of tasks such as translation fascinates me. The core problem that translation of a text from one language to other can’t be solved by simple substitution of words, catches my interest. The idea of building a translation system and automating translation intrigues me. As MT gives people opportunity to access knowledge in multiple languages, it can play a pivotal role in education for all mission, not only in India but also across the globe. It adheres to the idea that knowledge should be free and accessible to all, which even I believe in strongly. Working on and around machine translation, serves my interest as well as my motivation.
 
 
 
'''Interested published tasks and project goals'''
 
 
I plan to “Adopt an unreleased language pair”, or to be precise, three language pairs: mar-hin, guj-hin, mar-guj. Mar-hin and guj-hin pairs are in incubator and mar-guj pair is still unreleased. My goal is to bring incubator pair: mar-hin and an unreleased pair: mar-guj, both to release quality. I also plan on expanding dictionaries for guj-hin pair and make further improvements to coverage and WER to the extent possible.
 
 
 
'''Interest in Apertium'''
 
 
<div style="margin-left:0cm;margin-right:0cm;">Given my interest in machine translation, I decided to contribute to Apertium and enjoy adding my contribution to Apertium. I developed my interest in Apertium project in last couple of months during which I spent my time on resolving few svn bugs as well as on improving mar-hin pair. It is, at present, one of the best open-source machine translation platforms. Spending my summer to work for this platform would give me an opportunity to add my contribution in an area that fascinates me. </div>
 
 
 
<div style="margin-left:0cm;margin-right:0cm;">'''Reasons for Google and Apertium to sponsor '''</div>
 
 
The mar-hin and mar-guj pairs can be brought to a production quality without much effort due to lexical similarities. I am very well acquainted with apertium as well as with the language pairs I am proposing to work on. The odds of finding a polyglot who could add these pairs to Apertium in a single summer would probably be low. If successful, this would add a couple of more language pairs to Apertium which would triple the number of Indian language pairs. The release of these pairs could also help Apertium in expanding language pairs for many other Indian languages.
 
 
It has been ~2 months since I have joined Apertium and I am very much familiar with it. I have fixed quite a few bugs on svn. I have been working around mar-hin pair and I have been successful in adding coverage for adverbs and adjectives by scraping <avy> tags. It has been around a month and hence I have a very good gist on what all is needed to bring this pair to release quality. For detailed information about my tasks completed, refer to the section “Tasks completed till date”.
 
 
 
<div style="color:#000000;">'''Who it will benefit in society and how'''</div># <div style="margin-left:1.27cm;margin-right:0cm;">'''Who? '''</div>
 
## <div style="margin-left:2.54cm;margin-right:0cm;">Over 70 million Marathi speakers </div>
 
## <div style="margin-left:2.54cm;margin-right:0cm;">Over 50 million Gujarati speakers</div>
 
## <div style="margin-left:2.54cm;margin-right:0cm;">People belonging to non-native lingual state </div>
 
### <div style="margin-left:3.81cm;margin-right:0cm;">Eg: A gujarati speaker in Maharashtra (like myself)</div>
 
# <div style="margin-left:1.27cm;margin-right:0cm;">'''How?'''</div>
 
## <div style="margin-left:2.54cm;margin-right:0cm;">Translator available to learn languages </div>
 
## <div style="margin-left:2.54cm;margin-right:0cm;">Access to Hindi information</div>
 
### <div style="margin-left:3.81cm;margin-right:0cm;">Hindi media/newspapers</div>
 
## <div style="margin-left:2.54cm;margin-right:0cm;">Improved coverage of Hindi books to Marathi and Gujarati and vice-versa.</div>
 
 
 
 
Eventually helping people of different native languages to share space and reduce communication gap.
 
 
 
'''Workplan'''
 
 
I plan to work on mar-hin pair for which I have already started working. My goal is to develop mar-hin pair to close to release quality for roughly the first month. Mar-hin pair is already decent in the mono dictionaries and morphological analyzers and hence in this one month I would focus on bilingual dictionary, building a good translator, adding transfer and lexical selection rules.# <div style="margin-left:1.27cm;margin-right:0cm;">Target WER<=20%</div>
 
# <div style="margin-left:1.27cm;margin-right:0cm;">Target coverage~70%</div>
 
 
 
 
For the remaining two months, my main focus will be to release the unreleased mar-guj pair and develop it to close to release quality and parallely expand and improve dictionaries for guj-hin pair, as much as time permits.
 
 
For mar-guj pair:# <div style="margin-left:1.27cm;margin-right:0cm;">Target WER <=20%</div>
 
# <div style="margin-left:1.27cm;margin-right:0cm;">Target coverage~65%</div>
 
 
 
 
 
 
'''Detailed week-wise workplan:'''
 
 
 
 
{| style="border-spacing:0;width:18.627cm;"
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
| align=center| W
 
| align=center| Dates
 
| align=center| Goals
 
| align=center| Eval
 
| align=center| Accomplishments
 
| align=center| Notes
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
||
 
|| Post application period
 
 
 
[March 26 - May 3]
 
[March 26 - May 3]
  +
|
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Mar-hin: </div>
 
  +
*Mar-hin:
** <div style="margin-left:1.296cm;margin-right:0cm;">Coding challenge: Finish with good WER (target ~65%)</div>
 
  +
**Coding challenge: Finish with good WER (target <60%)
** <div style="margin-left:1.296cm;margin-right:0cm;">Get and analyse frequency list</div>
 
  +
**Prepare a corpus
** <div style="margin-left:1.296cm;margin-right:0cm;">Add transfer rules (t1x) in decreasing frequency</div>
 
  +
**Build and analyse frequency list for a wikipedia corpus
* <div style="margin-left:0.661cm;margin-right:0cm;">Learn about the scope of lttoolbox and other relevant tools and get acquainted</div>
 
  +
*Learn about the scope of lttoolbox and other relevant tools and get acquainted
* <div style="margin-left:0.661cm;margin-right:0cm;">Analyse opportunity to improve dictionaries (tag editing/expand dictionaries)</div>
 
  +
*Analyse opportunity to improve dictionaries (tag editing/expand dictionaries)
 
  +
*Target: Bidix-coverage ~52%
 
  +
|-
 
  +
!colspan="2" style="text-align: right"|
 
  +
Community Bonding Period<br/>
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
||
 
|| Community Bonding Period
 
 
 
[May 5 - May 30]
 
[May 5 - May 30]
  +
|
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Read documents thoroughly</div>
 
  +
*Read documents thoroughly
* <div style="margin-left:0.661cm;margin-right:0cm;">Study CG and rules for mar-hin.</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Discuss strategies/solutions with mentor to tackle loopholes.</div>
+
*Discuss strategies/solutions with mentor to tackle loopholes.
* <div style="margin-left:0.661cm;margin-right:0cm;">Add rules for transitive and intransitive verbs</div>
+
*Add transfer rules for transitive and intransitive verbs.
  +
*Testvoc and document state of the pair
* <div style="margin-left:0.661cm;margin-right:0cm;">Prepare a corpora</div>
 
  +
*Test for (@,#,*) tokens
* <div style="margin-left:0.661cm;margin-right:0cm;">Testvoc and document state of the pair</div>
 
  +
*First regression testing.
* <div style="margin-left:0.661cm;margin-right:0cm;">Add transfer rules (t1x)</div>
 
  +
*Target: Bidix-coverage ~60% (wikipedia corpus)
* <div style="margin-left:0.661cm;margin-right:0cm;">Test for (@,#,*) tokens</div>
 
  +
|-
* <div style="margin-left:0.661cm;margin-right:0cm;">First regression testing.</div>
 
  +
! 1 !! May 30 - June 5
  +
|
  +
*Even up nouns.
  +
*Add transfer rules for nouns, pronouns.
  +
*Start working for pronouns, adverbs and adjectives
  +
*Add appropriate rules/stems.
  +
*Bidix-coverage ~63%
  +
*Testvoc clean nouns
  +
*Start working with chunking (t2x)
  +
|-
  +
! 2 !! June 6 - June 12
  +
|
  +
*Add transfer rules for adjectives, adverbs
  +
*Take another 500-word story.
  +
**Target: WER <50%
  +
*Post-edit translated texts. Analyze and look for common rules and add rules
  +
*Coverage ~67%
  +
|-
  +
! 3 !! June 13 - June 19
  +
|
  +
*Testvoc clean for adjectives, adverbs
  +
*Add lexical selection rules
  +
*Corpus test, measure improvement, targets:
  +
**Bidix-coverage ~70%
  +
**WER <=25%
  +
|-
  +
! 4 !!
  +
June 20 - June 26
  +
|
  +
*Finish with lexical selection rules and chunking.
  +
*Start working on CG
  +
*Start working on disambiguation and its solutions
  +
*Refactoring and documentation.
  +
|-
  +
! 5 !!
  +
June 27 - July 3
  +
|
  +
*Run corpus testing to analyse improvement. Targets:
  +
**Coverage ~70%
  +
**WER <= 20% ('''Deliverable #1''')
  +
*Setup skeleton for mar-guj
  +
*Improve morphological analyzer if possible
  +
|-
  +
! 6 !!
  +
July 4 - July 10
  +
|
  +
*Find good parallel corpora and add words in decreasing frequency in apertium-guj.
  +
**Coverage ~45%
  +
*In parallel, improve coverage of the mar-guj bilingual dictionary
  +
**Bidix-coverage ~30%
  +
*Guj-hin Bidix-coverage improvement
  +
|-
  +
! 7 !! July 11 - July 17
  +
|
  +
*Work over a ~500 word story
  +
**Calculate WER, PER and document
  +
**Target WER <=55%
  +
*Even up nouns, pronouns
  +
*Even up for verbs, adjectives, adverbs
  +
|-
  +
! 8 !! July 18 - July 24
  +
|
  +
*Testvoc clean for all classes
  +
*Working on transfer grammar rules (t1x) using the common rules generated from post-edit analysis
  +
**WER <=40%
  +
*Bidix-coverage ~50%
  +
|-
  +
! 9 !! July 25 - July 31
  +
|
  +
*Add transfer rules for nouns, pronouns
  +
*Add transfer rules for verbs, adjectives, adverbs.
  +
*Start working on CG and disambiguation
   
  +
|-
  +
! 10 !!
  +
August 1 - August 6
  +
|
  +
*Continue working on disambiguation and its solutions.
  +
*Add required transfer/lexical selection rules to improve WER, PER.
  +
*Begin with chunking and t3x
  +
|-
  +
! 11 !! August 7 - August 13
  +
|
  +
*Get another ~500 token story for guj-hin and improve WER.
  +
**Target WER <=25%
  +
*Regression testing for mar-guj pair
  +
*Evaluate test results, make the required changes, run tests again
  +
*User acceptance testing, gisting evaluation.
  +
*Mar-guj pair ready for or close to trunk and guj-hin improved. (Deliverable #2 and #3)
  +
|-
  +
! 12 !! August 14 - August 21
  +
|
  +
*Regression testing for all the three pairs
  +
*Discuss with mentor about some final changes that must be made.
  +
*Documentation, final release of all three pairs.
  +
*Detailed analysis of what further improvements could be made to the pairs, for the future benefit of Apertium.
   
 
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 1
 
|| May 30 - June 5
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Even up nouns.</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Testvoc clean for noun</div>
 
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Start working for pronouns, adverbs and adjectives</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Add appropriate rules/stems.</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Start working with chunking (t2x)</div>
 
 
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 2
 
|| June 6 - June 12
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Post-edit translated texts. Analyze and look for common rules and add rules</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Testvoc clean for all classes</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Corpus test, measure improvement, targets: </div>
 
** <div style="margin-left:1.296cm;margin-right:0cm;">Coverage ~67%</div>
 
** <div style="margin-left:1.296cm;margin-right:0cm;">WER <25%</div>
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 3
 
|| June 13 - June 19
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Add more rules</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Finish with t2x.</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Get done with t3x and t4x.</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Read CG, lrx.</div>
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 4
 
|| June 20 - June 26
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Get most frequent mistakes at this stage</div>
 
 
* <div style="margin-left:0.661cm;margin-right:0cm;">If measurable improvement could be made, make required plan</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Run corpus testing to analyse to improvement. Target :</div>
 
** <div style="margin-left:1.296cm;margin-right:0cm;">Coverage ~70%</div>
 
** <div style="margin-left:1.296cm;margin-right:0cm;">WER <= 20%</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Run final tests, look for major mistakes and scope for corrections</div>
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 5
 
|| June 27 - July 3
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Update documentation</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Run tests again, release this version ('''Deliverable #1''')</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Setup skeleton for mar-guj</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Start with expanding apertium-guj dictionary</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Improve morphological analyzer if possible</div>
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 6
 
|| July 4 - July 10
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Find good parallel corpora from wikipedia and add words and rules in decreasing frequency, especially in apertium-guj. (target ~500 entries)</div>
 
** <div style="margin-left:1.296cm;margin-right:0cm;">Continue for guj-hin bilingual dictionary expansion parallely.*</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Even up nouns</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Testvoc nouns</div>
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 7
 
|| July 11 - July 17
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Even up verbs</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Even up adjectives, adverbs</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Add transfer rules</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Work over a ~500 word story for better analysis.</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Calculate WER, PER</div>
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 8
 
|| July 18 -
 
 
July 24
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Evaluate the system and identify most frequent errors</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Improve mar-guj bilingual dictionary adding to its coverage. </div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Add more transfer rules for verbs, adjectives, adverbs and pronouns, Target:</div>
 
** <div style="margin-left:1.296cm;margin-right:0cm;">Coverage (~60%)</div>
 
** <div style="margin-left:1.296cm;margin-right:0cm;">WER (<=30%)</div>
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 9
 
|| July 25 - July 31
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Testvoc cleanup for all classes</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Work over a large corpora</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Regression and corpus testing</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Work on CG and disambiguation</div>
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 10
 
|| August 1 - August 6
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Continue working on disambiguation and its solutions.</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Add required transfer/lexical rules to improve WER, PER.</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Begin with chunking, t3x and t4x.</div>
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 11
 
|| August 7 - August 13
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Add words and rules, starting with most frequent words (in decreasing frequency) for '''guj-hin pair.'''</div>
 
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Get most frequent mistakes for '''mar-guj pair'''</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Begin final testing for '''mar-guj pair'''</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Check for any major feedbacks to '''mar-hin pair''', work over it.</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">User acceptance testing, gisting evaluation.</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Mar-guj & mar-hin pairs ready for or close to trunk and guj-hin improved. '''(Deliverable #2 and #3)'''</div>
 
 
 
||
 
||
 
||
 
|- style="border:1pt solid #000001;padding:0.176cm;"
 
|| 12
 
|| August 14 - August 17
 
|| * <div style="margin-left:0.661cm;margin-right:0cm;">Documentation, final release of all three pairs.</div>
 
* <div style="margin-left:0.661cm;margin-right:0cm;">Detailed analysis on what further improvement could be made for the pairs, for future help to apertium.</div>
 
 
 
||
 
||
 
||
 
|-
 
 
|}
 
|}
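Several workplan items in the table above call for building a frequency list and adding words in decreasing frequency. A minimal sketch of that step follows, assuming the input is whitespace-tokenised text in which unknown words carry a leading `*`; the function name `unknown_frequency_list` and the sample input are hypothetical.

```python
from collections import Counter


def unknown_frequency_list(analysed_text: str, top_n: int = 10):
    """Return the most frequent unknown words with their counts.

    Assumption: unknown words are prefixed with '*'; the mark is
    stripped before counting.
    """
    counts = Counter(
        token[1:]
        for token in analysed_text.split()
        if token.startswith("*") and len(token) > 1
    )
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = "*ghar known *ghar *pani known"
    for word, count in unknown_frequency_list(sample, top_n=2):
        print(word, count)
```

Working down such a list means each new dictionary entry covers as many corpus tokens as possible, which is the reason for adding words in decreasing frequency.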
<div style="color:#000000;">''' '''</div>
 
 
<div style="color:#000000;">'''Tasks completed till date:'''</div>* <div style="margin-left:1.27cm;margin-right:0cm;">Found major loopholes in mar-hin pair, namely</div>
 
** <div style="margin-left:2.54cm;margin-right:0cm;">No rules to handle transitive and intransitive verbs</div>
 
** <div style="margin-left:2.54cm;margin-right:0cm;">Pronouns missing</div>
 
** <div style="margin-left:2.54cm;margin-right:0cm;">Adjectives and adverbs tagged and mapped incorrectly</div>
 
** <div style="margin-left:2.54cm;margin-right:0cm;">Many basic nouns missing in marathi monolingual dictionary and correspondingly in bilingual dictionary</div>
 
* <div style="margin-left:1.27cm;margin-right:0cm;">Scraped down the <avy> tags to corresponding <adv> and <adj> tags in bilingual dictionary, improving coverage for all adjectives and adverbs(which improved coverage by ~5.5% on wikipedia corpus)</div>
 
* <div style="margin-left:1.27cm;margin-right:0cm;">Added and corrected tags of some very common adjectives which had been wrongly tagged as adverbs</div>
 
* <div style="margin-left:1.27cm;margin-right:0cm;">Coding challenge: </div>
 
** <div style="margin-left:2.54cm;margin-right:0cm;">Analysis:</div>
 
*** <div style="margin-left:3.81cm;margin-right:0cm;">Only 27% known tokens</div>
 
*** <div style="margin-left:3.81cm;margin-right:0cm;">About ~20% of unknown words are intransitive verbs</div>
 
*** <div style="margin-left:3.81cm;margin-right:0cm;">About ~15% of unknown words are pronouns and their lexicals</div>
 
** <div style="margin-left:2.54cm;margin-right:0cm;">Listing intransitive verbs and adding transfer rules to handle intransitive verbs. (ongoing)</div>
 
** <div style="margin-left:2.54cm;margin-right:0cm;">Rules to handle pronouns. (few added, more to add)</div>
 
** <div style="margin-left:2.54cm;margin-right:0cm;">Initial Status:</div>
 
*** <div style="margin-left:3.81cm;margin-right:0cm;">WER: ~87%</div>
 
*** <div style="margin-left:3.81cm;margin-right:0cm;">PER: ~84%</div>
 
*** <div style="margin-left:3.81cm;margin-right:0cm;">Coverage: ~27%</div>
 
** <div style="margin-left:2.54cm;margin-right:0cm;">Current Status (ongoing):</div>
 
*** <div style="margin-left:3.81cm;margin-right:0cm;">WER: ~74%</div>
 
*** <div style="margin-left:3.81cm;margin-right:0cm;">PER: ~69%</div>
 
*** <div style="margin-left:3.81cm;margin-right:0cm;">Coverage: 74.5%</div>
 
* <div style="margin-left:1.27cm;margin-right:0cm;">Improved marathi bidix coverage for a wikipedia corpus (~3.6million tokens) from 34% to 49%. Reduced the count of unknown tokens by ~53k</div>
 
 
 
 
<div style="color:#000000;"></div>
 
 
'''Skills'''
 
 
I am a second-year student pursuing a BTech and MS by research in Computer Science at the International Institute of Information Technology. I have a solid knowledge of databases, computer programming, data structures, algorithms and artificial intelligence. I am very comfortable with Python, XML, bash scripting and C++. I am a coding enthusiast and can stay dedicated and focused for long hours when coding. I have gone through the Machine Translation course mentioned on the wiki and experimented with the mar-hin pair to get the gist of how machine translation works at Apertium. I have contributed to Mozilla in the past and can work with large files. On the linguistic side, I can read and write Hindi very well. I am a native Gujarati speaker and can converse, read and write in Marathi well.
 
 
   
  +
==Skills==
'''Non GSoC Commitment'''
 
  +
I am a second year student, pursuing BTech and MS by research in Computer Science at International Institute of Information Technology, Hyderabad. I have proficient knowledge about Databases, Computer Programming, Data Structures, Algorithms and Artificial Intelligence. I am very comfortable with python, XML, bash scripting and C++. I am a code-enthusiast and can dedicate, focus for long hours when it comes to it. I have gone through Machine Translation course mentioned on the wiki and played around mar-hin pair to get gist of how machine translation works at Apertium. I have contributed to Mozilla in past, and can work with large files. At the linguistic side, I can read and write Hindi very well. I am a native Gujarati and can converse, read & write in Marathi well.
   
  +
==Non GSoC Commitment ==
 
I don’t have many other commitments for this summer and I can spend ~40 hours per week for this project. My college curriculum will start around in July end, but I will still be able to dedicate at least 30 hours per week. Maintaining, on an average, at least 35 hours per week for the complete summer.
 
I don’t have many other commitments for this summer and I can spend ~40 hours per week for this project. My college curriculum will start around in July end, but I will still be able to dedicate at least 30 hours per week. Maintaining, on an average, at least 35 hours per week for the complete summer.

Latest revision as of 11:06, 29 April 2017

Contact details

Name: Aman Mehta
E-mail: amanmehta1997@gmail.com
Svn: aman-mehta
IRC-nick: amanmehta
Mobile: +91 8329139961
Timezone: UTC+05:30
Github link: https://github.com/amanmehta-maniac
I am online on IRC most of the time, so I am easy to reach.

Interest in machine translation

I am passionate about computers, and the automation of tasks such as translation fascinates me. What catches my interest is the core problem: translating a text from one language to another cannot be solved by simple word-for-word substitution. The idea of building a translation system and automating translation intrigues me. As MT gives people the opportunity to access knowledge in multiple languages, it can play a pivotal role in the education-for-all mission, not only in India but across the globe. It fits the idea that knowledge should be free and accessible to all, which I strongly believe in. Working on and around machine translation serves both my interest and my motivation.

Interested published tasks and project goals

I plan to “Adopt an unreleased language pair”, or to be precise, three language pairs: mar-hin, guj-hin and mar-guj. The mar-hin and guj-hin pairs are in the incubator, and the mar-guj pair is still unreleased. My goal is to bring both the incubator pair mar-hin and the unreleased pair mar-guj to release quality, with WER <=20% and bilingual dictionary coverage >=70% for both pairs. I also plan to expand the dictionaries for the guj-hin pair and to improve its coverage and WER (targeting <40%) as far as possible.

Interest in Apertium

Given my interest in machine translation, I decided to contribute to Apertium, and I enjoy doing so. My interest in the project developed over the last couple of months, during which I resolved a few svn bugs and worked on improving the mar-hin pair. Going through the MT course and then improving the mar-hin pair drew me to developing language pairs for Apertium. It is, at present, one of the best open-source machine translation platforms. Spending my summer working on this platform would let me contribute to an area that fascinates me.

Reasons for Google and Apertium to sponsor

The mar-hin and mar-guj pairs can be brought to release quality with comparatively little effort because the languages are lexically similar. I am well acquainted with Apertium and with the language pairs I propose to work on, and the odds of finding a polyglot who could add these pairs to Apertium in a single summer are probably low. If successful, this project would add a couple more language pairs to Apertium, which would triple the number of Indian language pairs. The release of these pairs could also help Apertium expand to many other Indian languages. I have fixed quite a few bugs on svn. I have been working on the mar-hin pair for around a month and have improved its coverage on a Wikipedia corpus by ~15 percentage points (from ~34% to ~49%), which corresponds to roughly 0.54 million newly covered tokens. I have also improved the coverage of adverbs and adjectives by converting <avy> tags. As a result, I have a good sense of what is needed to bring this pair to release quality. For details of the tasks I have completed, see the section “Tasks completed till date”.

Who it will benefit in society and how

  • Who?
    • Over 70 million Marathi speakers
    • Over 50 million Gujarati speakers
    • People living in a state where their native language is not the main language
      • E.g. a Gujarati speaker in Maharashtra (like myself)
  • How?
    • Translator available to learn languages
    • Access to Hindi information
      • Hindi media/newspapers
    • Improved coverage of Hindi books to Marathi and Gujarati and vice-versa.

Eventually, this will help people with different native languages share space and reduce the communication gap.

Tasks completed till date

  • Set up the Apertium environment, solved a few bugs on svn, and went through the MT course (on the wiki).
  • Found major gaps in the mar-hin pair, namely
    • No rules to handle transitive and intransitive verbs
    • Pronouns missing
    • Adjectives and adverbs tagged and mapped incorrectly
    • Many basic nouns missing in the Marathi monolingual dictionary and correspondingly in the bilingual dictionary
  • Mapped the <avy> tags to the corresponding <adv> and <adj> tags in the bilingual dictionary, improving coverage for adjectives and adverbs (which improved coverage by ~5.5% on the Wikipedia corpus)
  • Added and corrected the tags of some very common adjectives which had been wrongly tagged as adverbs
  • Coding challenge:
    • Analysis:
      • Only 27% known tokens (improved to 74% now)
      • About 20% of unknown words are intransitive verbs
      • About 15% of unknown words are pronouns and their lexical forms
    • Listing intransitive verbs and adding transfer rules to handle transitive/intransitive verbs. (ongoing)
    • Rules to handle pronouns. (few added, more to add)
    • Initial Status:
      • WER: ~87%
      • PER: ~84%
    • Current Status (ongoing):
      • WER: ~63% (a reduction of ~24 percentage points)
      • PER: ~54% (a reduction of ~30 percentage points)
  • Improved Marathi bidix coverage for a Wikipedia corpus (~3.6 million tokens) from ~34% to ~49%. Reduced the count of unknown tokens by ~0.54 million (~1.9 million still unknown).
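The WER and PER figures above compare a machine translation against a post-edited reference. The following is a minimal sketch of how such figures can be computed, assuming whitespace-tokenised sentences; the function names are illustrative, and the evaluation script actually used for the numbers above may define the metrics slightly differently.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (word order is ignored)."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return 100.0 * errors / len(ref)


# Hypothetical example: two words swapped relative to the reference.
reference = "a b c d".split()
hypothesis = "a c b d".split()
print(wer(reference, hypothesis), per(reference, hypothesis))
```

Because PER ignores word order, it is never higher than WER for the same pair of sentences, which matches the pattern in the figures above.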

Workplan

I plan to start with the mar-hin pair, on which I have already begun working. My goal for roughly the first month is to bring mar-hin close to release quality. Its monolingual dictionaries and morphological analysers are already decent, so in this month I will focus on the bilingual dictionary and on adding transfer and lexical selection rules.

  • Target WER<=20%
  • Target coverage~70%

For the remaining two months, my main focus will be to bring the unreleased mar-guj pair close to release quality and, in parallel, to expand and improve the dictionaries for the guj-hin pair as far as time permits.
For mar-guj pair:

  • Target WER<=20%
  • Target coverage~65%

For guj-hin pair:

  • Target WER <40%
  • apertium-guj dictionary coverage >70%


Detailed week-wise workplan

week dates goals eval accomplishments notes

Post Application Period
[March 26 - May 3]

  • Mar-hin:
    • Coding challenge: Finish with good WER (target <60%)
    • Prepare a corpus
    • Build and analyse frequency list for a wikipedia corpus
  • Learn about the scope of lttoolbox and other relevant tools and get acquainted
  • Analyse opportunity to improve dictionaries (tag editing/expand dictionaries)
  • Target: Bidix-coverage ~52%
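A frequency list like the one mentioned above can be built with a few lines of Python. This is a minimal sketch, assuming a plain-text corpus with one sentence or paragraph per line; the function name and the punctuation set are illustrative choices, not part of any Apertium tool.

```python
from collections import Counter

# Punctuation to strip from token edges, including the Devanagari danda signs.
PUNCT = ".,;:!?\"'()[]{}।॥-"


def frequency_list(lines):
    """Return (token, count) pairs sorted by decreasing frequency."""
    counts = Counter()
    for line in lines:
        # Split on whitespace rather than using the regex \w+, because
        # Devanagari vowel signs are combining marks that \w does not match.
        for raw in line.split():
            token = raw.strip(PUNCT)
            if token:
                counts[token] += 1
    return counts.most_common()


# Hypothetical two-line corpus in romanised form.
sample = ["ghar ghar pani", "pani ghar."]
for token, count in frequency_list(sample):
    print(count, token)
```

Working down such a list from the top is what makes it possible to add dictionary entries in decreasing order of frequency, so that each new entry gives the largest possible gain in coverage.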

Community Bonding Period
[May 5 - May 30]

  • Read documents thoroughly
  • Discuss strategies/solutions with mentor to tackle the gaps found.
  • Add transfer rules for transitive and intransitive verbs.
  • Testvoc and document state of the pair
  • Test for (@,#,*) tokens
  • First regression testing.
  • Target: Bidix-coverage ~60% (wikipedia corpus)
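In Apertium output, a token prefixed with `*` is an unknown word, `@` marks a lemma missing from the bilingual dictionary, and `#` marks a generation failure. The sketch below counts these markers in translated text and derives a naive coverage figure. It assumes whitespace-separated tokens with the marker as the first character; the function name is hypothetical and this is not how any official Apertium script computes coverage.

```python
def marker_report(translated_text):
    """Count *, @ and # marked tokens and the share of unmarked tokens."""
    counts = {"*": 0, "@": 0, "#": 0}
    tokens = translated_text.split()
    for token in tokens:
        if token[0] in counts:
            counts[token[0]] += 1
    total = len(tokens)
    clean = total - sum(counts.values())
    # Naive coverage: percentage of tokens that carry no error marker.
    coverage = 100.0 * clean / total if total else 0.0
    return counts, coverage


# Hypothetical translator output with one token of each marker type.
counts, coverage = marker_report("*ghar mein @pani #hai")
print(counts, coverage)
```

A marker that follows leading punctuation, such as `(*word`, would be missed by this simple check, so real output may need punctuation stripped first.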
Week 1: May 30 - June 5
  • Even up nouns.
  • Add transfer rules for nouns, pronouns.
  • Start working on pronouns, adverbs and adjectives
  • Add appropriate rules/stems.
  • Bidix-coverage ~63%
  • Testvoc clean nouns
  • Start working with chunking (t2x)
Week 2: June 6 - June 12
  • Add transfer rules for adjectives, adverbs
  • Take another 500-word story.
    • Target: WER <50%
  • Post-edit translated texts. Analyze and look for common rules and add rules
  • Coverage ~67%
Week 3: June 13 - June 19
  • Testvoc clean for adjectives, adverbs
  • Add lexical selection rules
  • Corpus test, measure improvement, targets:
    • Bidix-coverage ~70%
    • WER <=25%
Week 4: June 20 - June 26

  • Finish with lexical selection rules and chunking.
  • Start working on CG
  • Start working on disambiguation and its solutions
  • Refactoring and documentation.
Week 5: June 27 - July 3

  • Run corpus tests to analyse the improvement. Targets:
    • Coverage ~70%
    • WER <= 20% (Deliverable #1)
  • Setup skeleton for mar-guj
  • Improve morphological analyzer if possible
Week 6: July 4 - July 10

  • Find good parallel corpora and add words to apertium-guj in decreasing order of frequency.
    • Coverage ~45%
  • In parallel, improve coverage of the mar-guj bilingual dictionary
    • Bidix-coverage ~30%
  • Guj-hin Bidix-coverage improvement
Week 7: July 11 - July 17
  • Work over a ~500 word story
    • Calculate WER, PER and document
    • Target WER <=55%
  • Even up nouns, pronouns
  • Even up for verbs, adjectives, adverbs
Week 8: July 18 - July 24
  • Testvoc clean for all classes
  • Working on transfer grammar rules (t1x) using the common rules generated from post-edit analysis
    • WER <=40%
  • Bidix-coverage ~50%
Week 9: July 25 - July 31
  • Add transfer rules for nouns, pronouns
  • Add transfer rules for verbs, adjectives, adverbs.
  • Start working on CG and disambiguation
Week 10: August 1 - August 6

  • Continue working on disambiguation and its solutions.
  • Add required transfer/lexical selection rules to improve WER, PER.
  • Begin with chunking and t3x
Week 11: August 7 - August 13
  • Get another ~500 token story for guj-hin and improve WER.
    • Target WER <=25%
  • Regression testing for mar-guj pair
  • Evaluate test results, make the required changes, run tests again
  • User acceptance testing, gisting evaluation.
  • Mar-guj pair ready for or close to trunk and guj-hin improved. (Deliverable #2 and #3)
Week 12: August 14 - August 21
  • Regression testing for all three pairs
  • Discuss with mentor any final changes that must be made.
  • Documentation, final release of all three pairs.
  • Detailed analysis of what further improvements could be made to the pairs, as a reference for future Apertium contributors.

Skills

I am a second-year student pursuing a BTech and an MS by research in Computer Science at the International Institute of Information Technology, Hyderabad. I have a good knowledge of databases, computer programming, data structures, algorithms and artificial intelligence, and I am very comfortable with Python, XML, bash scripting and C++. I am a coding enthusiast and can stay focused for long hours when coding. I have gone through the Machine Translation course mentioned on the wiki and experimented with the mar-hin pair to get the gist of how machine translation works at Apertium. I have contributed to Mozilla in the past and can work with large files. On the linguistic side, I can read and write Hindi very well. I am a native Gujarati speaker and can converse, read and write in Marathi well.

Non GSoC Commitment

I don’t have many other commitments this summer and can spend ~40 hours per week on this project. My college term starts around the end of July, but I will still be able to dedicate at least 30 hours per week, maintaining an average of at least 35 hours per week over the whole summer.