Difference between revisions of "User:Raveesh/Application"

Latest revision as of 22:05, 21 May 2014

Contact Information

  • Name: Raveesh Motlani
  • E-mail address: raveesh.motlani@gmail.com , raveesh.motlani@students.iiit.ac.in
  • Phone Number : +91-9703248000

Why is it you are interested in machine translation?

We live in an age of information exchange across the whole world. One of its biggest challenges is sharing and understanding knowledge in different languages. This is where machine translation comes into the picture.
I am a student of Computational Linguistics, and I see machine translation as the area where every concept of CL comes together to build a system that can translate from one language to another. Such a system reduces human effort and can be used in places where there are no speakers of the language. The challenge itself is very interesting, and its applications can be seen in every aspect of life and technology today.
Also, I’ve been working on building a Hindi-English machine translation system based on Cunei.

Why is it that you are interested in the Apertium project?

I have been working on a machine translation project for the past 6 months, so I could relate to the project “Bring a released language pair up to state-of-the-art quality (Hindi-English)”. My own work has been on a hybrid system (a statistical and example-based hybrid model). While looking into this project, which uses a rule-based model, I observed that a few changes to the dictionaries and transfer rules could change the quality of the output considerably. This got me interested in transfer-rule-based machine translation. With the help of the Apertium team, I have understood the work and wish to contribute to this project.
Apart from my technical interest in this particular project, Apertium appeals to me because it is open source and open content. The developers of Apertium are very knowledgeable and helpful, and I believe working with them would be a great experience.

Benefit to the Society

Hindi is spoken and understood by a vast population. Native Hindi speakers make up around 4.46 per cent of the world population[1], and only 20.68 per cent of this Hindi-speaking population can understand English. The world is undergoing globalisation, driven by the interchange of world views, products, ideas and other aspects of culture. A language barrier would be the biggest obstacle in this scenario.
A properly working Hindi-English translator will make it easier for a non-Hindi speaker to understand the work of a Hindi speaker and vice versa. In practical terms, it will also help international business firms deal with local Indian firms, and it will increase the productivity of Hindi-speaking employees working for international firms. We do not yet have a completely working translator for the Hindi-English language pair. If this system can be deployed with good accuracy, it will benefit society greatly.

Which of the published tasks are you interested in? What do you plan to do?

The project idea that I would like to work on is “Bring a released language pair up to state-of-the-art quality”, for the Hindi-English language pair.
Some work has already been done on this pair, which currently sits in the Apertium incubator. I would like to make it ready for release by the end of the coding period.
First, I plan to expand the dictionaries with the help of Shabdanjali, which has 88% coverage (36,000 entries), and so raise the coverage of the Apertium dictionaries. A large amount of corpus data is available for Hindi (Johns Hopkins University, Charles University, Wikipedia, BBC, etc.). I will go through the data and select the suitable parts. Using this data, I’ll write more transfer grammar rules, lexical selection rules and disambiguation rules to improve the translation. Another aim is to remove all unanalysed/unknown-word markers (@, #, *) from the output.
I have already become quite familiar with the Apertium framework while working on the coding challenge for the Hindi-English MT system. I added many verbs, nouns and adjectives to both the monolingual and bilingual dictionaries. I also read much of the Apertium documentation on writing transfer grammar rules, monodix basics, and testing the dictionaries. I think I will be able to grasp the remaining concepts required for this project before the coding period starts and develop a system that can be released by Apertium.
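
To make the aim of removing (@, #, *) markers concrete, here is a minimal sketch of how marked tokens could be listed from translated output. The meanings attached to the markers follow the usual Apertium convention as I understand it and are an assumption here, as are the function name and the sample sentence.

```python
import re
from collections import Counter

# Assumed meaning of each marker in Apertium output (not stated in this proposal).
MARKERS = {
    "*": "unknown to the analyser",
    "@": "no bilingual dictionary entry",
    "#": "generation failure",
}

# A marker only counts when it starts a token, e.g. "*pizza" or "#eat".
MARKED_TOKEN = re.compile(r"(?<!\S)([*@#])(\S+)")


def find_marked_tokens(text):
    """Return a Counter of (marker, token) pairs found in translated text."""
    return Counter((m.group(1), m.group(2)) for m in MARKED_TOKEN.finditer(text))


if __name__ == "__main__":
    # Hypothetical output sentence, for illustration only.
    sample = "I like *pizza and @khana is #eat daily"
    for (marker, token), count in find_marked_tokens(sample).most_common():
        print(f"{marker}{token}\t{count}\t{MARKERS[marker]}")
```

Sorting the result by frequency gives a rough priority list: the most frequent marked tokens are the ones to fix first in the dictionaries or rules.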

Coding Challenge

Tasks completed

  • Set up the working environment (installation and configuration).
  • Pick four articles of about 500 words each.
  • Keep two articles for development and post-edit them for reference translation.
  • Test the current system on this development text.
  • Improve the current system by:
    • Adding unknown tokens from the text to the bilingual dictionary.
    • Adding entries for these tokens in Hindi and English monolingual dictionaries.
    • Adding lexical selection rules, disambiguation and transfer grammar rules.
    • Calculate the improvement based on WER/PER scores.
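
The WER/PER step above can be sketched as follows. This is only an illustration under stated assumptions: WER is taken as the word-level edit distance divided by the reference length, and PER uses one common position-independent definition based on bag-of-words overlap. The evaluation tool actually used for the coding challenge may define PER slightly differently, and the function names are mine.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate (assumed definition, ignores word order)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()   # post-edited translation
    hypothesis = "the cat sat on mat".split()      # system output
    print(f"WER = {wer(reference, hypothesis):.2%}")
    print(f"PER = {per(reference, hypothesis):.2%}")
```

Here the reference is the post-edited translation and the hypothesis is the raw system output, so both scores fall as the system output moves closer to the post-edit.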

Results

Development articles:

Article   Metric   Before   After   Improvement
pizza     WER      97%      82%     15%
pizza     PER      79%      47%     32%
blog2     WER      -        69%     -
blog2     PER      -        67%     -

Test articles:

Article   Metric   Before   After   Improvement
gravity   WER      -        70%     -
gravity   PER      -        68%     -
forbes    WER      -        96%     -
forbes    PER      -        77%     -

Coverage:
The coverage of the system was calculated on a Wikipedia corpus, with the following results:
Total tokens in the corpus: 20526525
Tokens identified: 17199251
Coverage = 83.79%
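
As an illustration of how such a figure can be obtained, here is a minimal sketch that counts known and unknown tokens in analyser output. It assumes the Apertium stream format, in which each token appears as ^surface/analyses$ and an unknown word is marked with a leading asterisk. The function names and the sample words are hypothetical, and escaped characters in the stream are not handled.

```python
import re

# Assumption: each lexical unit looks like ^surface/analysis1/analysis2$,
# and an unknown word is written as ^surface/*surface$.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def count_tokens(analysed_text):
    """Return (known, total) token counts for a string of analyser output."""
    known = total = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        if not match.group(2).startswith("*"):
            known += 1
    return known, total


def coverage(known, total):
    """Coverage as a percentage of identified tokens."""
    return 100.0 * known / total


if __name__ == "__main__":
    # Hypothetical two-token sample: one analysed word, one unknown word.
    print(count_tokens("^ghar/ghar<n><m><sg>$ ^pizza/*pizza$"))
    # Figures reported above for the Wikipedia corpus.
    print(f"{coverage(17199251, 20526525):.2f}%")
```

With the token counts reported above, the last line reproduces the 83.79% coverage figure.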

Workplan

Each entry below gives the week, its dates and its goals; the eval, accomplishments and notes columns of the plan are still empty.

Post-application period (22 March - 20 April)
  • Reduce WER as much as possible (target ~60%).
  • Run tests on other datasets/articles.
  • Read Apertium’s documentation.

Community bonding period (21 April - 19 May)
  • Study the rules of the Hindi-English language pair thoroughly.
  • Prepare a corpus.
  • Document the current coverage and state of the system.
  • Run testvoc and regression tests.

Week 1 (19 - 24 May)
  • Increase in coverage of 1% over Wikipedia (>= 84.7%).
  • Decrease in error rate of 2% over the test corpora.
  • Testvoc clean in all classes.

Week 2 (25 - 31 May)
  • Continue cleanup for closed-class words.
  • Select a corpus, clean it, and divide it into segments (test and dev).
  • Start post-editing.

Week 3 (1 - 7 June)
  • Finish post-editing.
  • Generate a frequency table of the corpus.
  • Calculate the coverage of the morphological analyser and the bilingual dictionary.

Week 4 (8 - 14 June)
  • Expand the dictionaries by adding unknown words.
  • Calculate the increase in coverage.
  • Evaluate the output (WER/PER).

Deliverable #1 (15 June)
  • Dictionary, corpus, post-edits, coverage and evaluation results.

Week 5 (15 - 21 June)
  • Expand the dictionaries by adding unknown words.
  • Calculate the increase in coverage.
  • Evaluate the output (WER/PER).

Midterm evaluation (23 - 27 June)

Week 6 (29 June - 5 July)
  • Analyse the post-edits and derive common rules that cover most of the sentences.
  • Test for (@, #, *)-prefixed tokens, if any, and remove them.
  • Add lexical selection rules.

Week 7 (6 - 12 July)
  • Calculate the new WER/PER.
  • Reduce PER to below 60% and document the results.
  • Start working on transfer grammar rules (t1x) using the common rules derived from the post-edit analysis.

Week 8 (13 - 19 July)
  • Continue working on t1x.
  • Test the rules and the new output, and evaluate the whole experiment.
  • Clean up the post-edits.
  • Start working on chunking (t2x).

Deliverable #2 (20 July)
  • Report of the post-edit analysis, test results of the experiments on rules (change in WER/PER), expanded dictionaries and coverage.

Week 9 (20 - 26 July)
  • Finish working on t2x.
  • Work on t3x and t4x.
  • Read up on Constraint Grammar (for implementing context-based lexical selection).

Week 10 (27 July - 2 August)
  • Start looking into Constraint Grammar and disambiguation rules.

Week 11 (3 - 10 August)
  • Continue working on CG and disambiguation rules.
  • Test on the post-edits, document the results, and compare the system output before and after the changes.

Week 12 (11 - 18 August)
  • Evaluate the test results, make the required changes, and run the tests again.
  • Open the system for beta release and have a native speaker test it.
  • Document the results.

Pencils-down week and final evaluation (18 August - 22 August)
  • Documentation, Evaluation, Refactoring
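
The frequency-table and dictionary-expansion steps planned for weeks 3 to 5 could be sketched roughly as below. This is an illustration only: it assumes whitespace tokenisation and a plain set of already-known word forms, which is a simplification of what the morphological analyser actually does. Both function names are hypothetical.

```python
from collections import Counter


def frequency_table(lines):
    """Count whitespace-separated tokens and return them most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts.most_common()


def unknown_by_frequency(table, known_words):
    """Keep only the entries whose token is not in the known-word set."""
    return [(token, count) for token, count in table if token not in known_words]


if __name__ == "__main__":
    # Hypothetical toy corpus; a real run would read the cleaned corpus file.
    corpus = ["a b a", "b a c"]
    table = frequency_table(corpus)
    print(table)
    print(unknown_by_frequency(table, known_words={"a"}))
```

Adding the unknown words in frequency order means that each new dictionary entry gives the largest possible gain in coverage.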

Skills

I am a junior (3rd-year) student at the International Institute of Information Technology, Hyderabad, pursuing a BTech in Computer Science and an MS by Research in Computational Linguistics. I have completed the following courses: Data Structures, Algorithms, Computer Networks, Operating Systems, Computational Linguistics, Artificial Intelligence, Pattern Recognition and Natural Language Processing. I am doing my Honors project on “Building a Hybrid System of Hindi-English Machine Translation” under Dr. Dipti Misra Sharma at LTRC, IIIT-H. I am comfortable in C and Python. As part of my NLP course, I developed a simple K-NN based POS tagger for Hindi, English, Tamil and Telugu, and a parser for Hindi, both in Python. I am multilingual: Hindi is my native language, Sindhi is my mother tongue, and I also speak English and French.

Non GSoC Commitment

I don’t have many other commitments between May 19th and August 18th 2014. I can easily spend 40 hours a week on this project.

References