User:Raveesh/Application
Latest revision as of 22:05, 21 May 2014
== Contact Information ==
* Name: Raveesh Motlani
* E-mail address: raveesh.motlani@gmail.com, raveesh.motlani@students.iiit.ac.in
* Phone number: +91-9703248000
== Why is it you are interested in machine translation? ==
We live in an age of worldwide information exchange, and one of its biggest challenges is sharing and understanding knowledge written in different languages. This is where machine translation comes into the picture.

As a student of Computational Linguistics, I understand that machine translation is an area where every concept of CL is integrated to create a system that can successfully translate from one language to another, reducing human effort and making content accessible in places where there are no speakers of the source language. The challenge is very interesting in itself, and its applications can be seen in every aspect of life and technology today.
Also, I’ve been working on building a Hindi-English machine translation system based on Cunie.
== Why is it that you are interested in the Apertium project? ==
I have been working on a machine translation project for the past 6 months, so I could relate to the project “Bring a released language pair up to state-of-the-art quality (Hindi-English)”. My own work is on a hybrid system (a statistical and example-based hybrid model). While looking into this project, which uses a rule-based model, I observed that a few changes to the dictionaries and transfer rules could bring about large changes in the quality of the output. This got me interested in transfer-rule-based machine translation. With the help of the Apertium team, I have understood the work involved and wish to contribute to this project.

Apart from the technical interest in this particular project, it is appealing because Apertium is open source and open content. The developers of Apertium are very knowledgeable and helpful, and I believe working with them would be a great experience.
== Benefit to Society ==
Hindi is spoken and understood by a vast population. Around 4.46 per cent of the world's population are native Hindi speakers[1], and only 20.68 per cent of this Hindi-speaking population can understand English. The world is undergoing globalisation arising from the interchange of world views, products, ideas and other aspects of culture, and a language barrier is one of the biggest obstacles in this scenario.

A properly working Hindi-English translator will make it easier for a non-Hindi speaker to understand work by a Hindi speaker and vice versa. For practical purposes, it will also help international business firms deal with local Indian firms, and will increase the productivity of Hindi-speaking employees working for international firms. We do not have a fully working translator for the Hindi-English language pair; if this system can be deployed with good accuracy, it will benefit society greatly.
== Which of the published tasks are you interested in? What do you plan to do? ==
The project idea that I would like to work on is “Bring a released language pair up to state-of-the-art quality”. I would like to work on the Hindi-English language pair.

Some work has already been done for this language pair, which currently lies in Apertium's incubator. I would like to make it ready for release by the end of the coding period.

Firstly, I plan to expand the dictionaries with the help of Shabdanjali (36,000 entries, 88% coverage) and thereby raise the coverage of the Apertium dictionaries. There are large corpora available for Hindi (Johns Hopkins University, Charles University, Wikipedia, BBC, etc.); I will go through this data and select suitable texts. Using it, I will write more transfer grammar rules, lexical selection rules and disambiguation rules, and improve the translation quality. Another aim is to remove all unanalysed/unknown-word symbols (@, #, *) from the output.

I have already become quite familiar with the Apertium framework while working on the coding challenge for the Hindi-English MT system. I added support for many verbs, nouns and adjectives to both the monolingual and bilingual dictionaries, and went through a lot of Apertium documentation on writing transfer grammar rules, monodix basics and testing the dictionaries. I think I will be able to grasp the remaining concepts required for this project before the coding period starts and develop a system that can be released by Apertium.
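For illustration, dictionary expansion means adding entries like the following to the monolingual and bilingual dictionaries. The format is Apertium's standard .dix XML; the lemma and paradigm name here are illustrative placeholders, not entries taken from the actual pair:

```xml
<!-- English monodix: a noun attached to an existing paradigm
     (lemma and paradigm name are examples only) -->
<e lm="house"><i>house</i><par n="house__n"/></e>

<!-- Bilingual dictionary: maps Hindi घर to English house as a noun -->
<e><p><l>घर<s n="n"/></l><r>house<s n="n"/></r></p></e>
```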
== Coding Challenge ==
=== Tasks completed ===
* Set up the working environment (installation and configuration).
* Picked four articles of about 500 words each.
* Kept two articles for development and post-edited them for reference translations.
* Tested the current system on this development text.
* Improved the current system by:
** adding unknown tokens from the text to the bilingual dictionary;
** adding entries for these tokens to the Hindi and English monolingual dictionaries;
** adding lexical selection rules, disambiguation rules and transfer grammar rules.
* Calculated the improvement based on WER/PER scores.
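As a rough sketch of how these scores are computed (this is a generic implementation for illustration, not Apertium's own evaluation tooling such as apertium-eval-translator):

```python
from collections import Counter

def wer(ref: list[str], hyp: list[str]) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    # Single-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,         # deletion
                       d[j - 1] + 1,     # insertion
                       prev + (r != h))  # substitution (0 cost on match)
            prev = cur
    return d[len(hyp)] / len(ref)

def per(ref: list[str], hyp: list[str]) -> float:
    """Position-independent Error Rate: like WER, but ignores word order."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 1 - (matches - max(0, len(hyp) - len(ref))) / len(ref)
```

Both scores are reported here as percentages; lower is better, so "Improvement" is the before/after difference.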
=== Results ===
Development articles:
{| class="wikitable"
! Article !! Metric !! Before !! After !! Improvement
|-
| pizza || WER || 97% || 82% || 15%
|-
| pizza || PER || 79% || 47% || 32%
|-
| blog2 || WER || 69% || ||
|-
| blog2 || PER || 67% || ||
|}
Test articles:
{| class="wikitable"
! Article !! Metric !! Before !! After !! Improvement
|-
| gravity || WER || 70% || ||
|-
| gravity || PER || 68% || ||
|-
| forbes || WER || 96% || ||
|-
| forbes || PER || 77% || ||
|}
'''Coverage:'''<br/>
Coverage of the system was calculated on a Wikipedia corpus, with the following results:
* Total tokens in the corpus: 20,526,525
* Tokens identified: 17,199,251
* Coverage: 83.79%
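The coverage figure above is simply the share of corpus tokens the analyser recognised. A minimal sketch of the computation (the function names are mine; the token-counting variant assumes Apertium's convention of prefixing unknown words with '*' in the output):

```python
def coverage_from_counts(identified: int, total: int) -> float:
    """Naive coverage: percentage of tokens the analyser recognised."""
    return 100.0 * identified / total

def coverage(apertium_output: str) -> float:
    """Estimate coverage from translated text by counting tokens
    that carry the '*' unknown-word marker."""
    tokens = apertium_output.split()
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t.startswith("*"))
    return coverage_from_counts(len(tokens) - unknown, len(tokens))

# The reported figure reproduces from the raw counts:
# coverage_from_counts(17199251, 20526525) is approximately 83.79
```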
== Workplan ==
{| class="wikitable"
! week !! dates !! goals !! eval !! accomplishments !! notes
|-
| post-application period || 22 March - 20 April || || || ||
|-
| community bonding period || 21 April - 19 May || || || ||
|-
| 1 || 19 - 24 May || Manually explore and clean the dictionaries for duplicate entries and incorrect entries/paradigms for open-class words. || Increase in coverage of 1% over Wikipedia (>= 84.7%)<br/>Decrease in error rate of 2% over test corpora<br/>Testvoc clean in all classes. || ||
|-
| 2 || 25 - 31 May || || || ||
|-
| 3 || 1 - 7 June || || || ||
|-
| 4 || 8 - 14 June || || || ||
|-
| Deliverable #1 || 15 June || Dictionary, corpus, post-edits, coverage and evaluation results. || || ||
|-
| 5 || 15 - 21 June || || || ||
|-
| midterm eval || 23 - 27 June || || || ||
|-
| 6 || 29 June - 5 July || || || ||
|-
| 7 || 6 - 12 July || || || ||
|-
| 8 || 13 - 19 July || || || ||
|-
| Deliverable #2 || 20 July || Report of post-edit analysis, test results of experimentation on rules (change in WER/PER), expanded dictionaries and coverage. || || ||
|-
| 9 || 20 - 26 July || || || ||
|-
| 10 || 27 July - 2 August || Start looking into Constraint Grammar and disambiguation rules. || || ||
|-
| 11 || 3 - 10 August || || || ||
|-
| 12 || 11 - 18 August || || || ||
|-
| pencils-down week, final evaluation || 18 August - 22 August || || || ||
|}
== Skills ==
I am a junior (3rd-year) student at the International Institute of Information Technology, Hyderabad, pursuing a BTech in Computer Science and an MS by Research in Computational Linguistics. I have completed the following courses: Data Structures, Algorithms, Computer Networks, Operating Systems, Computational Linguistics, Artificial Intelligence, Pattern Recognition and Natural Language Processing. I am doing my Honors project on “Building a Hybrid System of Hindi-English Machine Translation” under Dr. Dipti Misra Sharma at LTRC, IIIT-H. I am comfortable in C and Python. As part of my NLP course, I developed a simple k-NN-based POS tagger for Hindi, English, Tamil and Telugu, and a parser for Hindi, in Python. I am multilingual: Hindi is my native language, Sindhi my mother tongue, and I also speak English and French.
== Non GSoC Commitment ==
I don’t have many other commitments between 19 May and 18 August 2014, and can easily spend 40 hours a week on this project.

== References ==

[[Category:GSoC 2014 Student proposals|Raveesh]]