User:Rroychoudhury/GSoC 2020 Proposal
Latest revision as of 14:49, 31 March 2020
Personal Details
Name: Rajarshi Roychoudhury
Email address: rroychoudhury2@gmail.com
IRC: Rajarshi
Github: https://github.com/RajarshiRoychoudhury
Timezone: GMT+5:30
Current Designation: Undergraduate Researcher at Jadavpur University specialising in Natural Language Processing
About me
Open-source software I use: Apertium, TensorFlow, Ubuntu.
Professional interests: Natural Language Processing, Computational Linguistics, Sentiment Analysis, Statistical and Rule-based Machine Translation.
Why I am interested in Apertium and Machine Translation
As a Natural Language Processing researcher, I am intrigued by the diverse nature of Machine Translation. Finding a bridge between two languages is a hard task, and Machine Translation builds that bridge efficiently. Machine translation benefits society by overcoming the language barrier, and it also helps gather resources on languages that are slowly becoming extinct.
Most current translation systems, Google's among them, use statistical machine translation. Apertium, on the other hand, uses rule-based machine translation, which is particularly useful for closely related language pairs. I am also interested in Apertium because it deals with low-resource languages, which are a challenge for neural-network-based systems due to lack of data. Working with Apertium would let me explore the morphology and lexicons of the two languages I am most familiar with, and it would be interesting to develop transfer rules that build a bridge between the two so that people can use it for free.
Which of the published tasks are you interested in?
The published task I am interested in is a modified version of Adopt an unreleased language pair (http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code). I plan to work on the Bengali-Hindi pair; Bengali is my mother tongue and I am well conversant in Hindi and its grammar. In addition, I wish to incorporate some changes in the disambiguation and lexical selection steps. Below are the details of what I propose to introduce.
My Proposal
Title: IMPROVING MACHINE TRANSLATION WITH SENTIMENT TAGS IN HINDI-BANGLA PAIR
Some problems with the existing Bangla-Hindi translation pair:
Existing models for Bangla-Hindi are far from accurate. For example, when the following Bengali sentence is given to Google Translate, it produces a completely incorrect translation.
Bengali:”মেয়েটি দুধ জ্বাল দিতে দিতে বাবার সাথে কথা বলছে”
Hindi: “लड़की दूध देने के लिए अपने पिता से बात कर रही है”
whereas it should give
“दूध उबालते समय लड़की अपने पिता से बात कर रही है”.
In this case it fails to recognise “জ্বাল দিতে দিতে” as a present participle, because there is an intermediate translation into English, which loses part of the source information. In such scenarios, where we deal with closely related language pairs like Bangla-Hindi (which share Sanskrit roots), RBMT is more effective. In Apertium specifically, this can be solved in the three-stage chunking step of the Apertium pipeline.
Some problems with Apertium:
The current Apertium framework has some disadvantages, one being that source information is lost during the target generation step, which may result in incorrect translation. Considering the released Apertium pairs (in this case English-Esperanto), the sentence
“It's funny how thieves try to break into a house and get arrested.”
gets translated to “Ties amuza kiel ŝtelistoj provas rompi en domo kaj akiri arestita.”, which takes the literal translation of the sentence and means
“the thieves try to {break into the house and get arrested}” instead of “the thieves try to {break into the house} and {get arrested}”.
To incorporate this information from the source language, we need a sense of what the sentence tries to convey, which is impossible when only the grammatical inflections of the lexicons are stored and the rest of the source information is lost during target generation. Some additional information about the words needs to be preserved.
Solution
The technique I will use is sentiment analysis at word level. The sentiment polarity of each word (one of three polarities: positive, negative, neutral) will be stored as a tag on the lexicon entries in the monolingual dictionary.
This sentiment tag will be used in:
- POS disambiguation: lexical selection will have the additional selection criterion of sentiment, which will help disambiguate lexicons by generating rules based on patterns that incorporate the sentiment of words (.lrx file).
- Chunking: the sentiment of words will mainly be used in the .t1x file, which identifies words and groups of words that may need their order altered or tags added. Order altering informed by sentiment can solve the translation problem specified in the section above. Patterns can be generated by combining grammatical annotations with the sentiment of words, and suitable reordering/tag removal can then be done.
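As a sketch of how such a rule might look, here is a hypothetical lexical-selection rule in .lrx style. The sent_neg tag, its position in the tag string, and the lemma choices are assumptions made for this proposal, not existing apertium-lex-tools conventions:

```xml
<rules>
  <!-- Hypothetical rule: when a verb follows a noun carrying an
       (assumed) negative-sentiment tag, select the "boil" reading
       of the verb. Tag names and lemmas are illustrative only. -->
  <rule>
    <match tags="n.*.sent_neg"/>
    <match tags="vblex.*">
      <select lemma="उबालना"/>
    </match>
  </rule>
</rules>
```

A rule of this shape would let the sentiment tag act as extra context during lexical selection, alongside the usual part-of-speech pattern.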
Going with the mistranslation example above, the sentiment tag combined with other tags can be used to form patterns, and corresponding rules that output the correct lexicons and reordering can render the correct translation.

How to do sentiment analysis at word level?

There are two ways in which we can analyse sentiment at word level:
- The linguist classifies the word according to the sentiment and adds the sentiment tags manually
- A neural network classifies the word
This paper (https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf) is an example of using neural networks for text classification at word level. Recent studies show that this method is very effective in predicting the sentiment polarity of a word.
Since Apertium deals with low-resource languages, a huge exhaustive corpus is not available. However, for any language the set of unique characters is well defined (it is kept inside the <alphabet> tag of the monolingual dictionary). We can learn character embedding weights for each of these characters and treat words as sequences of characters. These sequences of vectors can be fed into a recurrent neural network, and we can classify the words by sentiment. The result can be stored in a file, so the end result will be independent of neural networks. Entries in the monolingual dictionaries will be modified by incorporating the determined sentiment as a tag. This method achieves good accuracy even for a small corpus of 8,000 words. For this project in particular, I will use SentiWordNet Hindi (https://amitavadas.com/sentiwordnet.php), where sentiment-annotated data is already present; it has ~8,000 words each for Bengali and Hindi and is hence a good resource. The code for sentiment prediction using character embeddings with neural networks is given below.
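A minimal, dependency-free sketch of the character-level idea is given below. The alphabet, training words, and polarity labels are toy assumptions; a real implementation would train an RNN over learned character embeddings (e.g. in TensorFlow) on the SentiWordNet data, but the pipeline — map characters from the <alphabet> set to indices, build per-word feature vectors, train a classifier, emit a sent_pos/sent_neg tag — is the same:

```python
# Toy word-level sentiment classifier over character features.
# Stands in for the RNN-over-character-embeddings approach described
# above; the alphabet and training data here are illustrative only.

def char_vector(word, alphabet):
    """Bag-of-characters vector: count of each alphabet character."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vec = [0.0] * len(alphabet)
    for ch in word:
        if ch in index:
            vec[index[ch]] += 1.0
    return vec

def train_perceptron(samples, alphabet, epochs=50):
    """samples: list of (word, label), label +1 (positive) or -1 (negative)."""
    w = [0.0] * len(alphabet)
    b = 0.0
    for _ in range(epochs):
        for word, label in samples:
            x = char_vector(word, alphabet)
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * score <= 0:  # misclassified: perceptron update
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
    return w, b

def predict(word, alphabet, w, b):
    """Return the sentiment tag to store in the monolingual dictionary."""
    x = char_vector(word, alphabet)
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "sent_pos" if score > 0 else "sent_neg"

if __name__ == "__main__":
    alphabet = "abcdefghijklmnopqrstuvwxyz"  # stand-in for <alphabet> contents
    samples = [("good", 1), ("great", 1), ("nice", 1),
               ("bad", -1), ("awful", -1), ("poor", -1)]
    w, b = train_perceptron(samples, alphabet)
    for word, _ in samples:
        # After training, each training word gets its correct tag.
        print(word, predict(word, alphabet, w, b))
```

The predicted tags would then be written out once and merged into the dictionary entries, so the translation pipeline itself never depends on the classifier at run time.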
Workplan
The ultimate goal is to have multi-purpose transducers for a variety of Indic languages. These can then be paired for X→Y translation with the addition of a constraint grammar (CG) for language X and transfer rules / a dictionary for the pair X→Y.
Current Scenario:
- Hindi
- Number of stems: 37,833
- Paradigms: 101
- Coverage: ~83.1%
- Bengali
- Number of stems: 8230
- Paradigms: 137
- Coverage: ~74%
- Resources:
  - In Apertium, we have:
    - the monolingual Bengali dictionary from apertium-bn-en (Bengali-English)
    - the monolingual Hindi dictionary from apertium-hin (Apertium-Hindi), which will act as the monolingual dictionary for apertium-ben-hin (Apertium-Bangla-Hindi)
  - For Bengali and Hindi we have SentiWordNet, which gives sentiment-annotated data for ~9,000 words in each language; this will act as training data for sentiment classification, and these words will also be incorporated in the dictionaries.
  - Besides this, Wikipedia dumps and online text resources are also available. A university professor of linguistics has agreed to review my work.
- Calculation of word error rate (WER):
  - This will be calculated using random Wikipedia texts and online text available from various sources. My plan is to collect a good amount of resources for both languages, build dictionaries on that basis, and calculate coverage/WER against random Wikipedia texts.
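The WER computation itself is standard: word-level edit distance between the system output and a reference translation, divided by the number of reference words. A small sketch (the example sentences are placeholders, not project data):

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        curr = [i]
        for j, hw in enumerate(hyp, 1):
            cost = 0 if rw == hw else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(ref)

if __name__ == "__main__":
    # One word dropped out of six -> WER of 1/6.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

The same tokenised Wikipedia texts can also feed the coverage figure (the fraction of tokens the morphological analyser recognises), so one evaluation harness serves both metrics.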
Detailed Work Plan

Week | Dates | Goals | Bidix | WER | Coverage
---|---|---|---|---|---
Post-application | May 4 - May 31 | Revise the grammar of Bengali and Hindi; study the documentation and Apertium workflow in detail; study the existing paradigms and lexicons in the monolingual dictionaries and make paradigms combining noun and postposition for accurate Bengali-to-Hindi translation (for example, the proper-noun paradigm has nominative, objective and genitive in Bengali, but no such inflection in the existing Hindi dictionary); collect resources for both languages | 325 (current) | 43% (current) | |
1 | 1 June - 7 June | Determine sentiment of existing lexicons; make paradigms to incorporate words like পরেছিল, পড়েছিল; incorporate sentiment tags in existing and new lexicons; expand the bidix | ~600 | | |
2 | 8 June - 15 June | Start making lexical selection rules (ben); start making transfer rules (ben-hin), e.g. treating locative, genitive, nominative and accusative nouns, proper nouns ("বাগানে আছে" - "बगीचे में हैं"), and gender-verb agreement; expand the bidix | ~1,400 | | |
3 | 16 June - 23 June | Expand lexical selection rules (ben); expand transfer rules (ben-hin), working on the .t1x file (identify and modify transfer rules for cases where sentiment tags are required); expand the bidix; documentation | ~2,200 | | |
4 | 24 June - 30 June | Expand lexical selection rules (ben); expand interchunk rules (ben-hin), working on the .t2x file (three-word transfer); expand the bidix; test on resources (ben-hin translation) | ~3,000 | | |
5 | 1 July - 8 July | Expand lexical selection rules (ben); expand postchunk rules (ben-hin), working on the .t3x file; expand the bidix; test by translating Bengali to Hindi; documentation. First evaluation (3 July) | ~4,000 | <38% | ~78% (ben), ~87% (hin)
6 | 8 July - 15 July | Determine sentiment of existing lexicons in Hindi; incorporate sentiment tags in existing and new lexicons; expand the bidix | ~5,000 | | |
7 | 16 July - 23 July | Start making lexical selection rules (hin); start making transfer rules (hin-ben), especially rules for patterns like noun+postposition → noun (loc/gen/obj/nom), e.g. "बगीचे में हैं" - "বাগানে আছে"; expand the bidix | ~6,500 | | |
8 | 24 July - 31 July | Expand lexical selection rules (hin); expand transfer rules (hin-ben), working on the .t1x file (transfer rules using sentiment analysis for ambiguous/non-traditional cases); expand the bidix; documentation. Second evaluation (31 July) | ~7,500 | <33% | ~82% (ben), ~90% (hin)
9 | 1 August - 8 August | Expand lexical selection rules (hin); expand transfer rules (hin-ben), working on the .t2x file; expand the bidix; test on resources (hin-ben translation); documentation | ~9,000 | | |
10 | 9 August - 16 August | Expand lexical selection rules (hin); expand transfer rules (hin-ben), working on the .t3x file; expand the bidix; test by translating Hindi to Bengali | ~9,500 | | |
11 | 17 August - 23 August | Add constraint grammar to apertium-ben based on sentiment analysis; add constraint grammar to apertium-hin based on sentiment analysis; expand the bidix | ~10,500 | | |
12 | 24 August - 31 August | Finish pending tasks; testing; documentation. Final evaluation (31 August) | ~10,500 | <28% | ~87% (ben), ~90% (hin)
List your skills and give evidence of your qualifications
I was part of the research team that worked on "Sentiment Analysis on word level based on character embedding" at my university, and I have done 2-3 projects on Statistical Machine Translation. I have taught Bengali, my mother tongue, as a teaching assistant in my high school. I learnt Hindi in high school for 6 years, and I am very familiar with the grammatical nuances of both languages. Besides that, I am familiar with Python and XML. Attached is my CV (https://drive.google.com/open?id=0BwLpcVkeJcn4Tmx6UjNKTmRqWF9Td0hGVGRKNjJ1TUtmaUFB).
Coding Challenge
- Sentiment Analysis
- The main part of the coding challenge was creating a neural-network-based program for character-level sentiment analysis. The characters were encoded into vectors and used as embedding weights for training. These were fed into sequence models such as an LSTM and an RNN, and also trained on a convolutional neural network.
- Coding Challenge for selected topic
- Install Apertium (see Minimal installation from SVN)
- Go through the HOWTO
- Wrote paradigms in monolingual Bengali dictionary for words like “রয়েছে খেয়েছে”
- Added words to the monolingual Bengali and Hindi dictionaries
- Wrote intrachunk rules (ben-hin.t1x hin-ben.t1x)
- Wrote Constraint Grammar rules (ben.rlx, hin.rlx)
The code can be found at https://github.com/RajarshiRoychoudhury/apertium-bn-hi
Why should Google and Apertium sponsor it?
This project aims for a better translation system than the current ones, and it will help many people, since these are very widely spoken languages. Moreover, the technique mentioned here will be tried for the first time.
List all the non-Summer-of-Code plans you have for summer
I have no non-Summer-of-Code plans for the time being. If any rescheduling of exams occurs, I will notify my mentors; I have also kept a week of buffer to deal with that. I can give 50+ hours a week in general.