[[GSoC 2017 Student Proposals]]
[[Category: GSoC 2017 Student Proposals]]

== Contact Info ==
'''Name:''' Agneet Chatterjee

'''E-mail:''' agneet257@gmail.com

'''IRC:''' agneet42

'''Location:''' India

'''Timezone:''' UTC+05:30
=Why is it that you are interested in machine translation?=

"Because language plays such a fundamental part in connecting each of us as thinking creatures with the world around us, the subtle nuances of language (which are different even in similar tongues, say the Latin-derived Spanish and Portuguese) actually shape how we think about the world. Learning something of how somebody else speaks from a foreign country actually helps you to understand their mindset a little."

I am interested in machine translation primarily for two reasons. First, in this age of information exchange, one of the biggest challenges is sharing and understanding knowledge across languages; machine translation exists for exactly this unified purpose, and that is what draws me to it. Second, I have a deep-rooted interest in, and hands-on experience with, natural language processing, and I hope to make a difference in the field of machine translation.
=Why is it that you are interested in the Apertium project?=

Apertium is a free/open-source machine translation platform, which means that developers from all over the world can join in and work on new language pairs to make better translation available. Apertium is built as a Unix pipeline, which is very useful for quick diagnosis and debugging, and it lets me slot additional modules in between the existing ones, for example using HFST (the Helsinki Finite-State Transducer toolkit) for morphological analysis. Furthermore, Apertium follows the rule-based machine translation (RBMT) approach, which requires no bilingual texts; this makes it possible to create translation systems for language pairs that have no texts in common, or even hardly any digitised data at all. RBMT is also largely domain-independent: rules are usually written in a domain-independent manner, so the vast majority of rules "just work" in every domain, and only a few specific cases per domain need rules written for them.
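To make the pipeline idea concrete, below is a minimal sketch of what a modes.xml entry for this pair could look like. The mode name, file names and the exact set of stages are assumptions for illustration only; the real pair would typically also include lexical selection, interchunk and postchunk stages.

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical modes.xml fragment: each <program> is one stage of the
     Unix pipeline, and the text flows through the stages in order. -->
<modes>
  <mode name="hin-ben" install="yes">
    <pipeline>
      <program name="lt-proc">                <!-- morphological analysis -->
        <file name="hin-ben.automorf.bin"/>
      </program>
      <program name="apertium-tagger -g $2">  <!-- part-of-speech disambiguation -->
        <file name="hin-ben.prob"/>
      </program>
      <program name="apertium-pretransfer"/>  <!-- split multiwords before transfer -->
      <program name="apertium-transfer">      <!-- structural transfer + bidix lookup -->
        <file name="apertium-hin-ben.hin-ben.t1x"/>
        <file name="hin-ben.t1x.bin"/>
        <file name="hin-ben.autobil.bin"/>
      </program>
      <program name="lt-proc -g">             <!-- morphological generation -->
        <file name="hin-ben.autogen.bin"/>
      </program>
    </pipeline>
  </mode>
</modes>
</pre>

Because every stage reads and writes plain text, any intermediate output can be inspected on its own, which is what makes debugging the pair so convenient.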
=Which of the published tasks are you interested in?=

Adopting the Hindi<->Bengali language pair.
==Why should Google and Apertium sponsor it?==

Hindi and Bengali are respectively the 4th and 7th most spoken languages in the world, with about 295 million and 200 million speakers, and those speakers are spread all across the globe. A Hindi-Bengali translator will not only help individual speakers but also facilitate the business being done between the regions where the two languages are spoken.

Currently there is no free, dedicated platform for machine translation between these two languages; the main option is Google Translate, which has its own limitations:

# It is not available offline, and is therefore less accessible.
# It is not open source, so not everybody can contribute.

Apertium does not suffer from either of these issues, and that is what makes it a suitable development ground for this (or any other) language pair. Furthermore, a Hindi-Bengali pair will make it easier to build pairs for languages similar to Bengali, such as Hindi-Assamese and Hindi-Oriya.
==How and who will benefit in society?==

The monolingual dictionaries can be used as stemmers for any Hindi or Bengali search engine, and could also be used as the basis of spell checkers. Their effect on other applications, such as anaphora resolution and question answering, can also be explored. The Hindi-Bengali translator will additionally help in translating the manuscripts that exist in large numbers in both languages, making the literary culture of each community available to the other.
==Literature Review==

Hindi and Bengali both descend from the Old Indo-Aryan family of languages and are similar in structure; they share a great deal, even though they differ in how and where corresponding words are used within a sentence. Hindi pronouns can be broadly categorised into seven types: personal, demonstrative, indefinite, relative-correlative, possessive, interrogative and reflexive. Some Hindi pronouns serve as personal, demonstrative and relative-correlative pronouns at the same time, whereas Bengali has a different pronoun for each of these uses. As the list of such Hindi pronouns is small and their uses are limited, it is possible to tell the uses apart and find their Bengali translations with a set of linguistic rules. For instance, a single Hindi pronoun may be used to point to animate nouns, to point to inanimate nouns, and as a third-person personal pronoun, while Bengali has a dedicated pronoun for each of these three uses. Given such a Hindi pronoun, we have to determine its use in the sentence and translate it to the corresponding Bengali pronoun.
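One natural place to encode such context-dependent pronoun choices is Apertium's lexical selection module (apertium-lex-tools). Below is a minimal, hypothetical .lrx sketch; the lemmas, tags, weights and translations are illustrative assumptions rather than entries from the actual pair.

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical lexical selection rules: choose a Bengali translation of the
     Hindi pronoun "वह" depending on what follows it.  Tag names are assumed. -->
<rules>
  <!-- Default: treat "वह" as a third-person personal pronoun -> "সে" -->
  <rule weight="1.0">
    <match lemma="वह" tags="prn.*">
      <select lemma="সে"/>
    </match>
  </rule>
  <!-- Directly before a noun, "वह" works as a demonstrative -> "সেই" -->
  <rule weight="2.0">
    <match lemma="वह" tags="prn.*">
      <select lemma="সেই"/>
    </match>
    <match tags="n.*"/>
  </rule>
</rules>
</pre>

The heavier, more specific rule wins when its context matches, and the lighter rule acts as the fallback.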
Normally, when two stems join together, the inflectional suffix of the first stem is left unexpressed in the resulting compound word. For example, the compound "mAmA-bArI" (মামা বাড়ি, 'maternal uncle's house') comes from "mAmAr bAri" (মামার বাড়ি), where "r" is the inflectional suffix of the stem "mAmA"; that "r" is deleted when the compound is formed. This is called inflection deletion in compound words. So when an inflectional suffix is found at the end of a compound word, it is taken to be the inflectional suffix of the compound as a whole, not of the last stem alone. If all compound words follow this inflection deletion, we can conclude that every compound word carries exactly one inflectional suffix.
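In a monolingual dictionary this behaviour can be modelled by letting the compound entry reuse the paradigm of its final stem, so that the whole compound carries a single set of case suffixes. The sketch below is a minimal, self-contained .dix fragment; the tag names (n, sg, nom, gen, loc), the paradigm and the forms are my own assumptions, not the tags actually used in apertium-ben.

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical Bengali monodix sketch: a noun paradigm with a few case
     suffixes, reused both by a simple noun and by a compound noun. -->
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="n"   c="noun"/>
    <sdef n="sg"  c="singular"/>
    <sdef n="nom" c="nominative"/>
    <sdef n="gen" c="genitive"/>
    <sdef n="loc" c="locative"/>
  </sdefs>
  <pardefs>
    <pardef n="বাড়ি__n">
      <e><p><l/><r><s n="n"/><s n="sg"/><s n="nom"/></r></p></e>
      <e><p><l>র</l><r><s n="n"/><s n="sg"/><s n="gen"/></r></p></e>
      <e><p><l>তে</l><r><s n="n"/><s n="sg"/><s n="loc"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="বাড়ি"><i>বাড়ি</i><par n="বাড়ি__n"/></e>
    <!-- The compound inflects only at the end: মামাবাড়ি, মামাবাড়ির, মামাবাড়িতে -->
    <e lm="মামাবাড়ি"><i>মামাবাড়ি</i><par n="বাড়ি__n"/></e>
  </section>
</dictionary>
</pre>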
== Current scenario ==

At present there exist a Bengali and a Hindi monolingual dictionary and a Bengali-Hindi bidix. The coverage of the Bengali dictionary needs to be expanded. Its verb section has difficulty treating multi-word verbs, and negative forms are not well recognised: several verb forms, such as infinitives and participles, require the negative particle before the verb, while finite forms require the particle to follow the verb, in some cases as an enclitic. The bilingual dictionary has very low coverage and misses very common and important nouns.

The goal of this project is to expand both the monolingual dictionaries and the bilingual dictionary, while adding transfer (structural) rules, lexical selection rules and constraint grammar rules as the need for them appears.
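For the bilingual dictionary, each new entry is a simple left/right pairing, with an optional direction restriction where the mapping is not one-to-one. A hypothetical fragment (the words and tags are illustrative assumptions):

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical Bengali->Hindi bidix fragment: <l> is the Bengali side,
     <r> the Hindi side; r="LR" restricts an entry to one direction. -->
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="n" c="noun"/>
  </sdefs>
  <section id="main" type="standard">
    <e><p><l>ঘর<s n="n"/></l><r>घर<s n="n"/></r></p></e>
    <!-- বাড়ি also means "house": translate it to घर when going Bengali->Hindi,
         but never generate বাড়ি from घर, since ঘর is already the default. -->
    <e r="LR"><p><l>বাড়ি<s n="n"/></l><r>घर<s n="n"/></r></p></e>
  </section>
</dictionary>
</pre>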
==Solution/s==

# Hindi nominal suffixes को (ko), का (kA), से (se), पे (pe), etc. are also used with Hindi pronouns, and different uses of these pronominal suffixes have different Bengali translations. The most frequent corresponding Bengali pronominal suffixes are কে (ke), র (ra), থেকে (theke), এ (e), etc. The suffixes that have several possible translations can be disambiguated with rules (a transfer-rule sketch follows this list).
# Add a constraint grammar which deals effectively with tenses, prepositions and verbs; the CG should also handle other POS ambiguities.
# Train the tagger to solve inflection issues.
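As an illustration of the kind of rule meant in the first point, here is a minimal, hypothetical structural transfer (.t1x) sketch that turns a Hindi noun followed by the postposition से into a single Bengali noun chunk carrying an ablative case tag, which the Bengali generator would then realise as the suffix থেকে. All category, attribute and tag names here are assumptions for illustration, not the pair's actual definitions.

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<transfer default="chunk">
  <section-def-cats>
    <def-cat n="nom">
      <cat-item tags="n.*"/>
    </def-cat>
    <def-cat n="post_abl">
      <cat-item lemma="से" tags="post"/>
    </def-cat>
  </section-def-cats>
  <section-def-attrs>
    <!-- placeholder attribute; a fuller rule set would define case, number, etc. -->
    <def-attr n="a_case">
      <attr-item tags="abl"/>
    </def-attr>
  </section-def-attrs>
  <section-def-vars>
    <def-var n="tmp"/>
  </section-def-vars>
  <section-rules>
    <rule comment="noun + से  ->  one Bengali noun with an ablative case tag">
      <pattern>
        <pattern-item n="nom"/>
        <pattern-item n="post_abl"/>
      </pattern>
      <action>
        <out>
          <chunk name="nom" case="caseFirstWord">
            <tags>
              <tag><lit-tag v="SN"/></tag>
            </tags>
            <lu>
              <!-- the translated noun with <abl> appended; the postposition
                   itself is dropped, since Bengali expresses it as a suffix -->
              <clip pos="1" side="tl" part="whole"/>
              <lit-tag v="abl"/>
            </lu>
          </chunk>
        </out>
      </action>
    </rule>
  </section-rules>
</transfer>
</pre>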
==Note==

Since the nominal lexical items of a language form an open set of words, it is very difficult to include every Bengali root in the dictionary file in one go. The roots can always be added over time, which will automatically raise the accuracy of the system.
==Work Plan==

====Major Goals====

# Add rules: transfer, lexical selection, constraint grammar.
# Good WER on the hin<->ben language pair (target <= 20%).
# Target coverage ~ 70%.
# Clean testvoc.
====Post-Application Period====

# Finish the coding challenge with WER ~ 55%.
# Learn constraint grammar and lexical selection rules.
# Add to, edit and expand both the Hindi and Bengali dictionaries.
====Community Bonding Period====

# Study ways and resources which could automate significant portions of the task.
# Get monolingual and bilingual aligned corpora for further analysis.
# Learn to use the dictionaries and tools in practice.
# Prepare lists of words sorted by frequency of occurrence for the Hindi and Bengali dictionaries.
====Week 1====

# Write test scripts (making use of the existing language-pair regression and corpus tests).
# Even up noun entries in the dixes according to the frequency list. Testvoc nouns.
====Week 2====

# Continue evening up nouns. Testvoc nouns.
# Improve the morphological analyser of apertium-ben, if possible.
====Week 3====

# Even up verbs, adjectives and adverbs. Testvoc.
# Add the missing closed-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries.
====Week 4====

# Continue evening up verbs, adjectives and adverbs. Testvoc.
# Work on the Bengali-Hindi bilingual dictionary; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list.

<u>Deliverable #1: Dictionaries covering most of the words for both languages</u>
====Week 5====

# Test the output for the debug symbols (@, #, *) and remove their causes.
# Gather translation data with the use of parallel corpora.
# WER expected to go down by 10-15%.
# Add basic transfer rules (derived from post-editing) for the purpose of testing, and verify the tag definition files.
====Week 6====

# Add more transfer rules for pronouns, verbs, adverbs and adjectives.
# Work further on the bilingual dictionary.
# Work on morphological disambiguation.
====Week 7====

# Prepare a list of word sequences that frequently appear together in both languages (using the apriori algorithm).
# Add multiwords with their translations to the dictionaries (see the sketch after this list).
# Work further on transfer rules (verbs etc.).
# Clean testvoc for all classes.
# Expected results: WER ~ 30%, coverage ~ 60%.
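Multiwords are added with an explicit blank (<b/>) between their parts, both in the monolingual entry and in the bidix. The fragment below is hypothetical: the words, tags and paradigm name are assumptions, and the entries would go inside the main sections of the respective dictionaries (the monodix one reuses a noun paradigm like the one sketched earlier).

<pre>
<!-- Hypothetical monodix entry: a two-word Bengali noun that inflects only
     on its last element, reusing an existing noun paradigm. -->
<e lm="রেল স্টেশন"><i>রেল<b/>স্টেশন</i><par n="বাড়ি__n"/></e>

<!-- Matching hypothetical bidix entry pairing it with the Hindi multiword. -->
<e><p><l>রেল<b/>স্টেশন<s n="n"/></l><r>रेलवे<b/>स्टेशन<s n="n"/></r></p></e>
</pre>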
====Week 8====

# Work on morphological disambiguation.

<u>Deliverable #2: Bilingual dictionary completed and some morphological disambiguation done.</u>
====Week 9====

# Obtain hand-tagged training corpora.
# Work on the tag definition files (a sketch follows this list).
# Carry out supervised tagger training for both languages.
# Work on constraint grammar and disambiguation.
# Start with chunking.
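The tag definition (.tsx) file groups the fine-grained tags of the monolingual dictionary into the coarse tagset that apertium-tagger is trained on. A minimal, hypothetical sketch, where the label names and tag patterns are assumptions; a fuller file would also add forbid/enforce sections to rule out impossible label sequences.

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical tagger specification: map dictionary tags to coarse labels. -->
<tagger name="hin">
  <tagset>
    <def-label name="NOM">
      <tags-item tags="n.*"/>
    </def-label>
    <def-label name="VLEX">
      <tags-item tags="vblex.*"/>
    </def-label>
    <def-label name="POST" closed="true">
      <tags-item tags="post"/>
    </def-label>
  </tagset>
</tagger>
</pre>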
====Week 10====

# Work on transfer rules.
# Continue work on constraint grammar.
# Work on chunking (t2x).
# Start with post-chunking (t3x).
====Week 11====

# Thorough regression and corpus testing.
# Check the dictionaries manually to spot possible errors, followed by bugfixes (if any).
====Week 12====

# Evaluation of results and documentation.
# WER <= 20%, coverage ~ 70%.

<u>Deliverable #3: Language pair ready for, or close to, trunk</u>
=Skills and Qualifications=

I am a second-year undergraduate studying Computer Science and Technology at Jadavpur University, Kolkata, India. I have steady coding experience in C, C++, Java and Python, and I am also comfortable with XML and bash scripting. I have worked with the Python libraries TensorFlow, NumPy and OpenCV. I have an interest in and a passion for natural language processing and its related areas, and I have undertaken projects in this domain: word recognition using LSTMs [https://github.com/agneet42/Word-Spotting], AudioQA [https://github.com/agneet42/Q4AMRE] and Deep Speech. In the near future I wish to take up projects on Entailment and VisualQA.

When it comes to languages, I am a native Bengali speaker who can read and write both English and Hindi, and I have learned Spanish and French over the years.
=Coding Challenge=

To be <u>updated</u> during the post-application period: https://github.com/agneet42/Apertium-Ben-Hin

# Set up the working environment (installation and configuration).
# Added entries to both the monolingual and the bilingual dictionaries.
# Identified all the nominal suffixes of Bengali nouns; the Bengali nominal suffixes consist of classifiers, case markers and emphatic markers.
# Working on the Bengali translation of this text [http://www.unilang.org/ulrview.php?res=394,405].
=Other commitments=

I have no other commitments this summer and plan on giving around 42 hours per week to the project. My college term starts in August, after which I can still give ~30 hours per week. I do not have any vacation plans either.
=References=

# Translations of Ambiguous Hindi Pronouns to Possible Bengali Pronouns [http://www.aclweb.org/anthology/W12-5214]
# Development of a morphological analyser for Bengali [http://www.mt-archive.info/FreeRBMT-2009-Faridee.pdf]
# Bengali Noun Morphological Analyzer [http://ieeexplore.ieee.org/document/6637408/]
# Morphological Analysis of Inflecting Compound Words in Bangla [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.524.4988&rep=rep1&type=pdf]
# A long introduction to transfer rules [http://wiki.apertium.org/w/index.php?title=User:Ariessa/proposal&action=edit]
# Chunking: A full example [http://wiki.apertium.org/wiki/Chunking:_A_full_example]
# Monodix Basics [http://wiki.apertium.org/wiki/Monodix_basics]
# Finding errors in dictionaries [http://wiki.apertium.org/wiki/Finding_errors_in_dictionaries]
# Apertium: Official Documentation [https://wiki.apertium.org/w/images/d/d0/Apertium2-documentation.pdf]