User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi
Project Progress can be seen here
- 1 Contact Information
- 2 Why I am interested in Apertium
- 3 Which of the published tasks are you interested in? What do you plan to do?
- 4 My Proposal
- 4.1 Mentors/Experienced members in Contact
- 4.2 Brief of deliverables
- 4.3 Why Google and Apertium should sponsor it
- 4.4 How and who it will benefit in society
- 4.5 Google Translate : Analysis and comparison
- 4.6 Implementation choices
- 4.7 Current state of dictionaries
- 4.8 Resources
- 4.9 Workplan
- 5 Skills
- 6 Coding challenge
- 7 Non-Summer-of-Code plans for the Summer
Contact Information
Name: Priyank Modi
Current Designation: Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (completing 6th semester/3rd year in April '20) and a Teaching Assistant for the Linguistics courses (listed in skills)
Timezone: GMT +0530 hrs
Why I am interested in Apertium
Apertium is an open-source, rule-based machine translation system. I am an undergraduate researcher at the LTRC lab in IIIT-H, currently working on understanding the nuances of Indian languages and developing systems that improve our analysis of them. Machine translation interests me because it is a complex problem with a very important application and, despite having been a recognized problem for years, it is still considered achievable only through human involvement.
Translating data into other languages, especially low-resource languages, gives the speakers of those languages access to valuable data and can help in several domains, such as education, news and the judiciary. The dictionaries made in the process are crucial for low-resource languages and can even help in building tools such as spell checkers.
The most striking factor for me is that, while recent attempts at MT lean towards neural networks and deep learning, which fall short for resource-poor languages, Apertium tackles the problem with a rule-based approach. This is beneficial not only because of the level of understanding it provides, instead of simply blaming the data for poor results, but also because it can actually perform better for low-resource languages (including the pair I present in my proposal).
A tool that is rule-based and open source really helps the community with resource-poor language pairs by giving them free translations for their needs, and that is why I want to work on improving it. I want to work with Apertium and GSoC so that I can contribute to an important open-source tool while also honing my own skills, and I hope to become a part of this amazing community of developers!
Which of the published tasks are you interested in? What do you plan to do?
Adopt an unreleased language pair. I plan on developing the Hindi-Punjabi language pair in both directions, i.e. hin-pan and pan-hin. This will involve improving the monolingual dictionaries for both languages and the hin-pan bilingual dictionary, and writing suitable transfer rules to bring this pair to a releasable state.
Mentors/Experienced members in Contact
Francis Tyers, Hèctor Alòs i Font
Brief of deliverables
- A morph-based dictionary of Punjabi with ~16,000 words
- Improvements (to current rules and word pairs) and additions to the hin-pan bidictionary
- Lexical selection and transfer rules for the pair
- Translator for hin-pan and pan-hin with WER < 20%
- Morphological disambiguator for the pair
I plan on achieving coverage close to that of the hin-urd pair. Ideally, I plan on getting better results than that pair, since I feel enough data is available and, given three months of dedicated work, the predicted results are not very difficult to achieve.
Why Google and Apertium should sponsor it
- Both Hindi and Punjabi are widely spoken languages, both by number of speakers and geographic spread. Despite that, Punjabi especially has very limited online resources.
- Services like Google Translate give unsatisfactory results when it comes to translation of this pair (see Section 4.5: Google Translate: Analysis and comparison). In contrast, I was able to achieve close to human translation for some sentences using minimal rules and time (see Section 6: Coding challenge).
- I believe the Apertium architecture is suited perfectly for this pair and can replace the current state-of-the-art translator.
- This is an important project (since it adds diversity to Apertium and translation systems in general) which requires at least 2-3 months of dedicated work and will be an important resource. In addition, since it will be publicly available, it will drive research in vernacular languages, even in my own case (see Section 5: Skills).
- To my knowledge, very few attempts have been made, even inside Apertium, at translation for Indian languages, the prime one being headed by my lab, LTRC IIIT Hyderabad (not covering the hin-pan pair specifically). But even that project has been losing activity recently and has some issues in its pipeline. Since these languages have a good number of speakers but not enough easily available online resources, I think it is important to work on them, given the detailed morphological analysis that Apertium dictionaries offer in addition to providing a great translation tool.
How and who it will benefit in society
The Apertium community is strongly committed to under-resourced and minoritised/marginalised languages, and Google helps in its own way via programs like GSoC and GCI. There exists a good amount of vernacular literature and scripture which could be circulated to a larger group of people if this project is successful. It will be an important open-source dictionary resource for both languages. My larger aim with this project is to develop a chain of pairs covering Indian languages. Since Urdu and Punjabi share their roots, at least one more pair can be developed with minimum effort. My goal in this project will also be to properly document my design choices so that new Indic pairs can be taken up easily in subsequent years. I plan on working towards the Hindi-English pair next year, since by then I'll have a good understanding of the architecture a cross-language-family pair uses.
Google Translate : Analysis and comparison
Google Translate provides an interface to translate the pair in question. I have analysed the results of the translation into Punjabi from Google. The numerical results are given below (source-target). They were computed on a small set of sentences from the coding challenge; the human translation, which has been reviewed by 3 annotators, is also available in the repo:
- hin-pan: 79.23% WER
- hin-pan: 56.56% PER
- pan-hin: 82.23% WER
- pan-hin: 57.83% PER
The results are simply poor, especially for longer sentences with less frequently used words. It is rather easy to see that Google Translate doesn't try to capture case or tense in sentences; rather, it picks the most commonly used form of a given root. NER is very limited, in contrast to the Apertium module, which captures it well (because of its 3-stage transfer mechanism, I believe). The use of English as an intermediate language (which seems to be the case here as well, because some words are translated into English and then fail to convert to Punjabi, maybe because of parsing errors, as pointed out by Hèctor) causes problems such as the incorrect choice of tense in verbs, the wrong choice or disappearance of some pronouns, and the inability to handle copula constructions as well as verbal clauses (especially when other words occur between two sub-clauses). Here is an example of some of these from the Hindi test text:
Original source text (Hindi):
गिरजा आज फिर उस औरत को साथ लाया था.वही दुबली पतली मोटी-मोटी आंखें तीखी नाक और सांवले रंग वाली औरत.
Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.
Google translation (Punjabi):
ਚਰਚ ਨੇ todayਰਤ ਨੂੰ ਅੱਜ ਵਾਪਸ ਲਿਆਇਆ, ਉਹੀ ਪਤਲੀ womanਰਤ ਜਿਹੜੀ ਸੰਘਣੀ ਅੱਖਾਂ, ਤਿੱਖੀ ਨੱਕ ਅਤੇ ਹਨੇਰਾ ਰੰਗ.
The Church brought back todayਰਤ today. The same thin womanਰਤ which big-eyed, pointy nose and dark colored.
Girija got translated to Church, although it was used as a named entity in this case (Girija Ghar (house) is the Hindi and Punjabi word for church). This is a good example of how poor the NER is: even though the NE occurs in the subject (nominal) position, the parser fails to capture it.
Translation achieved using the Apertium model (Punjabi):
ਗਿਰਜਾ ਅਜ੍ਜ ਫਿਰ ਉਸ ਔਰਤ ਨੂੰ ਨਾਲ ਲਾਇਆ ਸੀ.ਉਹੀ ਦੁਬਲੀ ਪਤਲੀ ਮੋਟੀ-ਮੋਟੀ ਅੱਖਾਂ ਤਿਖੀ ਨੱਕ ਅਤੇ ਸਾਉਲੇ #ਰਂਗ ਵਾਲੀ ਔਰਤ.
Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.
It is not difficult to see that most translations provided by Google Translate lead to a change in meaning. This is due to the following:
- Google Translate relies on the n-grams available to it.
- In the case of rarely used words, it fails to translate them and, worse, fails to capture the tense.
- In complex sentences, the chunking (stages 1 and 2 in the Apertium model) fails, leading to a failure to capture the meaning and, very often, to generate any syntactically correct sentence.
Implementation choices
- 3-stage transfer: I plan on using 3-stage transfer, similar to hin-urd, since Hindi and Punjabi are (very) similar, especially in syntax and even in morphology.
- Clean and consistent practices: As mentioned in the documentation, I will try to define paradigms so that the root form of a word is always used. That is, if a word 'abc' takes certain inflections and its forms are [abd, abde, abcf], a forced pair for 'ab' will not be created. This seems obvious, but it has been done in the current dictionaries, and the actual reason behind that choice will need to be verified.
- AnnCorra dependencies: Where the same word can have different translations and parts of speech, and syntactic information is not enough, Universal Dependencies will be consulted. I plan to incorporate AnnCorra dependencies here, since they capture much more information and resolve a lot of ambiguities. link to paper
- Manual disambiguation: For verbs specifically, entries will be checked manually as much as possible, since verbs deviate from regular behavior much more than any other category.
- Transliteration: For borrowed words and named entities (at least single-word NEs), transliteration will be used. This shouldn't be a problem for this pair, since the two languages are very similar (importantly, in phonemic inventory) and have phonemic orthographies.
- WX notation: I also plan on adding the WX notation for all words in the bidix, similar to what has been done in the urd-hin bidix, so that it is easier to understand for developers who can't read the script. This will be crucial for future work involving Indian languages.
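The transliteration choice above can be illustrated with a rough sketch. It assumes only that the Unicode Devanagari block (U+0900 to U+097F) and the Gurmukhi block (U+0A00 to U+0A7F) are laid out in parallel, so that most letters can be mapped by a fixed offset. The function name `devanagari_to_gurmukhi` is hypothetical and not part of Apertium, and the sketch ignores script-specific conventions (for example addak, tippi and conjunct handling), so its output is only a first approximation that would still need review.

```python
import unicodedata

# Unicode block boundaries; the two blocks share a parallel layout.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
GURMUKHI_OFFSET = 0x0A00 - 0x0900


def devanagari_to_gurmukhi(text):
    """Naively map Devanagari characters to Gurmukhi by code point offset.

    Characters with no assigned Gurmukhi counterpart (and all
    non-Devanagari characters) are copied through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            candidate = chr(cp + GURMUKHI_OFFSET)
            # unicodedata.name returns the default ("") for unassigned code points.
            if unicodedata.name(candidate, ""):
                out.append(candidate)
                continue
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # The named entity from the example sentence in Section 4.5.
    print(devanagari_to_gurmukhi("गिरजा"))
```

A character-level mapping like this is only a fallback for unknown single-word named entities; anything present in the bilingual dictionary should still be translated through the dictionary.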
Current state of dictionaries
A released module already exists for Hindi (as part of the urd-hin pair). However, there are still a lot of anomalies in the Hindi monolingual dictionary. I've compiled a preliminary list of some of these here. I plan on manually going through all the stems and finishing this list. This will also help me understand certain design choices and will be useful during the community bonding period. Apart from these, the existing hin-pan bilingual dictionary also needs massive improvement. The first step of this project will be to revise these lists of issues and come up with a sustainable solution. It will be crucial that the changes made, especially to the Hindi monolingual dictionary, do not negatively affect the urd-hin pair (or the Hindi-Bengali, Hindi-Marathi and Hindi-Gujarati pairs, which also have a little work done).
The problems in the current dictionaries include :
- Multiple unnecessary analyses. Fix - Keep only first analysis and add others, if required, using <e r="RL">
<pardef n="गलत__adj">
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="mfn"/><s n="sp"/></r></p></e>
</pardef>
- Inflections added separately. Fix - consistently add inflections to the root in a single definition
<e><p><l>जब<s n="adv"/></l><r>ਜਦ<s n="adv"/></r></p></e> <e><p><l>जब<s n="adv"/></l><r>ਜਦੋ<s n="adv"/></r></p></e> <e><p><l>जब<s n="adv"/></l><r>ਜਦੋਂ<s n="adv"/></r></p></e>
- Multiple translations of the same word (in the bidix). While this is fine when going from right to left, it is not obvious which entry is picked during translation from left to right. Fix - add some extra flag/comment
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਅੱਖਰ<s n="n"/><s n="m"/></r></p></e> <e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>-<s n="n"/><s n="m"/></r></p></e> <e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਸ਼ਬਦਾਂਗ<s n="n"/><s n="m"/></r></p></e>
- Inconsistent/Statistically Incorrect pairs. Fix - Statistical and Manual disambiguation
<e><p><l>एवं<s n="cnjcoo"/></l><r>ਤੇ<s n="cnjcoo"/></r></p></e> <e><p><l>और<s n="cnjcoo"/></l><r>ਅਤੇ<s n="cnjcoo"/></r></p></e> <e><p><l>खुद<s n="prn"/></l><r>ਖੁਦ<s n="prn"/></r></p></e> <e><p><l>ख़ुद<s n="prn"/></l><r>ਖ਼ੁਦ<s n="prn"/></r></p></e> <e><p><l>तुम<s n="prn"/></l><r>ਤੂੰ<s n="prn"/></r></p></e>
- Incomplete morphological analysis (not all forms added). Fix - Statistical and manual fixes, involving comparisons and additions
<e><p><l>हम<s n="prn"/></l><r>ਅਸੀਂ<s n="prn"/></r></p></e>
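Some of the problems listed above, such as multiple translations of the same word in the bidix, can be found automatically rather than by reading the file. The following is a minimal sketch, assuming the bidix entries have the simple `<e><p><l>…</l><r>…</r></p></e>` shape shown in the examples. The function names are hypothetical, the `<section>` wrapper exists only to make the sample well-formed, and lemmas containing inline elements such as blanks are not handled.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def side_key(side):
    """Return (lemma, tags) for an <l> or <r> element."""
    lemma = side.text or ""
    tags = tuple(s.get("n") for s in side.findall("s"))
    return lemma, tags


def ambiguous_left_entries(bidix_xml):
    """Map each left-side (lemma, tags) to its translations when there are several."""
    root = ET.fromstring(bidix_xml)
    translations = defaultdict(list)
    for entry in root.iter("e"):
        # Entries restricted to right-to-left do not take part in
        # left-to-right translation, so they cause no ambiguity there.
        if entry.get("r") == "RL":
            continue
        pair = entry.find("p")
        if pair is None:
            continue
        left, right = pair.find("l"), pair.find("r")
        if left is None or right is None:
            continue
        translations[side_key(left)].append(right.text or "")
    return {key: targets for key, targets in translations.items() if len(targets) > 1}


# Sample built from entries quoted in the list above.
SAMPLE = """<section>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਅੱਖਰ<s n="n"/><s n="m"/></r></p></e>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਸ਼ਬਦਾਂਗ<s n="n"/><s n="m"/></r></p></e>
<e><p><l>हम<s n="prn"/></l><r>ਅਸੀਂ<s n="prn"/></r></p></e>
</section>"""

if __name__ == "__main__":
    for (lemma, tags), targets in ambiguous_left_entries(SAMPLE).items():
        print(lemma, tags, "->", targets)
```

The resulting list could serve as a starting point for deciding which entries should be kept as the default translation and which should be marked with `r="RL"`.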
Why haven't I fixed these yet? Some of these are clearly errors, but I first wanted to know why they exist and how the dictionaries have been compiled so far.
I haven't been able to get in touch with Francis lately, either through IRC or by mail, but I plan on finishing this as soon as possible (once the list is complete, I can raise an issue and work on my fork simultaneously). Hèctor also pointed out to me that this makes morphological disambiguation harder, but probably makes transfer easier, so I want to confirm this first. As suggested by Francis, I hope to fix these issues in the next PR (expected by 15 April), and I will also start learning the Urdu alphabet and checking whether these changes affect the urd-hin pair in any way.
Resources
[to be added - under confirmation for public use]
Hindi-Punjabi Dictionary - Wiktionary
Punjabi-Hindi dictionary - Glosbe (awaiting confirmation)
Punjabi Articles - Wikipedia
Punjabi Dictionary - Wiktionary
Workplan
|PHASE||DURATION||GOALS OF THE WEEK||BIDIX||WER||Coverage|
|Post Application Period||
|Community Bonding Period : Closed Categories||
|Community Bonding Period : Adjectives||
|Week ONE: Verbal Paradigms||
|Week TWO: Dictionary Expansion||
|Week THREE: Dictionary Expansion||
||~ 6,500||< 25% (hin-pan)||> 65% (hin-pan) |
|Week FOUR: More works on verbs and testing||
|Week FIVE : focus on Nouns||
|Week SIX : Expanding Dictionaries||
First Evaluation(June 29th - July 3rd)
|Week SEVEN : Expanding Dictionaries||
|Week EIGHT : Transfer rules(hin-pan)||
|Week NINE : Transfer rules||
Second Evaluation(July 27th - July 31st)
||~ 15,000||< 20% (hin-pan)||> 82% (hin-pan) |
|Week TWELVE : Testvoc||
|Week THIRTEEN : Finishing up||
PERSONAL CODE FREEZE : August 22nd
|Week FOURTEEN : Review||
Final evaluation(August 24th - August 31st)
||~ 17,000||~ 15% (hin-pan)||~ 90% (hin-pan) |
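The coverage targets in the workplan can be tracked with a small script. The sketch below uses a naive definition: the percentage of lexical units in analyser output that receive at least one analysis. It relies on the Apertium stream format, in which each lexical unit looks like `^surface/analysis$` and an unknown word is output as `^word/*word$`. The function name and the sample string are illustrative only, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed):
    """Return the percentage of lexical units that are not marked unknown."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    # Unknown words carry a single analysis that starts with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


# Illustrative analyser output: two known words and one unknown word.
SAMPLE_OUTPUT = "^जब/जब<adv>$ ^हम/हम<prn>$ ^xyz/*xyz$"

if __name__ == "__main__":
    print(f"{coverage(SAMPLE_OUTPUT):.2f}% coverage")
```

Running this over the analysed version of a large corpus after each milestone would give one comparable figure per week.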
Skills
I'm currently a third-year student (concluding in early April '20) at IIIT Hyderabad, where I'm studying Computational Linguistics. It is a dual-degree course in which we study Computer Science, Linguistics, NLP and more. I am also a teaching assistant for courses on Language Typology, Universals and Historical Linguistics this semester (and was a TA for courses on NLP last semester), so I understand linguistic concepts very well, along with the handling of linguistic data.
I've been interested in linguistics from the very beginning and, thanks to rigorous programming courses, I'm also adept at several languages and tools such as Python, C++, XML and Bash scripting. I am also skilled in writing algorithms, data structures and machine learning algorithms.
I also have a lot of experience studying and generating data, which I feel is especially important for the problem addressed in this proposal. My paper 'Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus' was recently accepted at the 16th Joint ACL-ISO Workshop on Interoperable Semantic Annotation at LREC 2020 and at WILDRE-5 (again at LREC 2020). The project described in the paper presents the largest dataset for the purpose of event detection. I am working on extending it to Punjabi using transfer learning. (ISA list of accepted papers, Link to paper)
I am also closely involved with the committee conducting the Asia-Pacific Linguistics Olympiad (which holds a camp, and mentors and prepares students for the International Linguistics Olympiad), and I help with its organisation and judging.
Due to the focused nature of our courses, I have worked on several projects, such as building anaphora resolution systems, abstractive summarizers (using pointer-generators, hierarchical attention and transformers), POS taggers, named entity recognisers, simple Q-A systems and a Linux-based shell, all of which required a working understanding of natural language processing and scripting. Some of these projects aren't available on GitHub because of privacy settings, but they can be provided if required.
I am fluent in English, Hindi and Punjabi.
Coding challenge
I've completed the coding challenge for translation from Hindi into Punjabi. You can find my work here: Coding challenge repository
Original corpus : source lang-hin
Translated output : target lang-pan
Human Translation : target lang-pan(human)
Results: Source - Hindi, Target - Punjabi (evaluator output included in the repo)
(To be checked and revised: the WER and PER before and after removing unknown words remain the same, even though the error in identifying unrecognized words was fixed after consulting @TinoDidriksen.)
WER achieved : 15.30 %
PER achieved : 15.03 %
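For readers unfamiliar with the two metrics, the following sketch shows how WER and PER are commonly defined. It is not the evaluator script used to produce the figures above, and its PER formula is one standard formulation that may differ in detail from that evaluator. Both functions assume whitespace tokenisation and a non-empty reference.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent: edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate in percent (word order is ignored)."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    print(wer("a b c d", "a c b d"))  # swapped words count as two errors
    print(per("a b c d", "a c b d"))  # word order is ignored
```

Because PER ignores word order, the gap between WER and PER gives a rough idea of how many errors are due to reordering rather than to wrong word choices.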
Currently I'm working on finishing my list of the errors I could find in the existing files (see Section 4.7: Current state of dictionaries). Once this is complete, I'll go ahead with exploring and discussing the AnnCorra scheme for covering some of these (link to paper). This scheme captures dependency relations in much more detail than UD (Universal Dependencies); see Section 4.6 for details on why it is required. While I'm more than familiar with AnnCorra, I'll have to check how to integrate it into the Apertium pipeline, if the mentors think it is useful.
Once this is complete, I'll finish compiling texts from the dumps to get statistical usage of words. I plan to finish all this before the community bonding period is midway, so that I can meet the deliverables as soon as possible and get a chance to contribute to other problems (mostly strengthening my understanding of the hin-eng pair).
Non-Summer-of-Code plans for the Summer
Since I'll be on my college summer vacation for almost the entire duration of the project, I can easily spend 35-40 hours per week on it. Since the academic schedule might vary a little due to the lockdowns for the prevention of COVID-19, I'll be starting work early and covering these problems in the post-application period. I've also kept the workload slightly heavier in the first 2 weeks to make up for any unlikely extensions in academics that might show up. Even then, I can spend around 20 hours a week in any case (note that this is a very unlikely situation, and such a period won't last more than a week, since the coursework is already underway online and is expected to be over well before the start of the project).