User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi
Contact Information
Name: Priyank Modi
Email: priyankmodi99@gmail.com
Current Designation: Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (completing 6th semester/3rd year in April '20) and a Teaching Assistant for Linguistics courses
IRC: pmodi
Timezone: GMT +0530 hrs
Linkedin: https://www.linkedin.com/in/priyank-modi-81584b175/
Github: https://github.com/priyankmodiPM
Why I am interested in Apertium
Apertium is an open-source, rule-based machine translation system. I am an undergraduate researcher at the LTRC lab in IIIT-H, currently working on Machine Translation, and it interests me because it's a complex problem that tries to achieve something most people believe only humans can do. Translating data into other languages, especially low-resource languages, gives the speakers of those languages access to valuable data and can help in several domains, such as education, news and the judiciary. My professors often call Machine Translation NLP-complete, i.e. it uses most of the tools NLP has to offer, so anyone who learns to create good tools for MT learns most of Natural Language Processing.
Each part of Apertium's mission statement, especially its focus on low-resource languages, makes me excited to work with the project. While recent trends lean towards Neural Networks and Deep Learning, these approaches fall short when it comes to resource-poor languages. Anaphora Resolution without complex linguistic information is a challenge that I'll be tackling during this Summer of Code with Apertium.
A tool that is rule-based and open source really helps the community with resource-poor language pairs and gives them free translations for their needs, and that is why I want to work on improving it.
I want to work with Apertium and GSoC so that I can contribute to an important open-source tool while also honing my own skills, and I hope to become a part of this amazing community of developers!
Which of the published tasks are you interested in? What do you plan to do?
Adopt an unreleased language pair. I plan to develop the Hindi-Punjabi language pair in both directions, i.e. hin-pan and pan-hin. This will involve improving the monolingual dictionaries for both languages, improving the hin-pan bilingual dictionary and writing suitable transfer rules to bring this pair to a releasable state.
My Proposal
Why Google and Apertium should sponsor it
- Both Hindi and Punjabi are widely spoken languages, both by number of speakers and geographic spread. Despite that, Punjabi especially has very limited online resources.
- Services like Google Translate give unsatisfactory results when translating this pair (see Section 2.1). In contrast, I was able to achieve close to human translation for some sentences using minimal rules and time (see Section 3: Coding Challenge).
- I believe the Apertium architecture is perfectly suited for this pair and can replace the current state-of-the-art translator for it.
- This is an important project (since it adds diversity to Apertium and translation systems in general) which requires at least 2-3 months of dedicated work and can become an important resource.
How and who it will benefit in society
As mentioned above, the Apertium community is strongly committed to under-resourced and minoritised/marginalised languages, and Google helps in its own way via programs like GSoC and GCI. There exist many local cultural movements in Africa with the goal of developing language and opening to the world but they generally fail to duel on a scientific basis. This project will definitely mark a starting point or proof of concept in Machine Translation in Cameroon and will have a strongly positive impact on language development.
Google Translate : Analysis and comparison
Google Translate provides an interface to translate the pair in question. I have analysed the results of Google's translation into Punjabi. The numerical results, computed on a small set of sentences from the coding challenge, are given below (source-target). The human translation, which has been reviewed by 3 annotators, is also available in the repo.
- hin-pan: 14.0% WER
- pan-hin: 21.6% WER
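For reference, a word error rate (WER) figure like the ones above is usually computed as the word-level edit distance between the system output and a reference translation, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition, not necessarily the script used to obtain the numbers reported here; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are tokenised by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Calling `word_error_rate(human_translation, system_output)` on each sentence pair and averaging (or summing edits over the whole test set before dividing) gives a corpus-level figure; which of the two aggregation schemes is used should be stated alongside the result.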
The results are far from wonderful, especially when it comes to longer sentences with less frequently used words. Google seemingly translates using both Spanish and English as bridge languages, as can be seen, for example, from words of these two languages that appear in the final text (supposedly in Catalan) and that were not in the original Italian or Portuguese text. The use of English as an intermediate language between Romance languages causes problems known to all users, such as the translation of p2.pl verb forms with elided subject to p2.sg, the incorrect choice of past tenses in verbs and the disappearance of some pronouns. Here is an example of the last case from the Italian test text (randomly obtained):
Original text (bold mine):
altri invece ne hanno apprezzato la spontaneità, la tenacia e l'affettuosità
Google translation:
altres han apreciat la seva espontaneïtat, tenacitat i afecte
Post-edited translation:
altres n'han apreciat l'espontaneïtat, tenacitat i afecte
It should be added that, although Google translations tend to be more phraseological than the ones obtained by rules, they are also much more difficult to post-edit. The reason is that, while rule-based translation often makes evident and even expected errors, neural translation changes the text significantly: it reorders parts of the sentence, removes or adds words, changes singular to plural or plural to singular (!), and modifies expressions. Evaluating whether the meaning is the same as the original therefore takes a lot more time. This became quite clear when I post-edited both the Apertium and Google translations of the Italian and Portuguese texts.
Current state of dictionaries
A released module already exists for Hindi (as part of the urd-hin pair). However, there are still a lot of anomalies in the Hindi monolingual dictionary. I've compiled a preliminary list of some of these here[insert link]. Apart from these, the existing hin-pan bilingual dictionary also needs massive improvement. The first step of this project will be to revise these lists of issues and come up with a sustainable solution. It will be crucial that the changes made, especially to the Hindi monolingual dictionary, do not negatively affect the urd-hin pair (or the Hindi-Bengali, Hindi-Marathi and Hindi-Gujarati pairs, which also have a small amount of work done).
Resources
[to be added - under confirmation for public use]
Hindi-Punjabi Dictionary - Wiktionary
Punjabi-Hindi dictionary - Glosbe (awaiting confirmation)
Punjabi Articles - Wikipedia
Punjabi Dictionary - Wiktionary
Workplan
PHASE | DURATION | GOALS OF THE WEEK | BIDIX | WER | Coverage |
---|---|---|---|---|---|
COMMUNITY BONDING PERIOD | | | | | |
Week ONE: Closed Categories | | | | | |
Week TWO: Adjectives | | | | | |
Week THREE: Verbal Paradigms | | | ~ 1,000 | | |
Week FOUR: Dictionary Expansion | | | ~ 3,500 | | |
Week FIVE | | | ~ 5,500 | | |
Week SIX | | | ~ 7,500 | | |
Week SEVEN | | | ~ 9,000 | | |
Week EIGHT | | First Evaluation (June 29th - July 3rd) | ~ 7,500 | | |
Week NINE | | | ~ 7,500 | | |
Week TEN | | | ~ 7,500 | | |
Week ELEVEN | | | ~ 7,500 | | |
Week TWELVE | | Second Evaluation (July 27th - July 31st) | ~ 7,500 | | |
Week THIRTEEN | | | ~ 7,500 | | |
Week FOURTEEN | | | ~ 7,500 | | |
Week FIFTEEN | | | ~ 7,500 | | |
Week FIFTEEN | | Final Evaluation (August 24th - August 31st) | ~ 7,500 | | |
Skills
I'm currently a third-year student (commencing start of April '20, hopefully :D) at IIIT Hyderabad, where I'm studying Computational Linguistics. It is a dual degree course in which we study Computer Science, Linguistics, NLP and more.
I've been interested in linguistics from the very beginning and, thanks to the rigorous programming courses, I'm also adept at several languages and tools such as Python, C++, XML and Bash scripting. I'm skilled in writing algorithms, data structures and Machine Learning algorithms as well.
I also have a lot of experience studying and generating data, which I feel is essential in solving any problem, especially the one described in this proposal. My paper 'Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus' was recently accepted at the 16th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at LREC 2020. I am working on extending the same to Punjabi using transfer learning.
I am also closely involved with the committee conducting the Asia-Pacific Linguistics Olympiad (which holds a camp and mentors and prepares students for the International Linguistics Olympiad), and I help with its organisation and judging.
Due to the focused nature of our courses, I have worked on several projects, such as building Anaphora Resolution systems, Abstractive Summarizers (using pointer-generators, hierarchical attention and transformers), POS Taggers, Named Entity Recognisers, simple Q-A systems, a Linux-based shell, etc., all of which required a working understanding of Natural Language Processing scripting. Some of these projects aren't available on GitHub because of privacy settings, but they can be provided if required.
I am fluent in English, Hindi and Punjabi.
Coding challenge
I've completed the coding challenge for translation from Hindi into Punjabi. You can find my work here: Coding challenge repository
Original corpus : source lang-hin
Translated output : target lang-pan
Human Translation : target lang-pan(human)
Non-Summer-of-Code plans for the Summer
Since I'll be on my college summer vacation for almost the entire duration of the project, I can easily spend 30-40 hours per week on it. Since the academic schedule might vary a little due to the lockdowns for prevention of COVID-19, I'll start work early and cover the problems in the post-application period. I've also kept the workload slightly heavier in the first 2 weeks to make up for any unlikely, uncertain extensions in academics that might show up. Even in that case, I can spend around 20 hours a week on the project (note that this is a very unlikely situation, and even then this period won't last more than a couple of weeks, since the coursework is already underway online and is expected to be over before the start of the project).