User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi
Project progress can be seen [https://wiki.apertium.org/wiki/User:Pmodi/GSOC_2020_proposal:_Hindi-Punjabi/progress here].
== Contact Information ==

'''Name:''' Priyank Modi<br />
'''Email:''' priyankmodi99@gmail.com<br />
'''Current Designation:''' Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (completing my 6th semester/3rd year in April '20) and a Teaching Assistant for the Linguistics courses listed in the Skills section<br />
'''IRC:''' pmodi<br />
'''Timezone:''' GMT +0530 hrs<br />
'''LinkedIn:''' https://www.linkedin.com/in/priyank-modi-81584b175/ <br />
'''GitHub:''' https://github.com/priyankmodiPM <br />
'''Website:''' https://priyankmodipm.github.io/ <br />
== Why I am interested in Apertium ==

Apertium is an open-source rule-based machine translation system. As an undergraduate researcher at the LTRC lab in IIIT-H, currently working on understanding the nuances of Indian languages and developing systems which improve our analysis of them, machine translation interests me because it is a complex problem serving a very important application, and one that, despite being a recognised problem for years, is still widely considered achievable only through human involvement.

Because Apertium is free/open-source software.<br />

Because its community is strongly committed to under-resourced and minoritised/marginalised languages.<br />

Translating data into other languages, especially low-resource languages, gives the speakers of those languages access to valuable material and can help in several domains, such as education, news and the judiciary. The dictionaries built in the process are crucial for low-resource languages and can even help in building spell checkers.

Because there is a lot of good work done, and being done, in it.<br />

Because it is not only machine translation, but also free resources that can be used for other purposes: e.g. dictionaries, morphological analysers, spellcheckers, etc.<br />

The most striking factor for me is that while recent approaches to MT lean towards neural networks and deep learning, which fall short when it comes to resource-poor languages, Apertium tackles the problem with a rule-based approach. This is beneficial not only for the level of understanding it provides, instead of simply blaming data for poor results; it can actually perform better for low-resource languages (even for the pair I present in this proposal).

A rule-based, open-source tool really helps communities whose language pairs are resource-poor, giving them free translations for their needs, and that is why I want to work on improving it. I want to work with Apertium and GSoC so I can contribute to an important open-source tool while also honing my own skills, and I hope to become part of this amazing community of developers!
== Which of the published tasks are you interested in? What do you plan to do? ==

'''Adopt an unreleased language pair.''' I plan on developing the Hindi-Punjabi language pair in both directions, i.e. '''hin-pan and pan-hin'''. This will involve improving the monolingual dictionaries for both languages, improving the hin-pan bilingual dictionary and writing suitable transfer rules to bring this pair to a releasable state.
== My Proposal ==

=== Mentors/Experienced members in contact ===

Francis Tyers, Hèctor Alòs i Font

=== Brief of deliverables ===

* A morphological dictionary of Punjabi with ~16,000 words

* Improvements (to current rules and word pairs) and additions to the hin-pan bidictionary

* Lexical selection and transfer rules for the pair

* Translators for hin-pan and pan-hin with WER < 20%

* A morphological disambiguator for the pair

I plan on achieving coverage close to that of the [http://wiki.apertium.org/wiki/Hindi_and_Urdu/Work_plan_(GSOC_2014) hin-urd pair]. Ideally, I plan on getting better results than this pair, since I feel enough data is available and, given three months of dedicated work, the predicted results aren't very difficult to achieve.
=== Why Google and Apertium should sponsor it ===

* Both Hindi and Punjabi are widely spoken languages, both by number of speakers and by geographic spread. Despite that, Punjabi in particular has very limited online resources.

* Services like Google Translate give unsatisfactory results when it comes to translating this pair (see the section ''Google Translate: Analysis and comparison''). On the contrary, I was able to achieve close-to-human translation for some sentences using minimal rules and time (see the ''Coding challenge'' section).

* I believe the Apertium architecture is perfectly suited for this pair and can '''replace the current state-of-the-art translator'''.

* This is an important project (it adds diversity to Apertium and to translation systems in general) which requires at least 2-3 months of dedicated work, and it will be an important resource. In addition, since it will be publicly available, it will drive research in vernacular languages, even in my own case (see the ''Skills'' section).

* To my knowledge, very few attempts have been made, even inside Apertium, at translation for Indian languages, the prime one headed by my lab, LTRC IIIT Hyderabad (not covering the hin-pan pair specifically). Even that project has been losing activity recently and has some issues in its pipeline. Since these languages have a good number of speakers but few easily available online resources, I think it is important to work on them, given the detailed morphological analysis Apertium dictionaries offer in addition to providing a great translation tool.
=== How and who it will benefit in society ===

The Apertium community is strongly committed to under-resourced and minoritised/marginalised languages, and Google helps in its own way via programs like GSoC and GCI. There exists a good amount of vernacular literature and scripture which could be circulated to a much larger group of people if this project is successful. It will also be an important open-source dictionary resource for both languages. My larger aim with this project is to develop a chain of pairs covering Indian languages. Since Urdu and Punjabi share their roots, at least one more pair can then be developed with minimal effort. My goal in this project will also be to properly document my design choices so that new Indic pairs can be taken up easily in subsequent years. I plan on working towards the Hindi-English pair next year, since by then I will have a good understanding of the architecture a cross-language-family pair uses.
=== Google Translate : Analysis and comparison ===

Google Translate provides an interface to translate the pair in question. I have analysed the results of Google's translation into Punjabi. The numerical results, computed on a small set of sentences from the coding challenge (a human translation, reviewed by 3 annotators, is also available in the repo), are given below (source-target):

* hin-pan: 79.23% WER

* hin-pan: 56.56% PER

* pan-hin: 82.23% WER

* pan-hin: 57.83% PER

The results are simply poor, especially when it comes to longer sentences with less frequently used words. It is rather easy to see that Google Translate doesn't try to capture the case or tense in sentences, but rather picks the most commonly used form of a given root. NER is very limited, in contrast to the Apertium module, which captures it well (because of its 3-stage transfer mechanism, I believe). The use of English as an intermediate language (which seems to be the case here as well, because some words translate to English and fail to convert to Punjabi, maybe because of some errors in parsing, as pointed out by Hèctor) causes problems such as the incorrect choice of tense in verbs, the wrong choice or disappearance of some pronouns, and the inability to handle copula constructions as well as verbal clauses (especially when other words occur between two sub-clauses). Here is an example of some of these from the Hindi test text:
Original source text (Hindi):

<blockquote>
'''गिरजा आज फिर उस औरत को साथ लाया था.वही दुबली पतली मोटी-मोटी आंखें तीखी नाक और सांवले रंग वाली औरत. <br>'''

''Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.''
</blockquote>
Google translation (Punjabi):

<blockquote>
'''ਚਰਚ ਨੇ todayਰਤ ਨੂੰ ਅੱਜ ਵਾਪਸ ਲਿਆਇਆ, ਉਹੀ ਪਤਲੀ womanਰਤ ਜਿਹੜੀ ਸੰਘਣੀ ਅੱਖਾਂ, ਤਿੱਖੀ ਨੱਕ ਅਤੇ ਹਨੇਰਾ ਰੰਗ. <br>'''

''The Church brought back todayਰਤ today. The same thin womanਰਤ which big-eyed, pointy nose and dark colored.''

<br> <br>

==== Note ====

Girija got translated to Church, although it was used as a named entity in this case (''Girija Ghar'', where ''ghar'' means 'house', is the Hindi and Punjabi word for 'church'). This is a good example of how poor the NER is: even though the NE occurs in the subject (nominal) position, the parser fails to capture it.
</blockquote>
Translation achieved using the Apertium model (Punjabi):

<blockquote>
'''ਗਿਰਜਾ ਅਜ੍ਜ ਫਿਰ ਉਸ ਔਰਤ ਨੂੰ ਨਾਲ ਲਾਇਆ ਸੀ.ਉਹੀ ਦੁਬਲੀ ਪਤਲੀ ਮੋਟੀ-ਮੋਟੀ ਅੱਖਾਂ ਤਿਖੀ ਨੱਕ ਅਤੇ ਸਾਉਲੇ #ਰਂਗ ਵਾਲੀ ਔਰਤ. <br>'''

''Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.''
</blockquote>
It is not difficult to see that most translations provided by Google Translate lead to a change in meaning. This is due to the following:

* Google Translate relies on the n-grams available to it.

* In the case of rarely used words, it fails to translate them and, worse, fails to capture the tense.

* In complex sentences, the chunking (stages 1 and 2 as per the Apertium model) fails, leading to a failure to capture meaning and, very often, to generate any syntactically correct sentence at all.

It should be added that, although Google's translations tend to be more phraseological than the ones obtained by rules, they are also much more difficult to post-edit. The reason is that, while rule-based translation often makes evident and even expected errors, the neural translation significantly changes the text: reordering parts of the sentence, removing or adding words, changing singular to plural or plural to singular (!), and modifying expressions. Evaluating whether the meaning is the same as the original requires a lot more time.
=== Implementation choices ===

* '''3-stage transfer:''' I plan on using the 3-stage transfer similar to hin-urd, since Hindi and Punjabi are (very) similar, especially when it comes to syntax and even morphology.

* '''Clean and consistent practices:''' As mentioned in the doc as well, paradigms will be defined such that a word's actual root form is always used. What I mean by this is that if a word 'abc' takes certain inflections and its forms are [abd, abde, abcf], then a forced pairing for the pseudo-root 'ab' won't be created. This seems obvious, but the opposite has been done in the current dictionaries, and the actual reason behind that choice will need to be verified.

* '''AnnCorra dependencies:''' In cases where the same word can have different translations and POS, and syntactic information is not enough, universal dependencies will be sought. I plan to incorporate AnnCorra dependencies here, since these capture much more information and clear a lot of ambiguities ([http://docshare01.docshare.tips/files/20536/205364421.pdf link to paper]).

* '''Manual disambiguation:''' For verbs specifically, entries will be checked manually as much as possible, since their tendency to shift from normal behaviour is much greater than for any other category.

* '''Transliteration:''' For borrowed words and named entities (at least single-word NEs), transliteration will be used. This shouldn't be a problem for this pair, since the two languages are very similar (importantly, in phonemic inventory) and have phonemic orthographies.

* '''WX notation:''' I also plan on adding the WX notation for all words in the bidix, similar to what has been done in the urd-hin bidix, so it's easier to understand for developers who can't read the script. This will be crucial for future work involving Indian languages.
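As an illustration of the WX convention, a bidix entry could carry its WX transliteration as a comment. This entry is hypothetical and only modelled on the urd-hin style; the actual convention will follow whatever urd-hin uses:

```xml
<!-- Hypothetical hin-pan bidix entry for 'house'; the trailing comment
     gives the WX transliteration of the Hindi side so that developers
     who cannot read Devanagari/Gurmukhi can still follow the entry. -->
<e><p><l>घर<s n="n"/><s n="m"/></l><r>ਘਰ<s n="n"/><s n="m"/></r></p></e> <!-- Gara "house" -->
```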
=== Current state of dictionaries ===

A released module already exists for Hindi (as part of the urd-hin pair). However, there still exist a lot of anomalies in the Hindi monolingual dictionary. I've compiled a preliminary list of some of these [https://docs.google.com/document/d/1GmBIlGVxMinhJJVZWnVWLHTs8vvKAuxA0vCLBDOyBJY/edit?usp=sharing here]. I plan on manually going through all the stems and finishing this list. This will also help me in understanding certain choices and will be useful during the community bonding period. Apart from this, the existing hin-pan bidictionary also needs massive improvement. The first step of this project will be to revise these lists of issues and come up with a sustainable solution. It will be crucial that the changes made, especially to the Hindi monolingual dictionary, do not negatively affect the urd-hin pair (or the Hindi-Bengali, Hindi-Marathi and Hindi-Gujarati pairs, which also have a little work done on them).
The problems in the current dictionaries include:

* Multiple unnecessary analyses. '''Fix -''' keep only the first analysis and add the others, if required, using <e r="RL">

<pre>
<pardef n="गलत__adj">
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="mfn"/><s n="sp"/></r></p></e>
</pardef>
</pre>
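A sketch of this fix for the paradigm above, assuming the standard dix direction restriction where <e r="RL"> entries are compiled into the generator only: the first analysis stays bidirectional, and the remaining tag combinations stay generable without producing extra analyses.

```xml
<pardef n="गलत__adj">
  <!-- Bidirectional: the only analysis the analyser will return -->
  <e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
  <!-- Generation-only: these tag combinations can still be generated,
       but no longer show up as extra analyses -->
  <e r="RL"><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
  <e r="RL"><p><l></l><r><s n="adj"/><s n="mfn"/><s n="sp"/></r></p></e>
  <!-- ... remaining gender/number/case combinations likewise ... -->
</pardef>
```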
* Inflections added separately. '''Fix -''' stick consistently to adding inflections to the root in a single definition

<pre>
<e><p><l>जब<s n="adv"/></l><r>ਜਦ<s n="adv"/></r></p></e>
<e><p><l>जब<s n="adv"/></l><r>ਜਦੋ<s n="adv"/></r></p></e>
<e><p><l>जब<s n="adv"/></l><r>ਜਦੋਂ<s n="adv"/></r></p></e>
</pre>
* Multiple translations of the same word (in the bidix). While this is fine when going from right to left, it's not intuitive which definition is picked when translating from left to right. '''Fix -''' add some extra flag/comment

<pre>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਅੱਖਰ<s n="n"/><s n="m"/></r></p></e>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>-<s n="n"/><s n="m"/></r></p></e>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਸ਼ਬਦਾਂਗ<s n="n"/><s n="m"/></r></p></e>
</pre>
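One way to make the default explicit is via the standard dix direction marks rather than a comment (a sketch; which translation should be the hin-pan default still needs statistical confirmation):

```xml
<!-- Default: used in both directions, so hin-pan always picks it -->
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਅੱਖਰ<s n="n"/><s n="m"/></r></p></e>
<!-- Alternative: r="RL" means it is only applied right-to-left, so
     ਸ਼ਬਦਾਂਗ still maps back to अक्षर in pan-hin but is never chosen in hin-pan -->
<e r="RL"><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਸ਼ਬਦਾਂਗ<s n="n"/><s n="m"/></r></p></e>
```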
* Inconsistent/statistically incorrect pairs. '''Fix -''' statistical and manual disambiguation

<pre>
<e><p><l>एवं<s n="cnjcoo"/></l><r>ਤੇ<s n="cnjcoo"/></r></p></e>
<e><p><l>और<s n="cnjcoo"/></l><r>ਅਤੇ<s n="cnjcoo"/></r></p></e>
<e><p><l>खुद<s n="prn"/></l><r>ਖੁਦ<s n="prn"/></r></p></e>
<e><p><l>ख़ुद<s n="prn"/></l><r>ਖ਼ੁਦ<s n="prn"/></r></p></e>
<e><p><l>तुम<s n="prn"/></l><r>ਤੂੰ<s n="prn"/></r></p></e>
</pre>
* Incomplete morphological analysis (not all forms added). '''Fix -''' statistical and manual fixes, involving comparisons and additions

<pre>
<e><p><l>हम<s n="prn"/></l><r>ਅਸੀਂ<s n="prn"/></r></p></e>
</pre>
'''Why haven't I fixed these yet?''' Some of these are clearly errors, but I first wanted to know why they exist in the first place and how the dictionaries have been compiled so far. <strike>I haven't been able to get in touch with Francis lately, neither through IRC nor mail, but plan on finishing this asap (once the list is complete, I can raise an issue and work on my fork simultaneously).</strike> Hèctor also pointed out to me that this makes morphological disambiguation harder but probably makes transfer easier, so I want to confirm this first. As suggested by Francis, I'll hopefully be fixing these issues in the next PR (expected by 15 April), but I'll also start learning the Urdu alphabet and checking whether these changes affect the urd-hin pair in any way.
=== Resources ===

[to be added - under confirmation for public use] <br>
[https://hi.wiktionary.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A5%8D%E0%A4%B7%E0%A4%A8%E0%A4%B0%E0%A5%80:%E0%A4%AA%E0%A4%82%E0%A4%9C%E0%A4%BE%E0%A4%AC%E0%A5%80-%E0%A4%B9%E0%A4%BF%E0%A4%A8%E0%A5%8D%E0%A4%A6%E0%A5%80_%E0%A4%B6%E0%A4%AC%E0%A5%8D%E0%A4%A6%E0%A4%95%E0%A5%8B%E0%A4%B6_%E0%A4%85_%E0%A4%B8%E0%A5%87_%E0%A4%94 Hindi-Punjabi Dictionary - Wiktionary] <br>
[https://glosbe.com/pa/hi Punjabi-Hindi dictionary - Glosbe] (awaiting confirmation) <br>
[https://pa.wikipedia.org/wiki/%E0%A8%AE%E0%A9%81%E0%A9%B1%E0%A8%96_%E0%A8%B8%E0%A8%AB%E0%A8%BC%E0%A8%BE Punjabi Articles - Wikipedia] <br>
[https://pa.wiktionary.org/wiki/%E0%A8%AE%E0%A9%81%E0%A9%B1%E0%A8%96_%E0%A8%AA%E0%A9%B0%E0%A8%A8%E0%A8%BE Punjabi Dictionary - Wiktionary] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/April_2007/pa/ Wikidumps-punjabi 1] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/2008-06/pa/ Wikidumps-punjabi 2] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/September_2007/pa/ Wikidumps-punjabi 3] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/2008-06/hi/ Wikidumps-hindi 1] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/April_2007/hi/ Wikidumps-hindi 2] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/September_2007/hi/ Wikidumps-hindi 3]
=== Workplan ===

{| class="wikitable"
! Period
! Dates
! Tasks
! style="width: 13%" | Bidix size
! style="width: 13%" | WER
! style="width: 13%" | Coverage
|-style="background-color:#dbfedb;"
| style="text-align:center" | Post Application Period
|
* START: April 6th
* END: May 3rd
|
* List and discuss implementation choices of hin-pan bidix and urd-hin pair
|
|
|
|-style="background-color:#b3ffb3;"
| style="text-align:center" | Community Bonding Period: Closed Categories
|
* START: May 4th
* END: May 24th
|
|
|
|
|-style="background-color:#aaffaa;"
| style="text-align:center" | Community Bonding Period: Adjectives
|
* START: May 25th
|
|
|
|
|-style="background-color:#91f991;"
| style="text-align:center" | Week ONE: Verbal Paradigms
|
* START: June 1st
|
* Expanding bilingual dictionary
* Lexical selection rules for verbs
* testvoc: adj, adv
| style="text-align:center" | ~ 3,000
|
|
|-style="background-color:#82fa82;"
| style="text-align:center" | Week TWO: Dictionary Expansion
|
* START: June 8th
|
* Expanding bilingual dictionary
* Lexical selection rules
| style="text-align:center" | ~ 5,000
|
|
|-style="background-color:#64ff64;"
| style="text-align:center" | Week THREE: Dictionary Expansion
|
* START: June 15th
|
* Expanding bilingual dictionary
* Lexical selection rules
| style="text-align:center" | ~ 6,500
| style="text-align:center" | < 25% (hin-pan)
| style="text-align:center" | > 65% (hin-pan) <br> > 60% (pan-hin)
|-style="background-color:#3dff3d;"
| style="text-align:center" | Week FOUR: More work on verbs and testing
|
* START: June 15th
|
* Expanding bilingual dictionary
* Lexical selection rules
* Manual disambiguation of rules, hin-pan (src-trg)
| style="text-align:center" | ~ 7,500
|
|
|-style="background-color:#13ff13;"
| style="text-align:center" | Week FIVE: Focus on Nouns
|
* START: June 22nd
|
|
|
|
|-style="background-color:#00ee00;"
| style="text-align:center" | Week SIX: Expanding Dictionaries
|
* START: June 29th
|
* Lexical selection rules
'''First Evaluation (June 29th - July 3rd)'''
| style="text-align:center" | ~ 10,500
|
|
|-style="background-color:#00df00;"
| style="text-align:center" | Week SEVEN: Expanding Dictionaries
|
* START: July 6th
|
* Expanding bilingual dictionary
* Lexical selection rules
* Manual disambiguation of rules (pan-hin)
| style="text-align:center" | ~ 12,000
|
|
|-style="background-color:#00d700;"
| style="text-align:center" | Week EIGHT: Transfer rules (hin-pan)
|
* START: July 13th
|
* Expanding bilingual dictionary
* Lexical selection rules
* Transfer rules (hin-pan)
| style="text-align:center" | ~ 13,000
|
|
|-style="background-color:#00b600;"
| style="text-align:center" | Week NINE: Transfer rules
|
* START: July 20th
|
* Expanding bilingual dictionary
* Lexical selection rules
* Transfer rules: pan-hin
| style="text-align:center" | ~ 14,000
|
|
|-style="background-color:#009d00;"
| style="text-align:center" | Week TEN
|
* START: July 27th
|
* Expanding bilingual dictionary
* Lexical selection rules
'''Second Evaluation (July 27th - July 31st)'''
| style="text-align:center" | ~ 15,000
| style="text-align:center" | < 20% (hin-pan) <br> < 25% (pan-hin)
| style="text-align:center" | > 82% (hin-pan) <br> > 77% (pan-hin)
|-style="background-color:#008a00;"
| style="text-align:center" | Week ELEVEN
|
* START: August 3rd
|
* Expanding bilingual dictionary
* Lexical selection rules
* Disambiguation rules
* Transfer rules
| style="text-align:center" | ~ 16,000
|
|
|-style="background-color:#007e00;"
| style="text-align:center" | Week TWELVE: Testvoc
|
* START: August 10th
* END: August 16th
|
* Testvoc hin-pan
* Expanding bilingual dictionary
* Add rules, words
| style="text-align:center" | ~ 16,500
|
|
|-style="background-color:#006d00;"
| style="text-align:center" | Week THIRTEEN: Finishing up
|
* START: August 17th
* END: August 23rd
|
* Testvoc pan-hin
* Expanding bilingual dictionary
* Add rules, words
'''PERSONAL CODE FREEZE: August 22nd'''
| style="text-align:center" | ~ 17,000
|
|
|-style="background-color:#005a00;"
| style="text-align:center" | Week FOURTEEN: Review
|
* START: August 24th
* END: August 30th
|
* Review and documentation
* Expanding bilingual dictionary
* Lexical selection rules
'''Final evaluation (August 24th - August 31st)'''
| style="text-align:center" | ~ 17,000
| style="text-align:center" | ~ 15% (hin-pan) <br> < 20% (pan-hin)
| style="text-align:center" | ~ 90% (hin-pan) <br> ~ 83% (pan-hin)
|}
== Skills == |
== Skills == |
||
I'm currently a third year( |
I'm currently a third year(concluding in early April '20 ) student at IIIT Hyderabad where I'm studying Computational Linguistics. It is a dual degree course where we study Computer Science, Linguistics, NLP and more. I am also a teaching assistant for courses on Language Typology, Universals and Historical Linguistics this semester(have TA'd for courses on NLP last semester), so I understand linguistic concepts very well along with the handling of linguistic data. |
||
I've been interested in linguistics from the very beginning and due to the rigorous programming courses, I'm also adept at several programming languages like Python, C++, XML, Bash Scripting, etc. I'm skilled in writing Algorithms. Data Structures, and Machine Learning Algorithms as well. |
I've been interested in linguistics from the very beginning and due to the rigorous programming courses, I'm also adept at several programming languages like Python, C++, XML, Bash Scripting, etc. I'm skilled in writing Algorithms. Data Structures, and Machine Learning Algorithms as well. |
||
I also have a lot of experience studying and generating data, which I feel is especially important for the problem described in this proposal. My paper '''Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus''' was recently accepted at the 16th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at '''LREC 2020''' and at WILDRE-5 (also LREC 2020). The project presented in the paper provides the largest dataset for event detection in Hindi. I am working on extending it to Punjabi using transfer learning. ([https://sigsem.uvt.nl/isa16/ ISA list of accepted papers], [https://www.researchgate.net/publication/340266259_Hindi_TimeBank_An_ISO-TimeML_Annotated_Reference_Corpus Link to paper])
I am also closely involved with the committee conducting the Asia-Pacific Linguistics Olympiad (which holds a camp and mentors and prepares students for the International Linguistics Olympiad), helping with its organisation and judging.
Due to the focused nature of our courses, I have worked on several projects, such as anaphora resolution systems, abstractive summarisers (using pointer-generators, hierarchical attention and transformers), POS taggers, named entity recognisers, simple Q-A systems and a Linux-based shell, all of which required a working understanding of Natural Language Processing and scripting. Some of these projects aren't available on GitHub because of privacy settings but can be provided if required.
I am fluent in English, Hindi and Punjabi.
== Coding challenge ==
I've completed the coding challenge for translation from Hindi into Punjabi. You can find my work here: [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi Coding challenge repository] <br>
Original corpus : [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/texts/story.hin.txt source lang-hin] <br>
Translated output : [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/test_pan.txt target lang-pan] <br>
Human Translation : [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/pan_original_translation target lang-pan(human)] <br>
Results : Source - Hindi, Target - Punjabi (evaluator output included in the repo)<br>
(to be checked and revised: WER and PER remain the same before and after removing unknown words, even though the failure to identify unrecognised words was fixed after consulting @TinoDidriksen) <br>
WER achieved : 15.30 % <br>
PER achieved : 15.03 %
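For reference, the two metrics reported above follow standard definitions: WER is word-level edit distance divided by the reference length, and PER is a position-independent variant that ignores word order. The sketch below illustrates those definitions only; it is not the evaluator script actually used to produce the numbers above.

```python
# WER: word-level Levenshtein distance over the reference length.
# PER: position-independent variant (one common formulation; exact
# definitions vary between evaluation tools).
from collections import Counter

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance.
    d = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,           # deletion
                                   d[j - 1] + 1,       # insertion
                                   prev + (rw != hw))  # substitution
    return d[len(hyp)] / len(ref)

def per(reference: str, hypothesis: str) -> float:
    # 1 minus the bag-of-words overlap ratio with the reference.
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    return 1 - sum((ref & hyp).values()) / sum(ref.values())
```

Note that PER is never larger than WER for the same sentence pair, since reordering errors stop counting.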
Currently I'm working on finishing my list of the errors I could find in the existing files (see Section 4.7 : Current state of dictionaries). Once this is complete, I'll go ahead exploring and discussing the AnnCorra scheme for covering some of these ([http://docshare01.docshare.tips/files/20536/205364421.pdf link to paper]). This scheme captures dependency relations in much more detail than UD (Universal Dependencies); see Section 4.6 for details on why it's required. While I'm more than familiar with AnnCorra, I'll have to check how to integrate it into the Apertium pipeline, if the mentors think it is useful. <br>
Once this is complete, I'll finish compiling texts from the dumps to get statistics on word usage. I plan to finish all this before the community bonding period is halfway through, so that I can meet the deliverables as soon as possible and get a chance to contribute to other problems (mostly strengthening my understanding of the hin-eng pair).
== Non-Summer-of-Code plans for the Summer ==
Since my college summer vacation covers almost the entire duration of the project, I can easily spend 35-40 hours per week on it. Because the academic schedule might shift slightly due to COVID-19 lockdowns, I'll start work early and cover the listed problems in the post-application period. I've also kept the workload slightly heavier in the first two weeks to absorb any unlikely extension of the academic term. Even in that case I could still spend around 20 hours a week (a very unlikely situation, and one that would not last more than a week, since coursework is already underway online and is expected to finish well before the project starts).
[[Category:GSoC 2020 student proposals]] |
Latest revision as of 19:32, 4 June 2020
== Which of the published tasks are you interested in? What do you plan to do? ==
Adopt an unreleased language pair. I plan on developing the Hindi-Punjabi language pair in both directions i.e. hin-pan and pan-hin. This'll involve improving the monolingual dictionaries for both languages, the hin-pan bilingual dictionary and writing suitable transfer rules to bring this pair to a releasable state.
== My Proposal ==
=== Mentors/Experienced members in Contact ===
Francis Tyers, Hèctor Alòs i Font
=== Brief of deliverables ===
* A morphological dictionary of Punjabi with ~16,000 words
* Improvements (to current rules and word pairs) and additions to the hin-pan bidictionary
* Lexical selection and transfer rules for the pair
* Translators for hin-pan and pan-hin with WER < 20%
* A morphological disambiguator for the pair

I plan on achieving coverage close to that of the hin-urd pair. Ideally I aim to do better, since enough data is available and, given three months of dedicated work, the predicted results are not difficult to achieve.
=== Why Google and Apertium should sponsor it ===
* Both Hindi and Punjabi are widely spoken languages, by number of speakers and by geographic spread. Despite that, Punjabi in particular has very limited online resources.
* Services like Google Translate give unsatisfactory results on this pair (see Section 4.5 : Google Translate : Analysis and comparison). In contrast, I was able to get close to human translation for some sentences using minimal rules and time (see Section 6 : Coding challenge).
* I believe the Apertium architecture suits this pair perfectly and can replace the current state-of-the-art translator.
* This is an important project (it adds diversity to Apertium and to translation systems in general) which requires at least 2-3 months of dedicated work and will be a valuable resource. Since it will be publicly available, it will also drive research in vernacular languages, including in my own case (see Section 5 : Skills).
* To my knowledge, very few attempts have been made, even inside Apertium, at translation for Indian languages; the main one is headed by my lab, LTRC IIIT Hyderabad (and does not cover the hin-pan pair specifically). Even that project has lost activity recently and has some issues in its pipeline. Since these languages have many speakers but few easily available online resources, I think it is important to work on them, given the detailed morphological analysis Apertium dictionaries offer in addition to a great translation tool.
=== How and who it will benefit in society ===
The Apertium community is strongly committed to under-resourced and minoritised/marginalised languages, and Google helps in its own way via programs like GSoC and GCI. A good amount of vernacular literature and scripture could be circulated to a larger group of people if this project succeeds, and the dictionaries will be an important open-source resource for both languages. My larger aim is to develop a chain of pairs covering Indian languages: since Urdu and Punjabi share their roots, at least one more pair can then be developed with minimal effort. I will also properly document my design choices so that new Indic pairs can be taken up easily in subsequent years. I plan on working towards the Hindi-English pair next year, since by then I'll have a good understanding of the architecture a cross-language-family pair needs.
=== Google Translate : Analysis and comparison ===
Google Translate provides an interface to translate the pair in question. I have analysed its translations into Punjabi. The numerical results (computed on a small set of sentences from the coding challenge; the human translation, reviewed by 3 annotators, is also available in the repo) are given below (source-target):
* hin-pan: 79.23% WER
* hin-pan: 56.56% PER
* pan-hin: 82.23% WER
* pan-hin: 57.83% PER
The results are simply poor, especially on longer sentences with less frequently used words. It is easy to see that Google Translate does not try to capture case or tense in sentences, but rather picks the most commonly used form of a root. NER is very limited, in contrast to the Apertium module, which captures it well (because of its 3-stage transfer mechanism, I believe). The use of English as an intermediate (which seems to be the case here as well, since some words translate to English and fail to convert to Punjabi, perhaps because of parsing errors, as pointed out by Hector) causes problems such as incorrect tense on verbs, wrong choice or disappearance of some pronouns, and the inability to handle copula constructions and verbal clauses (especially when other words occur between two sub-clauses). Here is an example of some of these from the Hindi test text:
Original source text (Hindi):
गिरजा आज फिर उस औरत को साथ लाया था.वही दुबली पतली मोटी-मोटी आंखें तीखी नाक और सांवले रंग वाली औरत.
Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.
Google translation (Punjabi):
ਚਰਚ ਨੇ todayਰਤ ਨੂੰ ਅੱਜ ਵਾਪਸ ਲਿਆਇਆ, ਉਹੀ ਪਤਲੀ womanਰਤ ਜਿਹੜੀ ਸੰਘਣੀ ਅੱਖਾਂ, ਤਿੱਖੀ ਨੱਕ ਅਤੇ ਹਨੇਰਾ ਰੰਗ.
The Church brought back todayਰਤ today. The same thin womanਰਤ which big-eyed, pointy nose and dark colored.
==== Note ====
Girija got translated to Church, although it was used as a named entity in this case (Girija Ghar, where Ghar means 'house', is the Hindi and Punjabi translation for church). This is a good example of how poor the NER is: even though the NE occurs in the subject (nominal) position, the parser fails to capture it.
Translation achieved using the Apertium model (Punjabi):
ਗਿਰਜਾ ਅਜ੍ਜ ਫਿਰ ਉਸ ਔਰਤ ਨੂੰ ਨਾਲ ਲਾਇਆ ਸੀ.ਉਹੀ ਦੁਬਲੀ ਪਤਲੀ ਮੋਟੀ-ਮੋਟੀ ਅੱਖਾਂ ਤਿਖੀ ਨੱਕ ਅਤੇ ਸਾਉਲੇ #ਰਂਗ ਵਾਲੀ ਔਰਤ.
Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.
It is not difficult to see that most translations provided by Google Translate lead to a change in meaning. This is due to the following:
* Google Translate relies on the n-grams available to it.
* For rarely used words, it fails to translate them and, worse, fails to capture the tense.
* In complex sentences, the chunking (stages 1 and 2 of the Apertium model) fails, leading to a failure to capture meaning and, very often, to generate any syntactically correct sentence at all.
=== Implementation choices ===
* 3-stage transfer : I plan on using 3-stage transfer similar to hin-urd, since Hindi and Punjabi are (very) similar, especially in syntax and even morphology.
* Clean and consistent practices : As mentioned in the doc, paradigms will be defined so that the actual root form is always used. That is, if a word 'abc' takes certain inflections and its forms are [abd, abde, abcf], a forced paradigm for 'ab' will not be created. This seems obvious, but it has been done in the current dictionaries, and the actual reason behind that choice needs to be verified.
* AnnCorra dependencies : In cases where the same word can have different translations and POS and syntactic information is not enough, dependencies will be consulted. I plan to incorporate AnnCorra dependencies here, since they capture much more information than Universal Dependencies and clear a lot of ambiguities. ([http://docshare01.docshare.tips/files/20536/205364421.pdf link to paper])
* Manual disambiguation : Verb entries in particular will be checked manually as far as possible, since verbs deviate from regular behaviour more than any other category.
* Transliteration : For borrowed words and named entities (at least single-word NEs), transliteration will be used. This should not be a problem for this pair, since the two languages are very similar (importantly, in phonemic inventory) and both have phonemic orthographies.
* WX notation : I also plan on adding WX notation for all words in the bidix, similar to what has been done in the urd-hin bidix, so it is easier to understand for developers who cannot read the script. This will be crucial for future work involving Indian languages.
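On the transliteration point above: the Devanagari and Gurmukhi Unicode blocks are both ISCII-derived and largely parallel, so a first-pass transliterator can simply shift codepoints by the block offset. The sketch below is a naive illustration of this idea, not a complete module; real use would need an exception table for codepoints that are unassigned or behave differently in Gurmukhi.

```python
# Naive Devanagari -> Gurmukhi transliteration by Unicode block offset.
# The two blocks (U+0900-097F and U+0A00-0A7F) are laid out in parallel,
# so a plain codepoint shift covers most letters. Sketch only: a real
# transliterator needs exceptions for non-parallel codepoints.
DEVANAGARI = range(0x0900, 0x0980)
OFFSET = 0x0A00 - 0x0900  # Gurmukhi block starts 0x100 higher

def transliterate_dev_to_gur(text: str) -> str:
    return "".join(
        chr(ord(ch) + OFFSET) if ord(ch) in DEVANAGARI else ch
        for ch in text
    )

print(transliterate_dev_to_gur("कम"))  # -> ਕਮ
```

The reverse direction is the same shift with the sign flipped, which is what makes transliteration cheap for this particular pair.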
=== Current state of dictionaries ===
A released module already exists for Hindi (as part of the urd-hin pair). However, a lot of anomalies remain in the Hindi mono-dictionary. I've compiled a preliminary list of some of these here. I plan on manually going through all the stems to finish this list, which will also help me understand certain choices and will be useful in the community bonding period. Apart from this, the existing hin-pan bi-dictionary also needs massive improvement. The first step of this project will be to revise these lists of issues and come up with a sustainable solution. It is crucial that the changes, especially to the Hindi mono-dictionary, do not negatively affect the urd-hin pair (or the hindi-bengali, hindi-marathi and hindi-gujarati pairs, which also have a little work done).
The problems in the current dictionaries include:
* Multiple unnecessary analyses. Fix: keep only the first analysis and add others, if required, using <e r="RL">.
<pre>
<pardef n="गलत__adj">
  <e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="obl"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="obl"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="nom"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="nom"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="obl"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="obl"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="mfn"/><s n="sp"/></r></p></e>
</pardef>
</pre>
* Inflections added separately. Fix: stick to consistently adding inflections to the root in a single definition.
<pre>
<e><p><l>जब<s n="adv"/></l><r>ਜਦ<s n="adv"/></r></p></e>
<e><p><l>जब<s n="adv"/></l><r>ਜਦੋ<s n="adv"/></r></p></e>
<e><p><l>जब<s n="adv"/></l><r>ਜਦੋਂ<s n="adv"/></r></p></e>
</pre>
* Multiple translations of the same word (in the bidix). While this is fine going from right to left, it is not intuitive which definition is picked when translating from left to right. Fix: add an extra flag/comment.
<pre>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਅੱਖਰ<s n="n"/><s n="m"/></r></p></e>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>-<s n="n"/><s n="m"/></r></p></e>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਸ਼ਬਦਾਂਗ<s n="n"/><s n="m"/></r></p></e>
</pre>
* Inconsistent/statistically incorrect pairs. Fix: statistical and manual disambiguation.
<pre>
<e><p><l>एवं<s n="cnjcoo"/></l><r>ਤੇ<s n="cnjcoo"/></r></p></e>
<e><p><l>और<s n="cnjcoo"/></l><r>ਅਤੇ<s n="cnjcoo"/></r></p></e>
<e><p><l>खुद<s n="prn"/></l><r>ਖੁਦ<s n="prn"/></r></p></e>
<e><p><l>ख़ुद<s n="prn"/></l><r>ਖ਼ੁਦ<s n="prn"/></r></p></e>
<e><p><l>तुम<s n="prn"/></l><r>ਤੂੰ<s n="prn"/></r></p></e>
</pre>
* Incomplete morphological analysis (not all forms added). Fix: statistical and manual fixes, involving comparisons and additions.
<pre>
<e><p><l>हम<s n="prn"/></l><r>ਅਸੀਂ<s n="prn"/></r></p></e>
</pre>
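Issues like multiple translations of the same left-side entry can be flagged automatically before manual review. The following is a sketch assuming the standard .dix XML layout; the function name and the toy usage are my own illustration, not an existing Apertium tool.

```python
# Sketch: flag bidix entries whose left (Hindi) side has more than one
# right (Punjabi) translation, so they can be queued for lexical
# selection or an extra flag/comment. Assumes standard .dix structure:
# <e><p><l>lemma<s n="tag"/>...</l><r>lemma<s n="tag"/>...</r></p></e>
import xml.etree.ElementTree as ET
from collections import defaultdict

def side_key(el):
    """Lemma plus tags of an <l> or <r> element, e.g. 'अक्षर<n><m>'."""
    lemma = el.text or ""
    tags = "".join(f"<{s.get('n')}>" for s in el.findall("s"))
    return lemma + tags

def multiple_translations(dix_path):
    root = ET.parse(dix_path).getroot()
    pairs = defaultdict(set)
    for e in root.iter("e"):
        p = e.find("p")
        if p is None:
            continue  # skip identity/paradigm-only entries
        l, r = p.find("l"), p.find("r")
        if l is not None and r is not None:
            pairs[side_key(l)].add(side_key(r))
    return {left: rights for left, rights in pairs.items() if len(rights) > 1}
```

Running this over the current hin-pan bidix would list every ambiguous entry, e.g. the अक्षर case shown above.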
Why haven't I fixed these yet? Some are clearly errors, but I wanted to first understand why they exist and how the dictionaries have been compiled so far. I haven't been able to get in touch with Francis lately, through IRC or mail, but plan on finishing this as soon as possible (once the list is complete, I can raise an issue and work on my fork simultaneously). Hector also pointed out to me that this makes morphological disambiguation harder but probably makes transfer easier, so I want to confirm this first. As suggested by Francis, I'll fix these issues in the next PR (expected by 15 April), while also starting to learn the Urdu alphabet and checking whether these changes affect the urd-hin pair in any way.
=== Resources ===
[to be added - under confirmation for public use]
* Hindi-Punjabi Dictionary - Wiktionary
* Punjabi-Hindi Dictionary - Glosbe (awaiting confirmation)
* Punjabi Articles - Wikipedia
* Punjabi Dictionary - Wiktionary
* Wikidumps-punjabi 1
* Wikidumps-punjabi 2
* Wikidumps-punjabi 3
* Wikidumps-hindi 1
* Wikidumps-hindi 2
* Wikidumps-hindi 3
=== Workplan ===
{| class="wikitable"
! PHASE !! DURATION !! GOALS OF THE WEEK !! BIDIX !! WER !! Coverage
|-
| Post Application Period || || || || ||
|-
| Community Bonding Period : Closed Categories || || || || ||
|-
| Community Bonding Period : Adjectives || || || || ||
|-
| Week ONE : Verbal Paradigms || || || ~ 3,000 || ||
|-
| Week TWO : Dictionary Expansion || || || ~ 5,000 || ||
|-
| Week THREE : Dictionary Expansion || || || ~ 6,500 || < 25% (hin-pan) || > 65% (hin-pan) <br> > 60% (pan-hin)
|-
| Week FOUR : More work on verbs and testing || || || ~ 7,500 || ||
|-
| Week FIVE : Focus on nouns || || || ~ 9,000 || ||
|-
| Week SIX : Expanding Dictionaries || || First Evaluation (June 29th - July 3rd) || ~ 10,500 || ||
|-
| Week SEVEN : Expanding Dictionaries || || || ~ 12,000 || ||
|-
| Week EIGHT : Transfer rules (hin-pan) || || || ~ 13,000 || ||
|-
| Week NINE : Transfer rules || || || ~ 14,000 || ||
|-
| Week TEN || || Second Evaluation (July 27th - July 31st) || ~ 15,000 || < 20% (hin-pan) <br> < 25% (pan-hin) || > 82% (hin-pan) <br> > 77% (pan-hin)
|-
| Week ELEVEN || || || ~ 16,000 || ||
|-
| Week TWELVE : Testvoc || || || ~ 16,500 || ||
|-
| Week THIRTEEN : Finishing up || || PERSONAL CODE FREEZE : August 22nd || ~ 17,000 || ||
|-
| Week FOURTEEN : Review || || Final evaluation (August 24th - August 31st) || ~ 17,000 || ~ 15% (hin-pan) <br> < 20% (pan-hin) || ~ 90% (hin-pan) <br> ~ 83% (pan-hin)
|}
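The coverage targets in the workplan refer to naive coverage: the fraction of tokens that receive at least one morphological analysis. A sketch of how this can be estimated from Apertium-style analyser output, where unknown words are marked with an asterisk (illustration only, operating on a pre-captured output string rather than a live lt-proc call):

```python
# Naive coverage over morph-analyser output: lexical units look like
# ^surface/analysis1/analysis2$, and unknown words are emitted as
# ^surface/*surface$. Coverage = known units / all units.
import re

def coverage(analysed_text: str) -> float:
    units = re.findall(r"\^([^$]*)\$", analysed_text)
    known = sum(1 for u in units if not u.split("/", 1)[-1].startswith("*"))
    return known / len(units) if units else 0.0
```

Measuring this weekly over the Wikipedia dumps listed under Resources gives the numbers tracked in the table.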