User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi

Project progress can be seen [https://wiki.apertium.org/wiki/User:Pmodi/GSOC_2020_proposal:_Hindi-Punjabi/progress here].
== Contact Information ==

'''Name:''' Priyank Modi<br />
'''Email:''' priyankmodi99@gmail.com<br />
'''Current Designation:''' Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (completing 6th semester/3rd year in April '20) and a Teaching Assistant for the Linguistics courses (listed in skills)<br />
'''IRC:''' pmodi<br />
'''Timezone:''' GMT +0530 hrs<br />
'''Linkedin:''' https://www.linkedin.com/in/priyank-modi-81584b175/ <br />
'''Github:''' https://github.com/priyankmodiPM <br />
'''Website:''' https://priyankmodipm.github.io/ <br />
== Why I am interested in Apertium ==

Apertium is an open-source rule-based machine translation system. As an undergraduate researcher at the LTRC lab in IIIT-H, currently working on understanding the nuances of Indian languages and developing systems that improve our analysis of them, I find machine translation interesting because it is a complex problem with a very important application, one that, despite being a recognised problem for years, is still often considered achievable only through human involvement.

Because Apertium is free/open-source software.<br />

Because its community is strongly committed to under-resourced and minoritised/marginalised languages.<br />

Translating data into other languages, especially low-resource languages, gives the speakers of those languages access to valuable data and can help in several domains, such as education, news and the judiciary. The dictionaries made in the process are crucial for low-resource languages and can even help in building spell checkers.<br />

Because there is a lot of good work done, and being done, in it.<br />

Because it is not only machine translation, but also free resources that can be used for other purposes: e.g. dictionaries, morphological analysers, spellcheckers, etc.<br />

The most striking factor for me is that while recent attempts to solve MT lean towards neural networks and deep learning, which fall short when it comes to resource-poor languages, Apertium tackles the problem with a rule-based approach. Not only is this beneficial because of the level of understanding it provides, instead of simply blaming data for poor results, it can actually perform better for low-resource languages (even for the pair I present in my proposal).

A rule-based, open-source tool really helps communities with resource-poor language pairs by giving them free translations for their needs, and that is why I want to work on improving it. I want to work with Apertium and GSoC so I can contribute to an important open-source tool while also honing my own skills, and I hope to become a part of this amazing community of developers!
== Which of the published tasks are you interested in? What do you plan to do? ==

'''Adopt an unreleased language pair.''' I plan on developing the Hindi-Punjabi language pair in both directions, i.e. '''hin-pan and pan-hin'''. This will involve improving the monolingual dictionaries for both languages, improving the hin-pan bilingual dictionary and writing suitable transfer rules to bring this pair to a releasable state.
== My Proposal ==

=== Mentors/Experienced members in Contact ===

Francis Tyers, Hèctor Alòs i Font

=== Brief of deliverables ===

* A morph-based dictionary of Punjabi with ~16,000 words
* Improvements (current rules and word pairs) and additions to the hin-pan bidictionary
* Lexical selection and transfer rules for the pair
* Translator for hin-pan and pan-hin with WER <20%
* Morphological disambiguator for the pair

I plan on achieving coverage close to that of the [http://wiki.apertium.org/wiki/Hindi_and_Urdu/Work_plan_(GSOC_2014) hin-urd pair]. Ideally, I plan on getting better results than this pair, since I feel enough data is available, and given three months of dedicated work, the predicted results aren't very difficult to achieve.
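To give a concrete picture of the first deliverable, here is a minimal sketch of what a Punjabi monodix entry and its paradigm could look like, for the noun ਘਰ ('house'). The paradigm name, the tag order and the assumption that ਘਰ inflects like its Hindi cognate घर (bare stem everywhere except the plural oblique ਘਰਾਂ) are illustrative only; the real entries will follow the conventions of the released apertium-hin/urd-hin dictionaries.

<pre>
<pardef n="ਘਰ__n">
  <e><p><l></l><r><s n="n"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
  <e><p><l></l><r><s n="n"/><s n="m"/><s n="sg"/><s n="obl"/></r></p></e>
  <e><p><l></l><r><s n="n"/><s n="m"/><s n="pl"/><s n="nom"/></r></p></e>
  <e><p><l>ਾਂ</l><r><s n="n"/><s n="m"/><s n="pl"/><s n="obl"/></r></p></e>
</pardef>

<section id="main" type="standard">
  <e lm="ਘਰ"><i>ਘਰ</i><par n="ਘਰ__n"/></e>
</section>
</pre>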
=== Why Google and Apertium should sponsor it ===

* Both Hindi and Punjabi are widely spoken languages, both by number of speakers and geographic spread. Despite that, Punjabi especially has very limited online resources.
* Services like Google Translate give unsatisfactory results when it comes to translation of this pair (see Section 4.5 : Google Translate : Analysis and comparison). On the contrary, I was able to achieve close-to-human translation for some sentences using minimal rules and time (see Section 6 : Coding Challenge).
* I believe the Apertium architecture is suited perfectly for this pair and can '''replace the current state-of-the-art translator'''.
* This is an important project (since it adds diversity to Apertium and translation systems in general) which requires at least 2-3 months of dedicated work and will be an important resource. In addition, since it'll be publicly available, it'll drive research in vernacular languages, even in my own case (see Section 5 : Skills).
* To my knowledge, very few attempts have been made, even inside Apertium, at translation for Indian languages, the prime one headed by my lab, LTRC IIIT Hyderabad (not covering the hin-pan pair specifically). But even that project has been losing activity recently and has some issues in its pipeline. Since these languages have a good number of speakers but not enough easily available online resources, I think it's important to work on them, given the detailed morphological analysis Apertium dictionaries offer in addition to providing a great translation tool.
=== How and who it will benefit in society ===

The Apertium community is strongly committed to under-resourced and minoritised/marginalised languages, and Google helps in its own way via programs like GSoC and GCI. There exists a good amount of vernacular literature and scripture which could be circulated to a larger group of people if this project is successful. It'll be an important open-source dictionary resource for both languages. My larger aim with this project is to develop a chain of pairs covering Indian languages. Since Urdu and Punjabi share their roots, at least one more pair can be developed with minimal effort. My goal in this project will also be to properly document my design choices so that new Indic pairs can be taken up easily in subsequent years. I plan on working towards the Hindi-English pair next year, since by then I'll have a good understanding of the architecture a cross-language-family pair uses.
=== Google Translate : Analysis and comparison ===

Google Translate provides an interface to translate the pair in question. I have analysed the results of Google's translation into Punjabi. The numerical results, computed on a small set of sentences from the coding challenge (the human translation, reviewed by 3 annotators, is also available in the repo), are given below (source-target):

* hin-pan: 79.23% WER
* hin-pan: 56.56% PER
* pan-hin: 82.23% WER
* pan-hin: 57.83% PER
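For reference, WER here is the standard word error rate: the minimum number of word substitutions <math>S</math>, deletions <math>D</math> and insertions <math>I</math> needed to turn the system output into the reference translation, divided by the reference length <math>N</math>. PER is the position-independent variant, which compares the same words while ignoring word order, so <math>\mathrm{PER} \le \mathrm{WER}</math> always holds:

<math>\mathrm{WER} = \frac{S + D + I}{N}</math>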
The results are simply poor, especially when it comes to longer sentences with less frequently used words. It is rather easy to see that Google Translate doesn't try to capture the case or tense in sentences, but rather picks the most commonly used form of that particular root. NER is very limited, in contrast to the Apertium module, which captures it well (because of its 3-stage transfer mechanism, I believe). The use of English as an intermediate (which seems to be the case here as well, because some words translate to English and fail to convert to Punjabi, maybe because of some errors in parsing, as pointed out by Hèctor) causes problems, such as the incorrect choice of tense in the verbs, the wrong choice/disappearance of some pronouns and the inability to handle copula constructions as well as verbal clauses (especially when other words occur between two sub-clauses). Here is an example of some of these from the Hindi test text:
Original source text (Hindi):

<blockquote>
'''गिरजा आज फिर उस औरत को साथ लाया था.वही दुबली पतली मोटी-मोटी आंखें तीखी नाक और सांवले रंग वाली औरत. <br>'''

''Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.''
</blockquote>
Google translation (Punjabi):

<blockquote>
'''ਚਰਚ ਨੇ todayਰਤ ਨੂੰ ਅੱਜ ਵਾਪਸ ਲਿਆਇਆ, ਉਹੀ ਪਤਲੀ womanਰਤ ਜਿਹੜੀ ਸੰਘਣੀ ਅੱਖਾਂ, ਤਿੱਖੀ ਨੱਕ ਅਤੇ ਹਨੇਰਾ ਰੰਗ. <br>'''

''The Church brought back todayਰਤ today. The same thin womanਰਤ which big-eyed, pointy nose and dark colored.''

<br> <br>

==== Note : ====

Girija got translated to Church, although it was used as a named entity in this case (Girija Ghar (house) is the Hindi and Punjabi word for church). This is a good example of how poor the NER is: even though the NE occurs in the subject (nominal) position, the parser fails to capture it.

</blockquote>
Translation achieved using the Apertium model (Punjabi):

<blockquote>
'''ਗਿਰਜਾ ਅਜ੍ਜ ਫਿਰ ਉਸ ਔਰਤ ਨੂੰ ਨਾਲ ਲਾਇਆ ਸੀ.ਉਹੀ ਦੁਬਲੀ ਪਤਲੀ ਮੋਟੀ-ਮੋਟੀ ਅੱਖਾਂ ਤਿਖੀ ਨੱਕ ਅਤੇ ਸਾਉਲੇ #ਰਂਗ ਵਾਲੀ ਔਰਤ. <br>'''

''Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.''
</blockquote>
It is not difficult to see that most translations provided by Google Translate lead to a change in meaning. This is due to the following :

* Google Translate relies on the n-grams available to it.
* In the case of rarely used words, it fails to translate them and, worse, fails to capture the tense.
* In complex sentences, the chunking (stages 1 and 2 as per the Apertium model) fails, leading to a failure to capture the meaning and, very often, to generate any syntactically correct sentence.
=== Implementation choices ===

* '''3-stage transfer :''' I plan on using the 3-stage transfer similar to hin-urd, since Hindi and Punjabi are (very) similar, especially when it comes to syntax and even morphology (see the sketch after this list).
* '''Clean and consistent practices :''' As mentioned in the doc as well, it'll be ensured that each paradigm is defined such that its root form is always used. What I mean by this is that if a word 'abc' takes certain inflections and its forms are [abd, abde, abcf], then a forced pair for 'ab' won't be formed. This seems obvious, but has been done in the current dictionaries, and the actual reason behind choosing this will need to be verified.
* '''AnnCorra Dependencies :''' In cases where the same word can have different translations and POS, and syntactic information is not enough, dependency information will be sought. I plan to incorporate AnnCorra dependencies here, since these capture much more information than Universal Dependencies and clear a lot of ambiguities. [http://docshare01.docshare.tips/files/20536/205364421.pdf link to paper]
* '''Manual Disambiguation :''' For verbs specifically, entries will be checked manually as much as possible, since their tendency to deviate from normal behaviour is much greater than for any other category.
* '''Transliteration :''' For borrowed words and named entities (at least single-word NEs), transliteration will be used. This shouldn't be a problem for this pair, since the two languages are very similar (importantly, in phonemic inventory) and have phonemic orthography.
* '''WX notations :''' I also plan on adding the WX notations for all words in the bidix, similar to what has been done in the urd-hin bidix, so it's easier to understand for developers who can't read the script. This'll be crucial for future work involving Indian languages.
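To make the transfer plan concrete, below is a minimal sketch of the kind of stage-1 (chunker) rule this could start from: it groups a noun with a following postposition into a single SN chunk, which the inter- and post-chunk stages can then move and inflect as one unit. The category definitions, the tag names (n, post) and the chunk name are assumptions for illustration, modelled on apertium-hin conventions rather than taken from the pair.

<pre>
<section-def-cats>
  <def-cat n="nom"><cat-item tags="n.*"/></def-cat>
  <def-cat n="post"><cat-item tags="post"/></def-cat>
</section-def-cats>

<section-rules>
  <rule comment="noun + postposition -> one SN chunk">
    <pattern>
      <pattern-item n="nom"/>
      <pattern-item n="post"/>
    </pattern>
    <action>
      <out>
        <chunk name="nom_post">
          <tags><tag><lit-tag v="SN"/></tag></tags>
          <lu><clip pos="1" side="tl" part="whole"/></lu>
          <b pos="1"/>
          <lu><clip pos="2" side="tl" part="whole"/></lu>
        </chunk>
      </out>
    </action>
  </rule>
</section-rules>
</pre>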
=== Current state of dictionaries ===

A released module already exists for Hindi (as part of the urd-hin pair). However, a lot of anomalies still exist in the Hindi mono-dictionary. I've compiled a preliminary list of some of these [https://docs.google.com/document/d/1GmBIlGVxMinhJJVZWnVWLHTs8vvKAuxA0vCLBDOyBJY/edit?usp=sharing here]. I plan on manually going through all the stems and finishing this list. This'll also help me in understanding certain choices and will help in the community bonding period. Apart from this, the existing state of the hin-pan bi-dictionary also needs massive improvement. The first step of this project will be to revise these lists of issues and come up with a sustainable solution. It'll be crucial that the changes made, especially to the Hindi mono-dictionary, do not affect the urd-hin pair (and the hindi-bengali, hindi-marathi and hindi-gujarati pairs, which also have a little work done) in a negative way.
The problems in the current dictionaries include :

* Multiple unnecessary analyses. '''Fix -''' Keep only the first analysis and add the others, if required, using <e r="RL">

<pre>
<pardef n="गलत__adj">
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="mfn"/><s n="sp"/></r></p></e>
</pardef>
</pre>
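A sketch of that fix, assuming the usual lttoolbox semantics where entries marked r="RL" take part in generation only: the compact mfn.sp entry stays as the single analysis, while the redundant analyses are demoted so that every tag combination remains generable.

<pre>
<pardef n="गलत__adj">
<e><p><l></l><r><s n="adj"/><s n="mfn"/><s n="sp"/></r></p></e>
<!-- still generable, but no longer returned as extra analyses -->
<e r="RL"><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
<e r="RL"><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
<!-- ...and so on for the remaining gender/number/case combinations -->
</pardef>
</pre>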
* Inflections added separately. '''Fix -''' stick consistently to adding inflections to the root in a single definition

<pre>
<e><p><l>जब<s n="adv"/></l><r>ਜਦ<s n="adv"/></r></p></e>
<e><p><l>जब<s n="adv"/></l><r>ਜਦੋ<s n="adv"/></r></p></e>
<e><p><l>जब<s n="adv"/></l><r>ਜਦੋਂ<s n="adv"/></r></p></e>
</pre>
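One way this could be normalised (a sketch, assuming ਜਦੋਂ is the variant we want hin-pan to generate): keep a single unrestricted entry and mark the other variants r="RL", so they still map back to जब in the pan-hin direction but hin-pan output becomes deterministic.

<pre>
<!-- hin -> pan always generates ਜਦੋਂ -->
<e><p><l>जब<s n="adv"/></l><r>ਜਦੋਂ<s n="adv"/></r></p></e>
<!-- the other variants still map back to जब in pan -> hin -->
<e r="RL"><p><l>जब<s n="adv"/></l><r>ਜਦ<s n="adv"/></r></p></e>
<e r="RL"><p><l>जब<s n="adv"/></l><r>ਜਦੋ<s n="adv"/></r></p></e>
</pre>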
* Multiple translations of the same word (in the bidix). While this is fine when going from right to left, it's not intuitive which definition is picked during translation from left to right. '''Fix -''' add some extra flag/comment

<pre>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਅੱਖਰ<s n="n"/><s n="m"/></r></p></e>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>-<s n="n"/><s n="m"/></r></p></e>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਸ਼ਬਦਾਂਗ<s n="n"/><s n="m"/></r></p></e>
</pre>
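Besides a flag or comment, Apertium's lexical selection module (apertium-lex-tools) is the natural place to encode which alternative wins: a default rule plus context rules in the pair's .lrx file. The sketch below makes ਅੱਖਰ the default for अक्षर and selects ਸ਼ਬਦਾਂਗ in one context; the context lemma उच्चारण ('pronunciation') is a purely hypothetical collocation, chosen only to show the rule shape.

<pre>
<rules>
  <!-- default: अक्षर -> ਅੱਖਰ -->
  <rule weight="0.5">
    <match lemma="अक्षर" tags="n.*"><select lemma="ਅੱਖਰ"/></match>
  </rule>
  <!-- hypothetical context rule: pick ਸ਼ਬਦਾਂਗ after उच्चारण -->
  <rule weight="1.0">
    <match lemma="उच्चारण"/>
    <match lemma="अक्षर" tags="n.*"><select lemma="ਸ਼ਬਦਾਂਗ"/></match>
  </rule>
</rules>
</pre>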
* Inconsistent/Statistically Incorrect pairs. '''Fix -''' Statistical and Manual disambiguation

<pre>
<e><p><l>एवं<s n="cnjcoo"/></l><r>ਤੇ<s n="cnjcoo"/></r></p></e>
<e><p><l>और<s n="cnjcoo"/></l><r>ਅਤੇ<s n="cnjcoo"/></r></p></e>
<e><p><l>खुद<s n="prn"/></l><r>ਖੁਦ<s n="prn"/></r></p></e>
<e><p><l>ख़ुद<s n="prn"/></l><r>ਖ਼ੁਦ<s n="prn"/></r></p></e>
<e><p><l>तुम<s n="prn"/></l><r>ਤੂੰ<s n="prn"/></r></p></e>
</pre>
* Incomplete morph-analysis (all forms not added). '''Fix -''' Statistical and manual fixes, involving comparisons and additions

<pre>
<e><p><l>हम<s n="prn"/></l><r>ਅਸੀਂ<s n="prn"/></r></p></e>
</pre>
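For example, the oblique form of the same pronoun pair is one of the missing entries; a sketch of the kind of addition needed (the exact tag string is illustrative only, pending a check of the pair's pronoun tagset):

<pre>
<e><p><l>हमें<s n="prn"/></l><r>ਸਾਨੂੰ<s n="prn"/></r></p></e>
</pre>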
'''Why haven't I fixed these yet?''' Some of these are clearly errors, but I wanted to first know why they exist in the first place and how the dictionaries have been compiled till now. <strike>I haven't been able to get in touch with Francis lately, neither through IRC nor mail, but plan on finishing this asap (once the list is complete, I can raise an issue and work on my fork simultaneously).</strike> Hèctor also pointed out to me that this makes morphological disambiguation harder, but probably makes transfer easier, so I want to confirm this first. As suggested by Francis, I'll hopefully be fixing these issues in the next PR (expected by 15 April), but I'll also start work on learning the Urdu alphabet and checking whether these changes affect the urd-hin pair in any way.
=== Resources ===

[to be added - under confirmation for public use] <br>

[https://hi.wiktionary.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A5%8D%E0%A4%B7%E0%A4%A8%E0%A4%B0%E0%A5%80:%E0%A4%AA%E0%A4%82%E0%A4%9C%E0%A4%BE%E0%A4%AC%E0%A5%80-%E0%A4%B9%E0%A4%BF%E0%A4%A8%E0%A5%8D%E0%A4%A6%E0%A5%80_%E0%A4%B6%E0%A4%AC%E0%A5%8D%E0%A4%A6%E0%A4%95%E0%A5%8B%E0%A4%B6_%E0%A4%85_%E0%A4%B8%E0%A5%87_%E0%A4%94 Hindi-Punjabi Dictionary - Wiktionary] <br>

[https://glosbe.com/pa/hi Punjabi-Hindi dictionary - Glosbe] (awaiting confirmation) <br>

[https://pa.wikipedia.org/wiki/%E0%A8%AE%E0%A9%81%E0%A9%B1%E0%A8%96_%E0%A8%B8%E0%A8%AB%E0%A8%BC%E0%A8%BE Punjabi Articles - Wikipedia] <br>

[https://pa.wiktionary.org/wiki/%E0%A8%AE%E0%A9%81%E0%A9%B1%E0%A8%96_%E0%A8%AA%E0%A9%B0%E0%A8%A8%E0%A8%BE Punjabi Dictionary - Wiktionary] <br>

[https://dumps.wikimedia.org/other/static_html_dumps/April_2007/pa/ Wikidumps-punjabi 1] <br>

[https://dumps.wikimedia.org/other/static_html_dumps/2008-06/pa/ Wikidumps-punjabi 2] <br>

[https://dumps.wikimedia.org/other/static_html_dumps/September_2007/pa/ Wikidumps-punjabi 3] <br>

[https://dumps.wikimedia.org/other/static_html_dumps/2008-06/hi/ Wikidumps-hindi 1] <br>

[https://dumps.wikimedia.org/other/static_html_dumps/April_2007/hi/ Wikidumps-hindi 2] <br>

[https://dumps.wikimedia.org/other/static_html_dumps/September_2007/hi/ Wikidumps-hindi 3]
=== Workplan ===

{| class="wikitable" style="background-color:#b3ffb3;"
! style="width: 13%" | PHASE
! style="width: 12%" | DURATION
! style="width: 36%" | GOALS OF THE WEEK
! style="width: 13%" | BIDIX
! style="width: 13%" | WER
! style="width: 13%" | Coverage
|-style="background-color:#dbfedb;"
| style="text-align:center" | Post Application Period
|
* START: April 6th
* END: May 3rd
|
* List and discuss implementation choices of the hin-pan bidix and urd-hin pair
* Reading up on the details of transfer rules (whether or not a 3-stage transfer is the best way for this pair) and assigning weights
* Finding language resources
* Making frequency lists
|
|
|
|-style="background-color:#b3ffb3;"
| style="text-align:center" | Community Bonding Period : Closed Categories
|
* START: May 4th
* END: May 24th
|
* Function words (voc prn, cnj, det, prn, post, gen_endings)
* Transfer rules for post-positions
|
|
|
|-style="background-color:#aaffaa;"
| style="text-align:center" | Community Bonding Period : Adjectives
|
* START: May 25th
* END: May 31st
|
* Punjabi mono-dictionary : adj and adv
* Expanding the bilingual dictionary
* Lexical selection rules for adj and adv
|
|
|
|-style="background-color:#91f991;"
| style="text-align:center" | Week ONE : Verbal Paradigms
|
* START: June 1st
* END: June 7th
|
* Punjabi mono-dictionary : verbal paradigms (vblex, vbser, vaux)
* Expanding the bilingual dictionary
* Lexical selection rules for verbs
* testvoc : adj, adv
| style="text-align:center" | ~ 3,000
|
|
|-style="background-color:#82fa82;"
| style="text-align:center" | Week TWO : Dictionary Expansion
|
* START: June 8th
* END: June 14th
|
* Expanding the bilingual dictionary
* Lexical selection rules
| style="text-align:center" | ~ 5,000
|
|
|-style="background-color:#64ff64;"
| style="text-align:center" | Week THREE : Dictionary Expansion
|
* START: June 15th
* END: June 21st
|
* Expanding the bilingual dictionary
* Lexical selection rules
| style="text-align:center" | ~ 6,500
| style="text-align:center" | < 25% (hin-pan)
| style="text-align:center" | > 65% (hin-pan) <br> > 60% (pan-hin)
|-style="background-color:#3dff3d;"
| style="text-align:center" | Week FOUR : More work on verbs and testing
|
* START: June 15th
* END: June 21st
|
* Ref Week 1 (***)
* Expanding the bilingual dictionary
* Lexical selection rules
* Manual disambiguation of rules, hin-pan (src-trg)
| style="text-align:center" | ~ 7,500
|
|
|-style="background-color:#13ff13;"
| style="text-align:center" | Week FIVE : Focus on Nouns
|
* START: June 22nd
* END: June 28th
|
* Ref Week 2 (***)
* Expanding the bilingual dictionary
* Lexical selection rules
| style="text-align:center" | ~ 9,000
|
|
|-style="background-color:#00ee00;"
| style="text-align:center" | Week SIX : Expanding Dictionaries
|
* START: June 29th
* END: July 5th
|
* Ref Week 3 (***)
* Expanding the bilingual dictionary
* Lexical selection rules
'''First Evaluation (June 29th - July 3rd)'''
| style="text-align:center" | ~ 10,500
|
|
|-style="background-color:#00df00;"
| style="text-align:center" | Week SEVEN : Expanding Dictionaries
|
* START: July 6th
* END: July 12th
|
* Ref Week 4 (***)
* Expanding the bilingual dictionary
* Lexical selection rules
| style="text-align:center" | ~ 12,000
|
|
|-style="background-color:#00d700;"
| style="text-align:center" | Week EIGHT : Transfer rules (hin-pan)
|
* START: July 13th
* END: July 19th
|
* Expanding the bilingual dictionary
* Lexical selection rules
* Transfer rules (hin-pan)
| style="text-align:center" | ~ 13,000
|
|
|-style="background-color:#00b600;"
| style="text-align:center" | Week NINE : Transfer rules
|
* START: July 20th
* END: July 26th
|
* Expanding the bilingual dictionary
* Lexical selection rules
* Transfer rules : pan-hin
| style="text-align:center" | ~ 14,000
|
|
|-style="background-color:#009d00;"
| style="text-align:center" | Week TEN
|
* START: July 27th
* END: August 2nd
|
* Expanding the bilingual dictionary
* Lexical selection rules
'''Second Evaluation (July 27th - July 31st)'''
| style="text-align:center" | ~ 15,000
| style="text-align:center" | < 20% (hin-pan) <br> < 25% (pan-hin)
| style="text-align:center" | > 82% (hin-pan) <br> > 77% (pan-hin)
|-style="background-color:#008a00;"
| style="text-align:center" | Week ELEVEN
|
* START: August 3rd
* END: August 9th
|
* Expanding the bilingual dictionary
* Lexical selection rules
* Disambiguation rules
* Transfer rules
| style="text-align:center" | ~ 16,000
|
|
|-style="background-color:#007e00;"
| style="text-align:center" | Week TWELVE : Testvoc
|
* START: August 10th
* END: August 16th
|
* Testvoc hin-pan
* Add rules, words
| style="text-align:center" | ~ 16,500
|
|
|-style="background-color:#006d00;"
| style="text-align:center" | Week THIRTEEN : Finishing up
|
* START: August 17th
* END: August 23rd
|
* Testvoc pan-hin
* Add rules, words
'''PERSONAL CODE FREEZE : August 22nd'''
| style="text-align:center" | ~ 17,000
|
|
|-style="background-color:#005a00;"
| style="text-align:center" | Week FOURTEEN : Review
|
* START: August 24th
* END: August 30th
|
* Review and documentation
'''Final evaluation (August 24th - August 31st)'''
| style="text-align:center" | ~ 17,000
| style="text-align:center" | ~15% (hin-pan) <br> < 20% (pan-hin)
| style="text-align:center" | ~90% (hin-pan) <br> ~83% (pan-hin)
|}

(***) The tasks are similar to the tasks of the referenced week.
== Skills ==

I'm currently a third-year student (concluding in early April '20) at IIIT Hyderabad, where I'm studying Computational Linguistics. It is a dual-degree course where we study Computer Science, Linguistics, NLP and more. I am also a teaching assistant for courses on Language Typology, Universals and Historical Linguistics this semester (and TA'd for courses on NLP last semester), so I understand linguistic concepts very well, along with the handling of linguistic data.

I've been interested in linguistics from the very beginning, and due to the rigorous programming courses, I'm also adept with several programming languages and tools like Python, C++, XML and Bash scripting. I'm skilled in writing algorithms, data structures and machine learning algorithms as well.

I also have a lot of experience studying and generating data, which I feel is important especially for the problem mentioned in this proposal. My paper ''''Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus'''' was recently accepted at the 16th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at '''LREC 2020''' and at WILDRE-5 (again LREC 2020). The project described in the paper presents the largest dataset for the purpose of event detection. I am working on extending the same for Punjabi using transfer learning. ([https://sigsem.uvt.nl/isa16/ ISA list of accepted papers], [https://www.researchgate.net/publication/340266259_Hindi_TimeBank_An_ISO-TimeML_Annotated_Reference_Corpus Link to paper])

I am also closely involved with the committee conducting the Asia-Pacific Linguistics Olympiad (which holds a camp, and mentors and prepares students for the International Linguistics Olympiad) and help with the organisation and judging for the same.

Due to the focused nature of our courses, I have worked on several projects, such as building anaphora resolution systems, abstractive summarisers (using pointer-generators, hierarchical attention and transformers), POS taggers, named entity recognisers, simple Q-A systems, a Linux-based shell, etc., all of which required a working understanding of Natural Language Processing and scripting. Some of these projects aren't available on GitHub because of the privacy settings, but can be provided if required.

I am fluent in English, Hindi and Punjabi.
== Coding challenge ==

I've completed the coding challenge for translation from Hindi into Punjabi. You can find my work here : [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi Coding challenge repository] <br>

Original corpus : [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/texts/story.hin.txt source lang-hin] <br>

Translated output : [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/test_pan.txt target lang-pan] <br>

Human translation : [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/pan_original_translation target lang-pan (human)] <br>

Results : Source - Hindi, Target - Punjabi (evaluator output included in repo)<br>

(to be checked and revised, since the WER and PER before and after removing unknown words remain the same, even though the error of not identifying unrecognised words was fixed after consulting @TinoDidriksen) <br>

WER achieved : 15.30 % <br>

PER achieved : 15.03 %

Currently I'm working on finishing my list of the errors I could find in the existing files (see Section 4.7 : Current state of dictionaries). Once this is complete, I'll go ahead exploring and discussing the AnnCorra scheme for covering some of these ([http://docshare01.docshare.tips/files/20536/205364421.pdf link to paper]). This scheme captures dependency relations in much more detail than UD (Universal Dependencies). (See Section 4.6 for details on why it's required.) While I'm more than familiar with AnnCorra, I'll have to check how to integrate it into the Apertium pipeline, that is, if the mentors think it is useful. <br>

Once this is complete, I'll finish the compilation of texts from the dumps to get statistical usage of words. I plan to finish all this before the community bonding period is midway, so that I can meet the deliverables as soon as possible and get a chance to contribute to other problems (mostly strengthening my understanding of the hin-eng pair).
== Non-Summer-of-Code plans for the Summer ==

Since I'll be having my college summer vacation for almost the entire duration of the project, I can easily spend 35-40 hours per week on the project. Since the academic schedule might vary a little due to lockdowns for the prevention of COVID-19, I'll be starting work early and covering the problems in the post-application period. I've also kept the workload slightly heavier in the first 2 weeks to cover any unlikely, uncertain extensions in academics that might show up. Even then, I can spend around 20 hours a week in any case (note that this is a very unlikely situation, and even then this period won't last more than a week, since the coursework is already underway online and is expected to be over well before the start of the project).

[[Category:GSoC 2020 student proposals]]