User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi
Project progress can be seen [https://wiki.apertium.org/wiki/User:Pmodi/GSOC_2020_proposal:_Hindi-Punjabi/progress here].
== Contact Information ==

'''Name:''' Priyank Modi<br />
'''Email:''' priyankmodi99@gmail.com<br />
'''Current Designation:''' Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (completing my 6th semester/3rd year in April '20) and a Teaching Assistant for the Linguistics courses listed in the Skills section<br />
'''IRC:''' pmodi<br />
'''Timezone:''' GMT +0530 hrs<br />
'''LinkedIn:''' https://www.linkedin.com/in/priyank-modi-81584b175/ <br />
'''GitHub:''' https://github.com/priyankmodiPM <br />
'''Website:''' https://priyankmodipm.github.io/ <br />
== Why I am interested in Apertium ==

Apertium is an open-source rule-based machine translation system. As an undergraduate researcher at the LTRC lab in IIIT-H, currently working on understanding the nuances of Indian languages and developing systems which improve our analysis of them, machine translation interests me because it is a complex problem serving a very important application, and one that, despite being a recognised problem for years, is still widely considered achievable only through human involvement.

Because Apertium is free/open-source software.<br />

Because its community is strongly committed to under-resourced and minoritised/marginalised languages.<br />

Translating data into other languages, especially low-resource languages, gives the speakers of those languages access to valuable material and can help in several domains, such as education, news and the judiciary. The dictionaries built in the process are crucial for low-resource languages and can even help in building spell checkers.

Because there is a lot of good work done, and being done, in it.<br />

Because it is not only machine translation, but also free resources that can be used for other purposes: e.g. dictionaries, morphological analysers, spellcheckers, etc.<br />

The most striking factor for me is that while recent approaches to MT lean towards neural networks and deep learning, which fall short when it comes to resource-poor languages, Apertium tackles the problem with a rule-based approach. This is beneficial not only for the level of understanding it provides, instead of simply blaming data for poor results; it can actually perform better for low-resource languages (even for the pair I present in this proposal).

A rule-based, open-source tool really helps communities whose language pairs are resource-poor, giving them free translations for their needs, and that is why I want to work on improving it. I want to work with Apertium and GSoC so I can contribute to an important open-source tool while also honing my own skills, and I hope to become part of this amazing community of developers!
== Which of the published tasks are you interested in? What do you plan to do? ==

'''Adopt an unreleased language pair.''' I plan on developing the Hindi-Punjabi language pair in both directions, i.e. '''hin-pan and pan-hin'''. This will involve improving the monolingual dictionaries for both languages, improving the hin-pan bilingual dictionary and writing suitable transfer rules to bring this pair to a releasable state.
== My Proposal ==

=== Mentors/Experienced members in contact ===

Francis Tyers, Hèctor Alòs i Font

=== Brief of deliverables ===

* A morphological dictionary of Punjabi with ~16,000 words

* Improvements (to current rules and word pairs) and additions to the hin-pan bidictionary

* Lexical selection and transfer rules for the pair

* Translators for hin-pan and pan-hin with WER < 20%

* A morphological disambiguator for the pair

I plan on achieving coverage close to that of the [http://wiki.apertium.org/wiki/Hindi_and_Urdu/Work_plan_(GSOC_2014) hin-urd pair]. Ideally, I plan on getting better results than this pair, since I feel enough data is available and, given three months of dedicated work, the predicted results aren't very difficult to achieve.
=== Why Google and Apertium should sponsor it ===

* Both Hindi and Punjabi are widely spoken languages, both by number of speakers and by geographic spread. Despite that, Punjabi in particular has very limited online resources.

* Services like Google Translate give unsatisfactory results when it comes to translating this pair (see the section ''Google Translate: Analysis and comparison''). On the contrary, I was able to achieve close-to-human translation for some sentences using minimal rules and time (see the ''Coding challenge'' section).

* I believe the Apertium architecture is perfectly suited for this pair and can '''replace the current state-of-the-art translator'''.

* This is an important project (it adds diversity to Apertium and to translation systems in general) which requires at least 2-3 months of dedicated work, and it will be an important resource. In addition, since it will be publicly available, it will drive research in vernacular languages, even in my own case (see the ''Skills'' section).

* To my knowledge, very few attempts have been made, even inside Apertium, at translation for Indian languages, the prime one headed by my lab, LTRC IIIT Hyderabad (not covering the hin-pan pair specifically). Even that project has been losing activity recently and has some issues in its pipeline. Since these languages have a good number of speakers but few easily available online resources, I think it is important to work on them, given the detailed morphological analysis Apertium dictionaries offer in addition to providing a great translation tool.
=== How and who it will benefit in society ===

The Apertium community is strongly committed to under-resourced and minoritised/marginalised languages, and Google helps in its own way via programs like GSoC and GCI. There exists a good amount of vernacular literature and scripture which could be circulated to a much larger group of people if this project is successful. It will also be an important open-source dictionary resource for both languages. My larger aim with this project is to develop a chain of pairs covering Indian languages. Since Urdu and Punjabi share their roots, at least one more pair can then be developed with minimal effort. My goal in this project will also be to properly document my design choices so that new Indic pairs can be taken up easily in subsequent years. I plan on working towards the Hindi-English pair next year, since by then I will have a good understanding of the architecture a cross-language-family pair uses.
=== Google Translate : Analysis and comparison ===

Google Translate provides an interface to translate the pair in question. I have analysed the results of Google's translation into Punjabi. The numerical results, computed on a small set of sentences from the coding challenge (a human translation, reviewed by 3 annotators, is also available in the repo), are given below (source-target):

* hin-pan: 79.23% WER

* hin-pan: 56.56% PER

* pan-hin: 82.23% WER

* pan-hin: 57.83% PER

The results are simply poor, especially when it comes to longer sentences with less frequently used words. It is rather easy to see that Google Translate doesn't try to capture the case or tense in sentences, but rather picks the most commonly used form of a given root. NER is very limited, in contrast to the Apertium module, which captures it well (because of its 3-stage transfer mechanism, I believe). The use of English as an intermediate language (which seems to be the case here as well, because some words translate to English and fail to convert to Punjabi, maybe because of some errors in parsing, as pointed out by Hèctor) causes problems such as the incorrect choice of tense in verbs, the wrong choice or disappearance of some pronouns, and the inability to handle copula constructions as well as verbal clauses (especially when other words occur between two sub-clauses). Here is an example of some of these from the Hindi test text:
Original source text (Hindi):

<blockquote>
'''गिरजा आज फिर उस औरत को साथ लाया था.वही दुबली पतली मोटी-मोटी आंखें तीखी नाक और सांवले रंग वाली औरत. <br>'''

''Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.''
</blockquote>
Google translation (Punjabi):

<blockquote>
'''ਚਰਚ ਨੇ todayਰਤ ਨੂੰ ਅੱਜ ਵਾਪਸ ਲਿਆਇਆ, ਉਹੀ ਪਤਲੀ womanਰਤ ਜਿਹੜੀ ਸੰਘਣੀ ਅੱਖਾਂ, ਤਿੱਖੀ ਨੱਕ ਅਤੇ ਹਨੇਰਾ ਰੰਗ. <br>'''

''The Church brought back todayਰਤ today. The same thin womanਰਤ which big-eyed, pointy nose and dark colored.''

<br> <br>

==== Note ====

Girija got translated to Church, although it was used as a named entity in this case (''Girija Ghar'', where ''ghar'' means 'house', is the Hindi and Punjabi word for 'church'). This is a good example of how poor the NER is: even though the NE occurs in the subject (nominal) position, the parser fails to capture it.
</blockquote>
Translation achieved using the Apertium model (Punjabi):

<blockquote>
'''ਗਿਰਜਾ ਅਜ੍ਜ ਫਿਰ ਉਸ ਔਰਤ ਨੂੰ ਨਾਲ ਲਾਇਆ ਸੀ.ਉਹੀ ਦੁਬਲੀ ਪਤਲੀ ਮੋਟੀ-ਮੋਟੀ ਅੱਖਾਂ ਤਿਖੀ ਨੱਕ ਅਤੇ ਸਾਉਲੇ #ਰਂਗ ਵਾਲੀ ਔਰਤ. <br>'''

''Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.''
</blockquote>
It is not difficult to see that most translations provided by Google Translate lead to a change in meaning. This is due to the following:

* Google Translate relies on the n-grams available to it.

* In the case of rarely used words, it fails to translate them and, worse, fails to capture the tense.

* In complex sentences, the chunking (stages 1 and 2 as per the Apertium model) fails, leading to a failure to capture meaning and, very often, to generate any syntactically correct sentence at all.

It should be added that, although Google's translations tend to be more phraseological than the ones obtained by rules, they are also much more difficult to post-edit. The reason is that, while rule-based translation often makes evident and even expected errors, the neural translation significantly changes the text: reordering parts of the sentence, removing or adding words, changing singular to plural or plural to singular (!), and modifying expressions. Evaluating whether the meaning is the same as the original requires a lot more time.
=== Implementation choices ===

* '''3-stage transfer:''' I plan on using the 3-stage transfer similar to hin-urd, since Hindi and Punjabi are (very) similar, especially when it comes to syntax and even morphology.

* '''Clean and consistent practices:''' As mentioned in the doc as well, paradigms will be defined such that a word's actual root form is always used. What I mean by this is that if a word 'abc' takes certain inflections and its forms are [abd, abde, abcf], then a forced pairing for the pseudo-root 'ab' won't be created. This seems obvious, but the opposite has been done in the current dictionaries, and the actual reason behind that choice will need to be verified.

* '''AnnCorra dependencies:''' In cases where the same word can have different translations and POS, and syntactic information is not enough, universal dependencies will be sought. I plan to incorporate AnnCorra dependencies here, since these capture much more information and clear a lot of ambiguities ([http://docshare01.docshare.tips/files/20536/205364421.pdf link to paper]).

* '''Manual disambiguation:''' For verbs specifically, entries will be checked manually as much as possible, since their tendency to shift from normal behaviour is much greater than for any other category.

* '''Transliteration:''' For borrowed words and named entities (at least single-word NEs), transliteration will be used. This shouldn't be a problem for this pair, since the two languages are very similar (importantly, in phonemic inventory) and have phonemic orthographies.

* '''WX notation:''' I also plan on adding the WX notation for all words in the bidix, similar to what has been done in the urd-hin bidix, so it's easier to understand for developers who can't read the script. This will be crucial for future work involving Indian languages.
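As an illustration of the WX convention, a bidix entry could carry its WX transliteration as a comment. This entry is hypothetical and only modelled on the urd-hin style; the actual convention will follow whatever urd-hin uses:

```xml
<!-- Hypothetical hin-pan bidix entry for 'house'; the trailing comment
     gives the WX transliteration of the Hindi side so that developers
     who cannot read Devanagari/Gurmukhi can still follow the entry. -->
<e><p><l>घर<s n="n"/><s n="m"/></l><r>ਘਰ<s n="n"/><s n="m"/></r></p></e> <!-- Gara "house" -->
```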
=== Current state of dictionaries ===

A released module already exists for Hindi (as part of the urd-hin pair). However, there still exist a lot of anomalies in the Hindi monolingual dictionary. I've compiled a preliminary list of some of these [https://docs.google.com/document/d/1GmBIlGVxMinhJJVZWnVWLHTs8vvKAuxA0vCLBDOyBJY/edit?usp=sharing here]. I plan on manually going through all the stems and finishing this list. This will also help me in understanding certain choices and will be useful during the community bonding period. Apart from this, the existing hin-pan bidictionary also needs massive improvement. The first step of this project will be to revise these lists of issues and come up with a sustainable solution. It will be crucial that the changes made, especially to the Hindi monolingual dictionary, do not negatively affect the urd-hin pair (or the Hindi-Bengali, Hindi-Marathi and Hindi-Gujarati pairs, which also have a little work done on them).
The problems in the current dictionaries include:

* Multiple unnecessary analyses. '''Fix -''' keep only the first analysis and add the others, if required, using <e r="RL">

<pre>
<pardef n="गलत__adj">
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="nom"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="obl"/></r></p></e>
<e><p><l></l><r><s n="adj"/><s n="mfn"/><s n="sp"/></r></p></e>
</pardef>
</pre>
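A sketch of this fix for the paradigm above, assuming the standard dix direction restriction where <e r="RL"> entries are compiled into the generator only: the first analysis stays bidirectional, and the remaining tag combinations stay generable without producing extra analyses.

```xml
<pardef n="गलत__adj">
  <!-- Bidirectional: the only analysis the analyser will return -->
  <e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
  <!-- Generation-only: these tag combinations can still be generated,
       but no longer show up as extra analyses -->
  <e r="RL"><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
  <e r="RL"><p><l></l><r><s n="adj"/><s n="mfn"/><s n="sp"/></r></p></e>
  <!-- ... remaining gender/number/case combinations likewise ... -->
</pardef>
```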
* Inflections added separately. '''Fix -''' stick consistently to adding inflections to the root in a single definition

<pre>
<e><p><l>जब<s n="adv"/></l><r>ਜਦ<s n="adv"/></r></p></e>
<e><p><l>जब<s n="adv"/></l><r>ਜਦੋ<s n="adv"/></r></p></e>
<e><p><l>जब<s n="adv"/></l><r>ਜਦੋਂ<s n="adv"/></r></p></e>
</pre>
* Multiple translations of the same word (in the bidix). While this is fine when going from right to left, it's not intuitive which definition is picked when translating from left to right. '''Fix -''' add some extra flag/comment

<pre>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਅੱਖਰ<s n="n"/><s n="m"/></r></p></e>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>-<s n="n"/><s n="m"/></r></p></e>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਸ਼ਬਦਾਂਗ<s n="n"/><s n="m"/></r></p></e>
</pre>
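One way to make the default explicit is via the standard dix direction marks rather than a comment (a sketch; which translation should be the hin-pan default still needs statistical confirmation):

```xml
<!-- Default: used in both directions, so hin-pan always picks it -->
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਅੱਖਰ<s n="n"/><s n="m"/></r></p></e>
<!-- Alternative: r="RL" means it is only applied right-to-left, so
     ਸ਼ਬਦਾਂਗ still maps back to अक्षर in pan-hin but is never chosen in hin-pan -->
<e r="RL"><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਸ਼ਬਦਾਂਗ<s n="n"/><s n="m"/></r></p></e>
```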
* Inconsistent/statistically incorrect pairs. '''Fix -''' statistical and manual disambiguation

<pre>
<e><p><l>एवं<s n="cnjcoo"/></l><r>ਤੇ<s n="cnjcoo"/></r></p></e>
<e><p><l>और<s n="cnjcoo"/></l><r>ਅਤੇ<s n="cnjcoo"/></r></p></e>
<e><p><l>खुद<s n="prn"/></l><r>ਖੁਦ<s n="prn"/></r></p></e>
<e><p><l>ख़ुद<s n="prn"/></l><r>ਖ਼ੁਦ<s n="prn"/></r></p></e>
<e><p><l>तुम<s n="prn"/></l><r>ਤੂੰ<s n="prn"/></r></p></e>
</pre>
* Incomplete morphological analysis (not all forms added). '''Fix -''' statistical and manual fixes, involving comparisons and additions

<pre>
<e><p><l>हम<s n="prn"/></l><r>ਅਸੀਂ<s n="prn"/></r></p></e>
</pre>
'''Why haven't I fixed these yet?''' Some of these are clearly errors, but I first wanted to know why they exist in the first place and how the dictionaries have been compiled so far. <strike>I haven't been able to get in touch with Francis lately, neither through IRC nor mail, but plan on finishing this asap (once the list is complete, I can raise an issue and work on my fork simultaneously).</strike> Hèctor also pointed out to me that this makes morphological disambiguation harder but probably makes transfer easier, so I want to confirm this first. As suggested by Francis, I'll hopefully be fixing these issues in the next PR (expected by 15 April), but I'll also start learning the Urdu alphabet and checking whether these changes affect the urd-hin pair in any way.
=== Resources ===

[to be added - under confirmation for public use] <br>
[https://hi.wiktionary.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A5%8D%E0%A4%B7%E0%A4%A8%E0%A4%B0%E0%A5%80:%E0%A4%AA%E0%A4%82%E0%A4%9C%E0%A4%BE%E0%A4%AC%E0%A5%80-%E0%A4%B9%E0%A4%BF%E0%A4%A8%E0%A5%8D%E0%A4%A6%E0%A5%80_%E0%A4%B6%E0%A4%AC%E0%A5%8D%E0%A4%A6%E0%A4%95%E0%A5%8B%E0%A4%B6_%E0%A4%85_%E0%A4%B8%E0%A5%87_%E0%A4%94 Hindi-Punjabi Dictionary - Wiktionary] <br>
[https://glosbe.com/pa/hi Punjabi-Hindi dictionary - Glosbe] (awaiting confirmation) <br>
[https://pa.wikipedia.org/wiki/%E0%A8%AE%E0%A9%81%E0%A9%B1%E0%A8%96_%E0%A8%B8%E0%A8%AB%E0%A8%BC%E0%A8%BE Punjabi Articles - Wikipedia] <br>
[https://pa.wiktionary.org/wiki/%E0%A8%AE%E0%A9%81%E0%A9%B1%E0%A8%96_%E0%A8%AA%E0%A9%B0%E0%A8%A8%E0%A8%BE Punjabi Dictionary - Wiktionary] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/April_2007/pa/ Wikidumps-punjabi 1] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/2008-06/pa/ Wikidumps-punjabi 2] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/September_2007/pa/ Wikidumps-punjabi 3] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/2008-06/hi/ Wikidumps-hindi 1] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/April_2007/hi/ Wikidumps-hindi 2] <br>
[https://dumps.wikimedia.org/other/static_html_dumps/September_2007/hi/ Wikidumps-hindi 3]
=== Workplan ===

{| class="wikitable"
! Period
! Dates
! Tasks
! style="width: 13%" | Bidix size
! style="width: 13%" | WER
! style="width: 13%" | Coverage
|-style="background-color:#dbfedb;"
| style="text-align:center" | Post Application Period
|
* START: April 6th
* END: May 3rd
|
* List and discuss implementation choices of hin-pan bidix and urd-hin pair
|
|
|
|-style="background-color:#b3ffb3;"
| style="text-align:center" | Community Bonding Period: Closed Categories
|
* START: May 4th
* END: May 24th
|
|
|
|
|-style="background-color:#aaffaa;"
| style="text-align:center" | Community Bonding Period: Adjectives
|
* START: May 25th
|
|
|
|
|-style="background-color:#91f991;"
| style="text-align:center" | Week ONE: Verbal Paradigms
|
* START: June 1st
|
* Expanding bilingual dictionary
* Lexical selection rules for verbs
* testvoc: adj, adv
| style="text-align:center" | ~ 3,000
|
|
|-style="background-color:#82fa82;"
| style="text-align:center" | Week TWO: Dictionary Expansion
|
* START: June 8th
|
* Expanding bilingual dictionary
* Lexical selection rules
| style="text-align:center" | ~ 5,000
|
|
|-style="background-color:#64ff64;"
| style="text-align:center" | Week THREE: Dictionary Expansion
|
* START: June 15th
|
* Expanding bilingual dictionary
* Lexical selection rules
| style="text-align:center" | ~ 6,500
| style="text-align:center" | < 25% (hin-pan)
| style="text-align:center" | > 65% (hin-pan) <br> > 60% (pan-hin)
|-style="background-color:#3dff3d;"
| style="text-align:center" | Week FOUR: More work on verbs and testing
|
* START: June 15th
|
* Expanding bilingual dictionary
* Lexical selection rules
* Manual disambiguation of rules, hin-pan (src-trg)
| style="text-align:center" | ~ 7,500
|
|
|-style="background-color:#13ff13;"
| style="text-align:center" | Week FIVE: Focus on Nouns
|
* START: June 22nd
|
|
|
|
|-style="background-color:#00ee00;"
| style="text-align:center" | Week SIX: Expanding Dictionaries
|
* START: June 29th
|
* Lexical selection rules
'''First Evaluation (June 29th - July 3rd)'''
| style="text-align:center" | ~ 10,500
|
|
|-style="background-color:#00df00;"
| style="text-align:center" | Week SEVEN: Expanding Dictionaries
|
* START: July 6th
|
* Expanding bilingual dictionary
* Lexical selection rules
* Manual disambiguation of rules (pan-hin)
| style="text-align:center" | ~ 12,000
|
|
|-style="background-color:#00d700;"
| style="text-align:center" | Week EIGHT: Transfer rules (hin-pan)
|
* START: July 13th
|
* Expanding bilingual dictionary
* Lexical selection rules
* Transfer rules (hin-pan)
| style="text-align:center" | ~ 13,000
|
|
|-style="background-color:#00b600;"
| style="text-align:center" | Week NINE: Transfer rules
|
* START: July 20th
|
* Expanding bilingual dictionary
* Lexical selection rules
* Transfer rules: pan-hin
| style="text-align:center" | ~ 14,000
|
|
|-style="background-color:#009d00;"
| style="text-align:center" | Week TEN
|
* START: July 27th
|
* Expanding bilingual dictionary
* Lexical selection rules
'''Second Evaluation (July 27th - July 31st)'''
| style="text-align:center" | ~ 15,000
| style="text-align:center" | < 20% (hin-pan) <br> < 25% (pan-hin)
| style="text-align:center" | > 82% (hin-pan) <br> > 77% (pan-hin)
|-style="background-color:#008a00;"
| style="text-align:center" | Week ELEVEN
|
* START: August 3rd
|
* Expanding bilingual dictionary
* Lexical selection rules
* Disambiguation rules
* Transfer rules
| style="text-align:center" | ~ 16,000
|
|
|-style="background-color:#007e00;"
| style="text-align:center" | Week TWELVE: Testvoc
|
* START: August 10th
* END: August 16th
|
* Testvoc hin-pan
* Expanding bilingual dictionary
* Add rules, words
| style="text-align:center" | ~ 16,500
|
|
|-style="background-color:#006d00;"
| style="text-align:center" | Week THIRTEEN: Finishing up
|
* START: August 17th
* END: August 23rd
|
* Testvoc pan-hin
* Expanding bilingual dictionary
* Add rules, words
'''PERSONAL CODE FREEZE: August 22nd'''
| style="text-align:center" | ~ 17,000
|
|
|-style="background-color:#005a00;"
| style="text-align:center" | Week FOURTEEN: Review
|
* START: August 24th
* END: August 30th
|
* Review and documentation
* Expanding bilingual dictionary
* Lexical selection rules
'''Final evaluation (August 24th - August 31st)'''
| style="text-align:center" | ~ 17,000
| style="text-align:center" | ~ 15% (hin-pan) <br> < 20% (pan-hin)
| style="text-align:center" | ~ 90% (hin-pan) <br> ~ 83% (pan-hin)
|}
== Skills == |
== Skills == |
||
I'm currently a third year( |
I'm currently a third year(concluding in early April '20 ) student at IIIT Hyderabad where I'm studying Computational Linguistics. It is a dual degree course where we study Computer Science, Linguistics, NLP and more. I am also a teaching assistant for courses on Language Typology, Universals and Historical Linguistics this semester(have TA'd for courses on NLP last semester), so I understand linguistic concepts very well along with the handling of linguistic data. |
||
I've been interested in linguistics from the very beginning and due to the rigorous programming courses, I'm also adept at several programming languages like Python, C++, XML, Bash Scripting, etc. I'm skilled in writing Algorithms. Data Structures, and Machine Learning Algorithms as well. |
I've been interested in linguistics from the very beginning and due to the rigorous programming courses, I'm also adept at several programming languages like Python, C++, XML, Bash Scripting, etc. I'm skilled in writing Algorithms. Data Structures, and Machine Learning Algorithms as well. |
||
I also have a lot of experience studying and generating data, which I feel is especially important for the problem described in this proposal. My paper '''Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus''' was recently accepted at the 16th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at '''LREC 2020''' and at WILDRE-5 (also LREC 2020). The project presented in the paper provides the largest dataset for event detection in Hindi. I am working on extending it to Punjabi using transfer learning. ([https://sigsem.uvt.nl/isa16/ ISA list of accepted papers], [https://www.researchgate.net/publication/340266259_Hindi_TimeBank_An_ISO-TimeML_Annotated_Reference_Corpus Link to paper])
I am also closely involved with the committee conducting the Asia-Pacific Linguistics Olympiad (which holds a camp and mentors and prepares students for the International Linguistics Olympiad), helping with its organisation and judging.
Due to the focused nature of our courses, I have worked on several projects, such as anaphora resolution systems, abstractive summarisers (using pointer-generators, hierarchical attention and transformers), POS taggers, named entity recognisers, simple Q-A systems and a Linux-based shell, all of which required a working understanding of Natural Language Processing and scripting. Some of these projects aren't available on GitHub because of privacy settings but can be provided if required.
I am fluent in English, Hindi and Punjabi.
== Coding challenge ==
I've completed the coding challenge for translation from Hindi into Punjabi. You can find my work here: [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi Coding challenge repository] <br>
Original corpus : [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/texts/story.hin.txt source lang-hin] <br>
Translated output : [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/test_pan.txt target lang-pan] <br>
Human Translation : [https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/pan_original_translation target lang-pan(human)] <br>
Results : Source - Hindi, Target - Punjabi (evaluator output included in the repo)<br>
(to be checked and revised: WER and PER remain the same before and after removing unknown words, even though the failure to identify unrecognised words was fixed after consulting @TinoDidriksen) <br>
WER achieved : 15.30 % <br>
PER achieved : 15.03 %
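For reference, the two metrics reported above follow standard definitions: WER is word-level edit distance divided by the reference length, and PER is a position-independent variant that ignores word order. The sketch below illustrates those definitions only; it is not the evaluator script actually used to produce the numbers above.

```python
# WER: word-level Levenshtein distance over the reference length.
# PER: position-independent variant (one common formulation; exact
# definitions vary between evaluation tools).
from collections import Counter

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance.
    d = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,           # deletion
                                   d[j - 1] + 1,       # insertion
                                   prev + (rw != hw))  # substitution
    return d[len(hyp)] / len(ref)

def per(reference: str, hypothesis: str) -> float:
    # 1 minus the bag-of-words overlap ratio with the reference.
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    return 1 - sum((ref & hyp).values()) / sum(ref.values())
```

Note that PER is never larger than WER for the same sentence pair, since reordering errors stop counting.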
Currently I'm working on finishing my list of the errors I could find in the existing files (see Section 4.7 : Current state of dictionaries). Once this is complete, I'll go ahead exploring and discussing the AnnCorra scheme for covering some of these ([http://docshare01.docshare.tips/files/20536/205364421.pdf link to paper]). This scheme captures dependency relations in much more detail than UD (Universal Dependencies); see Section 4.6 for details on why it's required. While I'm more than familiar with AnnCorra, I'll have to check how to integrate it into the Apertium pipeline, if the mentors think it is useful. <br>
Once this is complete, I'll finish compiling texts from the dumps to get statistics on word usage. I plan to finish all this before the community bonding period is halfway through, so that I can meet the deliverables as soon as possible and get a chance to contribute to other problems (mostly strengthening my understanding of the hin-eng pair).
== Non-Summer-of-Code plans for the Summer ==
Since my college summer vacation covers almost the entire duration of the project, I can easily spend 35-40 hours per week on it. Because the academic schedule might shift slightly due to COVID-19 lockdowns, I'll start work early and cover the listed problems in the post-application period. I've also kept the workload slightly heavier in the first two weeks to absorb any unlikely extension of the academic term. Even in that case I could still spend around 20 hours a week (a very unlikely situation, and one that would not last more than a week, since coursework is already underway online and is expected to finish well before the project starts).
[[Category:GSoC 2020 student proposals]] |
Latest revision as of 19:32, 4 June 2020
== Which of the published tasks are you interested in? What do you plan to do? ==
Adopt an unreleased language pair. I plan on developing the Hindi-Punjabi language pair in both directions i.e. hin-pan and pan-hin. This'll involve improving the monolingual dictionaries for both languages, the hin-pan bilingual dictionary and writing suitable transfer rules to bring this pair to a releasable state.
== My Proposal ==
=== Mentors/Experienced members in Contact ===
Francis Tyers, Hèctor Alòs i Font
=== Brief of deliverables ===
* A morphological dictionary of Punjabi with ~16,000 words
* Improvements (to current rules and word pairs) and additions to the hin-pan bidictionary
* Lexical selection and transfer rules for the pair
* Translators for hin-pan and pan-hin with WER < 20%
* A morphological disambiguator for the pair

I plan on achieving coverage close to that of the hin-urd pair. Ideally I aim to do better, since enough data is available and, given three months of dedicated work, the predicted results are not difficult to achieve.
=== Why Google and Apertium should sponsor it ===
* Both Hindi and Punjabi are widely spoken languages, by number of speakers and by geographic spread. Despite that, Punjabi in particular has very limited online resources.
* Services like Google Translate give unsatisfactory results on this pair (see Section 4.5 : Google Translate : Analysis and comparison). In contrast, I was able to get close to human translation for some sentences using minimal rules and time (see Section 6 : Coding challenge).
* I believe the Apertium architecture suits this pair perfectly and can replace the current state-of-the-art translator.
* This is an important project (it adds diversity to Apertium and to translation systems in general) which requires at least 2-3 months of dedicated work and will be a valuable resource. Since it will be publicly available, it will also drive research in vernacular languages, including in my own case (see Section 5 : Skills).
* To my knowledge, very few attempts have been made, even inside Apertium, at translation for Indian languages; the main one is headed by my lab, LTRC IIIT Hyderabad (and does not cover the hin-pan pair specifically). Even that project has lost activity recently and has some issues in its pipeline. Since these languages have many speakers but few easily available online resources, I think it is important to work on them, given the detailed morphological analysis Apertium dictionaries offer in addition to a great translation tool.
=== How and who it will benefit in society ===
The Apertium community is strongly committed to under-resourced and minoritised/marginalised languages, and Google helps in its own way via programs like GSoC and GCI. A good amount of vernacular literature and scripture could be circulated to a larger group of people if this project succeeds, and the dictionaries will be an important open-source resource for both languages. My larger aim is to develop a chain of pairs covering Indian languages: since Urdu and Punjabi share their roots, at least one more pair can then be developed with minimal effort. I will also properly document my design choices so that new Indic pairs can be taken up easily in subsequent years. I plan on working towards the Hindi-English pair next year, since by then I'll have a good understanding of the architecture a cross-language-family pair needs.
=== Google Translate : Analysis and comparison ===
Google Translate provides an interface to translate the pair in question. I have analysed its translations into Punjabi. The numerical results (computed on a small set of sentences from the coding challenge; the human translation, reviewed by 3 annotators, is also available in the repo) are given below (source-target):
* hin-pan: 79.23% WER
* hin-pan: 56.56% PER
* pan-hin: 82.23% WER
* pan-hin: 57.83% PER
The results are simply poor, especially on longer sentences with less frequently used words. It is easy to see that Google Translate does not try to capture case or tense in sentences, but rather picks the most commonly used form of a root. NER is very limited, in contrast to the Apertium module, which captures it well (because of its 3-stage transfer mechanism, I believe). The use of English as an intermediate (which seems to be the case here as well, since some words translate to English and fail to convert to Punjabi, perhaps because of parsing errors, as pointed out by Hector) causes problems such as incorrect tense on verbs, wrong choice or disappearance of some pronouns, and the inability to handle copula constructions and verbal clauses (especially when other words occur between two sub-clauses). Here is an example of some of these from the Hindi test text:
Original source text (Hindi):
गिरजा आज फिर उस औरत को साथ लाया था.वही दुबली पतली मोटी-मोटी आंखें तीखी नाक और सांवले रंग वाली औरत.
Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.
Google translation (Punjabi):
ਚਰਚ ਨੇ todayਰਤ ਨੂੰ ਅੱਜ ਵਾਪਸ ਲਿਆਇਆ, ਉਹੀ ਪਤਲੀ womanਰਤ ਜਿਹੜੀ ਸੰਘਣੀ ਅੱਖਾਂ, ਤਿੱਖੀ ਨੱਕ ਅਤੇ ਹਨੇਰਾ ਰੰਗ.
The Church brought back todayਰਤ today. The same thin womanਰਤ which big-eyed, pointy nose and dark colored.
==== Note ====
Girija got translated to Church, although it was used as a named entity in this case (Girija Ghar, where Ghar means 'house', is the Hindi and Punjabi translation for church). This is a good example of how poor the NER is: even though the NE occurs in the subject (nominal) position, the parser fails to capture it.
Translation achieved using the Apertium model (Punjabi):
ਗਿਰਜਾ ਅਜ੍ਜ ਫਿਰ ਉਸ ਔਰਤ ਨੂੰ ਨਾਲ ਲਾਇਆ ਸੀ.ਉਹੀ ਦੁਬਲੀ ਪਤਲੀ ਮੋਟੀ-ਮੋਟੀ ਅੱਖਾਂ ਤਿਖੀ ਨੱਕ ਅਤੇ ਸਾਉਲੇ #ਰਂਗ ਵਾਲੀ ਔਰਤ.
Girija brought that woman with him again today. The same thin, big-eyed, pointy nosed and dusky woman.
It is not difficult to see that most translations provided by Google Translate lead to a change in meaning. This is due to the following:
* Google Translate relies on the n-grams available to it.
* For rarely used words, it fails to translate them and, worse, fails to capture the tense.
* In complex sentences, the chunking (stages 1 and 2 of the Apertium model) fails, leading to a failure to capture meaning and, very often, to generate any syntactically correct sentence at all.
=== Implementation choices ===
* 3-stage transfer : I plan on using 3-stage transfer similar to hin-urd, since Hindi and Punjabi are (very) similar, especially in syntax and even morphology.
* Clean and consistent practices : As mentioned in the doc, paradigms will be defined so that the actual root form is always used. That is, if a word 'abc' takes certain inflections and its forms are [abd, abde, abcf], a forced paradigm for 'ab' will not be created. This seems obvious, but it has been done in the current dictionaries, and the actual reason behind that choice needs to be verified.
* AnnCorra dependencies : In cases where the same word can have different translations and POS and syntactic information is not enough, dependencies will be consulted. I plan to incorporate AnnCorra dependencies here, since they capture much more information than Universal Dependencies and clear a lot of ambiguities. ([http://docshare01.docshare.tips/files/20536/205364421.pdf link to paper])
* Manual disambiguation : Verb entries in particular will be checked manually as far as possible, since verbs deviate from regular behaviour more than any other category.
* Transliteration : For borrowed words and named entities (at least single-word NEs), transliteration will be used. This should not be a problem for this pair, since the two languages are very similar (importantly, in phonemic inventory) and both have phonemic orthographies.
* WX notation : I also plan on adding WX notation for all words in the bidix, similar to what has been done in the urd-hin bidix, so it is easier to understand for developers who cannot read the script. This will be crucial for future work involving Indian languages.
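On the transliteration point above: the Devanagari and Gurmukhi Unicode blocks are both ISCII-derived and largely parallel, so a first-pass transliterator can simply shift codepoints by the block offset. The sketch below is a naive illustration of this idea, not a complete module; real use would need an exception table for codepoints that are unassigned or behave differently in Gurmukhi.

```python
# Naive Devanagari -> Gurmukhi transliteration by Unicode block offset.
# The two blocks (U+0900-097F and U+0A00-0A7F) are laid out in parallel,
# so a plain codepoint shift covers most letters. Sketch only: a real
# transliterator needs exceptions for non-parallel codepoints.
DEVANAGARI = range(0x0900, 0x0980)
OFFSET = 0x0A00 - 0x0900  # Gurmukhi block starts 0x100 higher

def transliterate_dev_to_gur(text: str) -> str:
    return "".join(
        chr(ord(ch) + OFFSET) if ord(ch) in DEVANAGARI else ch
        for ch in text
    )

print(transliterate_dev_to_gur("कम"))  # -> ਕਮ
```

The reverse direction is the same shift with the sign flipped, which is what makes transliteration cheap for this particular pair.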
=== Current state of dictionaries ===
A released module already exists for Hindi (as part of the urd-hin pair). However, a lot of anomalies remain in the Hindi mono-dictionary. I've compiled a preliminary list of some of these here. I plan on manually going through all the stems to finish this list, which will also help me understand certain choices and will be useful in the community bonding period. Apart from this, the existing hin-pan bi-dictionary also needs massive improvement. The first step of this project will be to revise these lists of issues and come up with a sustainable solution. It is crucial that the changes, especially to the Hindi mono-dictionary, do not negatively affect the urd-hin pair (or the hindi-bengali, hindi-marathi and hindi-gujarati pairs, which also have a little work done).
The problems in the current dictionaries include:
* Multiple unnecessary analyses. Fix: keep only the first analysis and add others, if required, using <e r="RL">.
<pre>
<pardef n="गलत__adj">
  <e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="m"/><s n="sg"/><s n="obl"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="f"/><s n="sg"/><s n="obl"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="nom"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="nom"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="m"/><s n="pl"/><s n="obl"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="f"/><s n="pl"/><s n="obl"/></r></p></e>
  <e><p><l></l><r><s n="adj"/><s n="mfn"/><s n="sp"/></r></p></e>
</pardef>
</pre>
* Inflections added separately. Fix: stick to consistently adding inflections to the root in a single definition.
<pre>
<e><p><l>जब<s n="adv"/></l><r>ਜਦ<s n="adv"/></r></p></e>
<e><p><l>जब<s n="adv"/></l><r>ਜਦੋ<s n="adv"/></r></p></e>
<e><p><l>जब<s n="adv"/></l><r>ਜਦੋਂ<s n="adv"/></r></p></e>
</pre>
* Multiple translations of the same word (in the bidix). While this is fine going from right to left, it is not intuitive which definition is picked when translating from left to right. Fix: add an extra flag/comment.
<pre>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਅੱਖਰ<s n="n"/><s n="m"/></r></p></e>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>-<s n="n"/><s n="m"/></r></p></e>
<e><p><l>अक्षर<s n="n"/><s n="m"/></l><r>ਸ਼ਬਦਾਂਗ<s n="n"/><s n="m"/></r></p></e>
</pre>
* Inconsistent/statistically incorrect pairs. Fix: statistical and manual disambiguation.
<pre>
<e><p><l>एवं<s n="cnjcoo"/></l><r>ਤੇ<s n="cnjcoo"/></r></p></e>
<e><p><l>और<s n="cnjcoo"/></l><r>ਅਤੇ<s n="cnjcoo"/></r></p></e>
<e><p><l>खुद<s n="prn"/></l><r>ਖੁਦ<s n="prn"/></r></p></e>
<e><p><l>ख़ुद<s n="prn"/></l><r>ਖ਼ੁਦ<s n="prn"/></r></p></e>
<e><p><l>तुम<s n="prn"/></l><r>ਤੂੰ<s n="prn"/></r></p></e>
</pre>
* Incomplete morphological analysis (not all forms added). Fix: statistical and manual fixes, involving comparisons and additions.
<pre>
<e><p><l>हम<s n="prn"/></l><r>ਅਸੀਂ<s n="prn"/></r></p></e>
</pre>
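Issues like multiple translations of the same left-side entry can be flagged automatically before manual review. The following is a sketch assuming the standard .dix XML layout; the function name and the toy usage are my own illustration, not an existing Apertium tool.

```python
# Sketch: flag bidix entries whose left (Hindi) side has more than one
# right (Punjabi) translation, so they can be queued for lexical
# selection or an extra flag/comment. Assumes standard .dix structure:
# <e><p><l>lemma<s n="tag"/>...</l><r>lemma<s n="tag"/>...</r></p></e>
import xml.etree.ElementTree as ET
from collections import defaultdict

def side_key(el):
    """Lemma plus tags of an <l> or <r> element, e.g. 'अक्षर<n><m>'."""
    lemma = el.text or ""
    tags = "".join(f"<{s.get('n')}>" for s in el.findall("s"))
    return lemma + tags

def multiple_translations(dix_path):
    root = ET.parse(dix_path).getroot()
    pairs = defaultdict(set)
    for e in root.iter("e"):
        p = e.find("p")
        if p is None:
            continue  # skip identity/paradigm-only entries
        l, r = p.find("l"), p.find("r")
        if l is not None and r is not None:
            pairs[side_key(l)].add(side_key(r))
    return {left: rights for left, rights in pairs.items() if len(rights) > 1}
```

Running this over the current hin-pan bidix would list every ambiguous entry, e.g. the अक्षर case shown above.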
Why haven't I fixed these yet? Some are clearly errors, but I wanted to first understand why they exist and how the dictionaries have been compiled so far. I haven't been able to get in touch with Francis lately, through IRC or mail, but plan on finishing this as soon as possible (once the list is complete, I can raise an issue and work on my fork simultaneously). Hector also pointed out to me that this makes morphological disambiguation harder but probably makes transfer easier, so I want to confirm this first. As suggested by Francis, I'll fix these issues in the next PR (expected by 15 April), while also starting to learn the Urdu alphabet and checking whether these changes affect the urd-hin pair in any way.
=== Resources ===
[to be added - under confirmation for public use]
* Hindi-Punjabi Dictionary - Wiktionary
* Punjabi-Hindi Dictionary - Glosbe (awaiting confirmation)
* Punjabi Articles - Wikipedia
* Punjabi Dictionary - Wiktionary
* Wikidumps-punjabi 1
* Wikidumps-punjabi 2
* Wikidumps-punjabi 3
* Wikidumps-hindi 1
* Wikidumps-hindi 2
* Wikidumps-hindi 3
=== Workplan ===
{| class="wikitable"
! PHASE !! DURATION !! GOALS OF THE WEEK !! BIDIX !! WER !! Coverage
|-
| Post Application Period || || || || ||
|-
| Community Bonding Period : Closed Categories || || || || ||
|-
| Community Bonding Period : Adjectives || || || || ||
|-
| Week ONE : Verbal Paradigms || || || ~ 3,000 || ||
|-
| Week TWO : Dictionary Expansion || || || ~ 5,000 || ||
|-
| Week THREE : Dictionary Expansion || || || ~ 6,500 || < 25% (hin-pan) || > 65% (hin-pan) <br> > 60% (pan-hin)
|-
| Week FOUR : More work on verbs and testing || || || ~ 7,500 || ||
|-
| Week FIVE : Focus on nouns || || || ~ 9,000 || ||
|-
| Week SIX : Expanding Dictionaries || || First Evaluation (June 29th - July 3rd) || ~ 10,500 || ||
|-
| Week SEVEN : Expanding Dictionaries || || || ~ 12,000 || ||
|-
| Week EIGHT : Transfer rules (hin-pan) || || || ~ 13,000 || ||
|-
| Week NINE : Transfer rules || || || ~ 14,000 || ||
|-
| Week TEN || || Second Evaluation (July 27th - July 31st) || ~ 15,000 || < 20% (hin-pan) <br> < 25% (pan-hin) || > 82% (hin-pan) <br> > 77% (pan-hin)
|-
| Week ELEVEN || || || ~ 16,000 || ||
|-
| Week TWELVE : Testvoc || || || ~ 16,500 || ||
|-
| Week THIRTEEN : Finishing up || || PERSONAL CODE FREEZE : August 22nd || ~ 17,000 || ||
|-
| Week FOURTEEN : Review || || Final evaluation (August 24th - August 31st) || ~ 17,000 || ~ 15% (hin-pan) <br> < 20% (pan-hin) || ~ 90% (hin-pan) <br> ~ 83% (pan-hin)
|}
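The coverage targets in the workplan refer to naive coverage: the fraction of tokens that receive at least one morphological analysis. A sketch of how this can be estimated from Apertium-style analyser output, where unknown words are marked with an asterisk (illustration only, operating on a pre-captured output string rather than a live lt-proc call):

```python
# Naive coverage over morph-analyser output: lexical units look like
# ^surface/analysis1/analysis2$, and unknown words are emitted as
# ^surface/*surface$. Coverage = known units / all units.
import re

def coverage(analysed_text: str) -> float:
    units = re.findall(r"\^([^$]*)\$", analysed_text)
    known = sum(1 for u in units if not u.split("/", 1)[-1].startswith("*"))
    return known / len(units) if units else 0.0
```

Measuring this weekly over the Wikipedia dumps listed under Resources gives the numbers tracked in the table.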