Difference between revisions of "User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi"

Revision as of 12:30, 25 March 2020

Contact Information

Name: Priyank Modi
Email: priyankmodi99@gmail.com
Current Designation: Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (completing 6th semester/3rd year in April '20) and a Teaching Assistant for Linguistics courses
IRC: pmodi
Timezone: GMT +0530 hrs
Linkedin: https://www.linkedin.com/in/priyank-modi-81584b175/
Github: https://github.com/priyankmodiPM

Why I am interested in Apertium

Because Apertium is free/open-source software.
Because its community is strongly committed to under-resourced and minoritised/marginalised languages.
Because there is a lot of good work done and being done in it.
Because it is not only machine translation, but also free resources that can be used for other purposes: e.g. dictionaries, morphological analysers, spellcheckers, etc.

Which of the published tasks are you interested in? What do you plan to do?

Adopt an unreleased language pair. I plan on developing the Hindi-Punjabi language pair in both directions i.e. hin-pan and pan-hin. This'll involve improving the monolingual dictionaries for both languages, the hin-pan bilingual dictionary and writing suitable transfer rules to bring this pair to a releasable state.

My Proposal

Why Google and Apertium should sponsor it

  • Both Hindi and Punjabi are widely spoken languages, both by number of speakers and geographic spread. Despite that, Punjabi especially has very limited online resources.
  • Services like Google Translate give unsatisfactory results when it comes to translation of this pair (see Section 2.1). In contrast, I was able to achieve close to human translation for some sentences using minimal rules and time (see Section 3 : Coding Challenge).
  • I believe the Apertium architecture is suited perfectly for this pair and can replace the current state-of-the-art translator for this pair.
  • This is an important project (since it adds diversity to Apertium and translation systems in general) which requires at least 2-3 months of dedicated work and can be an important resource.

How and who it will benefit in society

As mentioned above, the Apertium community is strongly committed to under-resourced and minoritised/marginalised languages, and Google helps in its own way via programs like GSoC and GCI. There exist many local cultural movements in Africa with the goal of developing language and opening to the world but they generally fail to duel on a scientific basis. This project will definitely mark a starting point or proof of concept in Machine Translation in Cameroon and will have a strongly positive impact on language development.

Google Translate : Analysis and comparison

Google Translate provides an interface to translate the pair in question. I have analysed Google's output for this pair. The numerical results, computed on a small set of sentences from the coding challenge, are given below (source-target). The human translation, which has been reviewed by 3 annotators, is also available in the repo.

  • hin-pan: 14.0% WER
  • pan-hin: 21.6% WER
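
The proposal does not say which tool produced these figures. As a rough illustration of what a WER number measures, here is a minimal Python sketch that computes word error rate as the word-level edit distance between a reference (human) translation and a system output, divided by the reference length. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for this example, not part of the original evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are separated by whitespace; no normalisation
    (punctuation stripping, case folding) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens stand in for real sentences.
    print(word_error_rate("a b c d", "a x c"))
```

A corpus-level figure is usually obtained by summing edit distances and reference lengths over all sentences rather than averaging per-sentence rates; the sketch above covers only a single sentence pair.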

The results are far from wonderful, especially when it comes to longer sentences with less frequently used words. Seemingly, Google translates using both Spanish and English as bridge languages, as can be seen, for example, by words that appear in these two languages in the final text (supposedly in Catalan) and that were not in the original Italian or Portuguese text. The use of English as intermediate between Romance languages causes problems known to all users, such as the translation of p2.pl verb forms with elided subject to p2.sg, the incorrect choice of past times in the verbs and the disappearance of some pronouns. Here is an example of the last case of the Italian test text (randomly obtained):

Original text (bold mine):

altri invece ne hanno apprezzato la spontaneità, la tenacia e l'affettuosità

Google translation:

altres han apreciat la seva espontaneïtat, tenacitat i afecte

Post-edited translation:

altres n'han apreciat l'espontaneïtat, tenacitat i afecte

It should be added that, although Google translations tend to be more phraseological than the ones obtained by rules, they are also much more difficult to post-edit. The reason is that, while the translation by rules often makes evident and even expected errors, the neural translation significantly changes the text, reordering parts of the sentence, removing or adding words, changing singular to plural or plural to singular (!), and modifying expressions. Evaluating whether the meaning is the same as the original requires a lot more time. This became quite clear when I post-edited both the Apertium and Google translations for the Italian and Portuguese texts.

Current state of dictionaries

A released module already exists for Hindi (as part of the urd-hin pair). However, there are still a lot of anomalies in the Hindi mono-dictionary. I've compiled a preliminary list of some of these here[insert link]. Apart from these, the existing hin-pan bi-dictionary also needs massive improvement. The first step of this project will be to revise these lists of issues and come up with a sustainable solution. It'll be crucial that the changes made, especially to the Hindi mono-dictionary, do not negatively affect the urd-hin pair (or the Hindi-Bengali, Hindi-Marathi and Hindi-Gujarati pairs, which also have a little work done).

Resources

[to be added - under confirmation for public use]

Workplan

PHASE DURATION GOALS OF THE WEEK BIDIX WER Coverage
COMMUNITY BONDING PERIOD
  • START:April 26th
  • END:May 17th
  • List and discuss implementation choices of hin-pan bidix and urd-hin pair
  • Reading up on the details of transfer rules (whether or not a 3-stage transfer is the best way for this pair) and assigning weights
  • Finding language resources
  • Making frequency lists for the language pair (Hindi-Punjabi)
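
A frequency list of the kind mentioned above could be built along the following lines. This is a minimal Python sketch under stated assumptions: the corpus is plain text, tokens are separated by whitespace, and the helper name `frequency_list` is hypothetical. Tokens are split on whitespace rather than with a `\w+` regular expression, because Devanagari and Gurmukhi vowel signs are combining marks that `\w` does not match, which would break words apart.

```python
import string
from collections import Counter
from typing import Iterable, List, Tuple

# ASCII punctuation plus the danda and double danda used in Hindi and Punjabi.
PUNCTUATION = string.punctuation + "।॥"


def frequency_list(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (word, count) pairs sorted from most to least frequent."""
    counts: Counter = Counter()
    for line in lines:
        for token in line.split():
            # Strip punctuation attached to the start or end of a token.
            word = token.strip(PUNCTUATION)
            if word:
                counts[word] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Placeholder tokens; in practice, pass an open corpus file instead.
    for word, count in frequency_list(["a b a", "b a c."]):
        print(count, word)
```

Sorting words by frequency helps decide which missing dictionary entries to add first, since the most frequent unknown words contribute most to coverage.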
Week ONE : CLOSED CATEGORIES
  • START:May 18th
  • END:May 24th
  • Function words.
  • Transfer rules for post-positions
Week TWO : Adjectives
  • START:May 25th
  • END:May 31st
  • Punjabi monodictionary : adjectives
  • Expanding bilingual dictionary
  • Lexical selection rules for adj
WEEK THREE: Templating and Object Handling for Lttoolbox module
  • START:May 28th
  • END:June 3rd
  • In order to create wrappers, one has to tell SWIG to create wrappers for a particular template instantiation. Hence all the templates have to be explicitly declared for the specific data types being manipulated in them.
  • C++ reference-counted objects: referencing and dereferencing of objects has to be handled carefully so that no errors occur; this is another place where SWIG isn’t smart enough.
  • Handling C++ overloaded functions: overloading support is not quite as flexible as in C++. Sometimes there are methods that SWIG can't disambiguate; if such errors appear, they have to be taken care of manually in the interface file of the wrapper.
  • Third version of the wrapper with all functions importable from python.
WEEK FOUR: Testing and improving cross language polymorphism, Making the module more Pythonic, Exception Handling
  • START:June 4th
  • END:June 10th
  • Implement director classes: by default, no mechanism exists to pass method calls down the inheritance chain from C++ to Python. In particular, if a C++ class has been extended in Python, these extensions will not be visible from C++ code. Virtual method calls from C++ are thus not able to access the lowest implementation in the inheritance chain. SWIG implements a feature called directors. The job of the directors is to route method calls correctly, either to C++ implementations higher in the inheritance chain or to Python implementations lower in the inheritance chain.
  • Writing C++ helper functions: sometimes the SWIG module misses bits of functionality because there is no easy way to construct and manipulate a suitable datatype; for those cases C++ helper functions need to be written.
  • Writing high-level Python functions to provide a high-level Python interface built on top of the low-level helper functions.
  • Error handling: if C++ throws an error, it is better to convert it into a Python exception.
  • Fourth and final version with input functions and all helper functions written in python.
WEEK FIVE: Apertium setup
  • START:June 11th
  • END:June 17th
  • Ref Week 1(***)
  • First version of the apertium module that is python importable
WEEK SIX: Variable handling in SWIG for Apertium module
  • START:June 18th
  • END:June 24th
  • Ref Week 2
  • Second version of the apertium wrapper
WEEK SEVEN: Templating and Object Handling for Apertium module
  • START:June 25th
  • END:July 1st
  • Ref Week 3
  • Third version of the apertium wrapper with all functions importable from python
WEEK EIGHT: Testing and improving cross language polymorphism, Making the module more Pythonic, Exception Handling
  • START:July 2nd
  • END:July 8th
  • Ref Week 4
  • Fourth and final version with input functions and all helper functions written in python.
WEEK NINE: Extensive alpha testing of modules built
  • START:July 9th
  • END:July 15th
  • Testing the modules built by writing unit-tests for the functions in the modules
  • Starting the documentation of the modules. Since there are a lot of functions and the way SWIG deals with Python is a little different from raw Python, proper documentation of all the modules and their usage is needed.
  • Version 1 Documentation written
  • Tests written for the lttoolbox module
WEEK TEN: Finishing Documentation
  • START:July 16th
  • END:July 22nd
  • Finishing the documentation of the module
  • Distribute for Beta testing, so that end users validate the usability, functionality, compatibility, and reliability
  • Tests written for apertium module
  • Documentation version 2.
WEEK ELEVEN: Beta testing and changes(if any)
  • START:July 23rd
  • END:July 29th
  • Taking reviews of beta testing and implementing changes if any.
  • Review-fix version of the wrapper release
WEEK TWELVE: Deciding on the library structure and making module pip installable
  • START:July 30th
  • END:August 5th
  • Making the super wrapper for the modules.
  • Making the module pip installable by writing scripts and uploading to PyPI
  • Update Documentation
  • One wrapper with the 2 created wrappers inside it
  • Pip Installable module
WEEK THIRTEEN: Final reviews and bug report
  • START:August 6th
  • END:August 14th
  • Analyse and make bug report for the bugs in the code.
  • Make Final documentation
  • Release Final Module
  • Final Release of the wrapper.

(***)The tasks are similar to the tasks of the referenced week

Skills

I'm currently a third year(commencing start of April '20 hopefully :D ) student at IIIT Hyderabad where I'm studying Computational Linguistics. It is a dual degree course where we study Computer Science, Linguistics, NLP and more.

I've been interested in linguistics from the very beginning and, thanks to the rigorous programming courses, I'm also adept at several languages and tools such as Python, C++, XML and Bash scripting. I'm skilled in algorithms, data structures and machine learning algorithms as well.

I also have a lot of experience studying and generating data, which I feel is essential in solving any problem, especially the one mentioned in this proposal. My paper 'Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus' was recently accepted at the 16th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at LREC 2020. I am working on extending the same for Punjabi using transfer learning.

Due to the focused nature of our course, I have worked on several projects, such as building Anaphora Resolution systems, Abstractive Summarizers (using pointer-generators, hierarchical attention and transformers), POS Taggers, Named Entity Recognisers, simple Q-A systems, etc., all of which required a working understanding of Natural Language Processing. Some of these projects aren't available on GitHub because of privacy settings but can be provided if required.

I am fluent in English, Hindi and Punjabi.

Coding challenge

I've completed the coding challenge for translation from Hindi into Punjabi. You can find my work here : Original corpus(source lang-hin) - Translated output(target lang-pan) - Human Translation(pan) -

Non-Summer-of-Code plans for the Summer

Since I'll be on my college summer vacation for almost the entire duration of the project, I can easily spend 30-40 hours per week on it. Since the academic schedule might vary a little due to the lockdowns for prevention of COVID-19, I'll start work early and cover the problems in the post-application period. I've also kept the workload slightly heavier in the first 2 weeks to make up for any unlikely, uncertain extensions in academics that might show up. Even then, I can spend around 20 hours a week in any case (note that this is a very unlikely situation, and this period won't last more than a couple of weeks since the coursework is already underway online and is expected to be over before the start of the project).