Difference between revisions of "User:Venkat1997/proposal"

From Apertium

Revision as of 08:26, 1 April 2017

Contact Information

Name: Venkat Parthasarathy

E-Mail: venkat.p1997@gmail.com

IRC: venkat

Why am I interested in machine translation?

I come from India, which has 22 scheduled languages and where almost every state speaks a different language. From a young age, I was interested in translation between languages because it would make traveling between states much easier in a country like India. I was introduced to machine translation by an ML course at my college, and I was fascinated by how the complex process of translation was neatly encapsulated in probabilistic models. Recently, I also read an article on how Google had vastly improved its translation quality by applying artificial neural networks to machine translation. Upon reading about Google's work, I became even more excited about this field and its far-reaching applications. I wanted to get involved in a project that would help me understand more about this subject. I believe that the project I have chosen for GSoC will help me achieve that goal, as I would be working directly on one of the most important parts of Apertium's machine translation pipeline: generating lexical selection rules. As a computer science student, I would also find it a good exercise in programming and software engineering practices.

Why is it that I am interested in Apertium?

Apertium is one of the few translation platforms that has both a helpful community and detailed documentation. It allows users to understand what actually happens in machine translation. For example, I spent some time on IRC discussing with Unhammer how to add custom lexical rules and evaluate their performance. He promptly explained what was happening and pointed me to some wiki pages for more details. I was able to understand, first-hand, how critical tasks are performed and how a language pair is actually created. Many experiences like these made Apertium a very likeable organization. Not only is it one of the most robust machine translation platforms, but its documentation and community also enable anyone to learn what machine translation is.

Which of the published tasks am I interested in? What do I plan to do?

I am interested in the task User-friendly lexical selection training. I plan to extend Nikita Medyankin's work on the driver script by refactoring his code and removing unnecessary scripts, adopting a more user-friendly YAML config file, making the installation of third-party tools easier (and possibly removing some of those currently used), providing regression tests for the driver script, and finally testing the work by running it on some language pairs that don't have many rules and adding the extracted rules to a pair if they improve quality.

Reasons why Google and Apertium should sponsor it

  • Make the process of generating lexical rules a lot more user-friendly.
  • The current script is difficult to understand and modify. Once the project is complete, integrating further improvements to the generation of lexical selection rules into the script will also be easier.
  • Improvements to current language pairs can be performed effectively.

How and who will it benefit in society?

  • Pair maintainers, as they will have an easier way of adding rules to the pair.
  • Users who want to start a new language pair on their own.

Plan

Bonding Period

  • Understand the current workflow for extracting lexical selection rules.
  • Understand Nikita Medyankin's existing code.
  • Try making a new language pair by myself to understand how Apertium works more clearly.

Coding Phase

Week-1

  • Devise a config format that incorporates all the different options possible for lexical selection training.

Week-2

  • Begin work on the driver script.
    • Complete the module that validates the config file.
  • Write unit tests to ensure that the validation module is working properly.
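
As a rough illustration of what the validation module could look like, here is a minimal Python sketch. It assumes the YAML config has already been loaded into a dictionary; the option names (source_lang, target_lang, corpus, tools_dir, method, train_ratio) are hypothetical and are not the final config format.

```python
from typing import List

# Hypothetical option names; the real config format is still to be devised.
REQUIRED_KEYS = {"source_lang", "target_lang", "corpus", "tools_dir"}
ALLOWED_METHODS = {"max-likelihood", "max-entropy"}


def validate_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    # Report every missing required option, not just the first one.
    for key in sorted(REQUIRED_KEYS - config.keys()):
        errors.append(f"missing required option: {key}")
    # The extraction method is optional and defaults to maximum likelihood.
    method = config.get("method", "max-likelihood")
    if method not in ALLOWED_METHODS:
        errors.append(f"unknown method: {method}")
    # The training share must be a real number strictly between 0 and 1.
    ratio = config.get("train_ratio", 0.8)
    is_number = isinstance(ratio, (int, float)) and not isinstance(ratio, bool)
    if not is_number or not 0 < ratio < 1:
        errors.append("train_ratio must be a number strictly between 0 and 1")
    return errors
```

Returning all problems at once, rather than stopping at the first, lets the user fix the config file in a single pass, and it keeps the function easy to cover with unit tests.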

Week-3

  • Write a script that takes care of building the third-party tools (IRSTLM, Giza++, Moses).
  • If the user has already installed the third-party tools, verify the installation by checking that the required binaries exist in the user-specified directory.
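
The check for an existing installation could be sketched as follows. This is only an illustration: the function name is hypothetical, and the list of binary names to look for would have to come from the actual requirements of each tool.

```python
import os
from pathlib import Path
from typing import Iterable, List


def missing_binaries(tools_dir: str, names: Iterable[str]) -> List[str]:
    """Return the names that are not executable files inside tools_dir."""
    root = Path(tools_dir)
    missing = []
    for name in names:
        candidate = root / name
        # A binary counts as present only if it is a regular file
        # that the current user is allowed to execute.
        if not (candidate.is_file() and os.access(candidate, os.X_OK)):
            missing.append(name)
    return missing
```

An empty result means every expected binary was found; otherwise the returned names can be printed so the user knows exactly which tool to rebuild.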

Week-4

  • Make the process more user-friendly by checking and fixing common issues automatically. (Reference: http://wiki.apertium.org/wiki/Installation_troubleshooting)
  • When the script fails, print helpful messages that guide the user in fixing the issue.
  • Test the deliverables completed so far on different operating systems and with different versions of GCC, Python, etc. to check compatibility.

Week-5

  • Complete the following steps in the script (Parallel corpora):
    • Preprocess corpora.
    • Split the corpus provided by the user into a training corpus and a test corpus. (Include a user-modifiable parameter for the training-to-test ratio, defaulting to 80% training and 20% test.) Run training on the training part of the corpus.
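
The split step could look roughly like the sketch below. It assumes the parallel corpus is already loaded as a list of (source sentence, target sentence) pairs so that the two sides stay aligned; shuffling with a fixed seed is my own assumption, added to make the split reproducible.

```python
import random
from typing import List, Sequence, Tuple

SentencePair = Tuple[str, str]


def split_corpus(
    pairs: Sequence[SentencePair],
    train_ratio: float = 0.8,
    seed: int = 0,
) -> Tuple[List[SentencePair], List[SentencePair]]:
    """Split aligned sentence pairs into (training, test) parts."""
    if not 0 < train_ratio < 1:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(pairs)
    # A seeded generator gives the same split on every run.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With the default ratio, a corpus of 10 sentence pairs yields 8 training pairs and 2 held-out test pairs.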

Week-6

  • Complete the following steps in the script (Parallel corpora):
    • Produce an .lrx file. (Include both ways: Maximum Likelihood Extraction & Maximum Entropy Rule Extraction)
  • Test the newly added functionalities using different corpora to ensure they are working properly.
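
To give an intuition for the maximum-likelihood option, the sketch below picks, for each ambiguous source word, the translation it is most often aligned with. This is a deliberately simplified relative-frequency estimate written for illustration: it ignores context and does not write an .lrx file, and it is not the existing extraction script.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def most_likely_translations(
    aligned_pairs: Iterable[Tuple[str, str]]
) -> Dict[str, str]:
    """Map each source word to its most frequently aligned translation."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1
    # The most frequent translation is the maximum-likelihood choice
    # when no context is taken into account.
    return {src: seen.most_common(1)[0][0] for src, seen in counts.items()}
```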

Week-7 & 8

  • Repeat the tasks done in the past two weeks on Non-Parallel Corpora.

Week-8

  • Test the .lrx file on the held-out test corpus by editing the pipeline in the modes.xml file of the language pair, then using apertium-eval-translator to check the quality of translation.

Week-9

  • I won't be available, as I will be on vacation.

Week-10

  • Create regression tests for the driver script by including a small corpus along with the script and checking whether the driver script returns the correct output after changes to any of its components (Apertium tools, third-party tools, or even the script itself).
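
One simple form such a regression test could take is a comparison of a command's output against a stored expected file. The sketch below assumes the driver script writes its result to standard output; the function name and the way the driver is invoked are hypothetical.

```python
import subprocess
from typing import Sequence


def matches_expected(command: Sequence[str], expected_path: str) -> bool:
    """Run command and compare its standard output with a stored file."""
    # check=True raises CalledProcessError if the command exits with an error.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    with open(expected_path, encoding="utf-8") as handle:
        return result.stdout == handle.read()
```

The expected file would be generated once from a known-good run on the bundled small corpus and kept under version control next to the script.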

Week-11

  • Find language pairs that don’t have many lexical selection rules and run the above script to extract rules for those language pairs.

Week-12

  • Check whether the rules acquired in the above step improve quality and add them to the existing language pairs if they do. (Perform this step in collaboration with language pair maintainers.)

Skills & Qualifications

I am a B.Tech Computer Science & Engineering student at IIIT Hyderabad. I am proficient in many languages and platforms, including but not limited to:

  • C/C++
  • Python
  • Bash Scripting
  • Java
  • Android

I have completed many projects, which you can find on my GitHub profile (https://github.com/venkatp1997). Last year, I contributed to OpenMRS (3 commits to the OpenMRS Radiology Module and 1 commit to the OpenMRS Reference Application, both in Java), but unfortunately the project I was applying for was dropped. I have also completed all the coding challenges required for this task (https://github.com/venkatp1997/User-friendly-lexical-selection-training). I believe that I am proficient enough to complete this task successfully.

Non-Summer-of-Code plans

None. Just a week of vacation as mentioned above.