Difference between revisions of "User:Venkat1997/proposal"
[[Category:GSoC_2017_Student_Proposals]]
== Contact Information ==
'''Name''': Venkat Parthasarathy

'''E-Mail''': venkat.p1997@gmail.com

'''IRC''': venkat


== Why am I interested in machine translation? ==
I come from India, where there are 22 scheduled languages and almost every state speaks a different one. From a young age, I was interested in translation between languages, because it would really help when travelling between states in a country like India. I was introduced to machine translation by an ML course at my college, and I was fascinated by how the complex process of translating between languages was neatly encapsulated in probabilistic models. Recently, I also read an article on how Google had vastly improved its translation quality by applying artificial neural networks to machine translation. Upon reading Google's work, I became even more excited about this field and its far-reaching applications. I wanted to get involved in a project that would help me understand more about this subject. I believe the project I have chosen for GSoC will help me achieve that goal, as I would be working directly on one of the most important parts of Apertium's machine translation pipeline: generating lexical selection rules. As a computer science student, it would also be a good exercise in programming and software engineering practice.

== Why is it that I am interested in Apertium? ==
Apertium has a helpful community and well-written documentation. It allows users to understand how machine translation is actually implemented in practice. I am particularly impressed by how promptly the experienced mentors respond to queries put forth by applicants. My experiences with Apertium so far have made me very enthusiastic about contributing to Apertium for Summer of Code.

== Which of the published tasks am I interested in? What do I plan to do? ==
I am interested in the task "User-friendly lexical selection training". I plan to extend [https://github.com/tiefling-cat/apertium-flst/ Nikita Medyankin's work] on the driver script. The config file he uses could be made more user-friendly, his work currently lacks an installer script that takes care of third-party tools, and some of the scripts he uses could be factored out. Some hard-to-install dependencies could also be replaced. Hence, I plan to: refactor his code and remove unnecessary scripts; adopt a more user-friendly YAML config file; make the installation of third-party tools easier (perhaps even removing some of those currently used); provide regression tests for the driver script; and finally test the work by running it on language pairs that don't have many rules, adding the extracted rules to the pair if they improve quality.
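A minimal sketch of what such a YAML config might look like. Every field name and value here is illustrative, not a final format; the real schema would be worked out in Week-1:

```yaml
# Hypothetical config sketch for the driver script.
# Field names and allowed values are illustrative assumptions.
pair: apertium-eng-spa        # language pair directory
source: eng
target: spa
corpus:
  type: parallel              # or: non-parallel
  path: corpus/eng-spa.txt
  train_ratio: 0.8            # fraction used for training; the rest is held out
tools:
  language_model: kenlm       # or: irstlm
  aligner: fast_align         # or: giza++
extraction: max-likelihood    # or: max-entropy
```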

== Reasons why Google and Apertium should sponsor it ==
* Make the process of generating lexical selection rules much more user-friendly.
* The current script is difficult to understand and modify. Upon completion of the project, integrating further improvements to lexical selection rule generation into the script will also be easier.
* Improvements to current language pairs can be made more effectively: most pairs currently have only manually written rules, and this script will make it easier to automatically generate good rules from corpora and add them.

== How and who will it benefit in society? ==
* Pair maintainers, as they will have an easier way of generating a corpus-based .lrx file and adding rules from it.
* Language pair developers who want to start a new language pair.

== Plan ==
=== Bonding Period ===
* Understand the current workflow for extracting lexical selection rules.
* Understand Nikita Medyankin's existing code.
* Pick a language pair to use for testing the driver script.
* Try making a new language pair by myself to understand how Apertium works more clearly.
=== Coding Phase ===
'''Week-1'''
* Begin work on the driver script.
* Devise a config format that incorporates all the different options possible for lexical selection training.
'''Week-2'''
* Complete the module for validating the config file.
* Write unit tests to ensure that the validation module is working properly.
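A rough sketch of what the validation module could look like. The key names and allowed values are placeholder assumptions standing in for the schema to be designed in Week-1:

```python
# Sketch of the config-validation step. REQUIRED_KEYS and
# ALLOWED_ALIGNERS are illustrative assumptions, not a final schema.
REQUIRED_KEYS = {"pair", "corpus", "tools"}
ALLOWED_ALIGNERS = {"giza++", "fast_align"}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key in sorted(REQUIRED_KEYS - config.keys()):
        problems.append("missing required key: %s" % key)
    aligner = config.get("tools", {}).get("aligner")
    if aligner is not None and aligner not in ALLOWED_ALIGNERS:
        problems.append("unknown aligner: %s" % aligner)
    ratio = config.get("corpus", {}).get("train_ratio", 0.8)
    if not 0 < ratio < 1:
        problems.append("train_ratio must be strictly between 0 and 1")
    return problems
```

Returning a list of problems rather than raising on the first error lets the script report everything wrong with the config at once, which is friendlier to the pair developer.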
'''Week-3'''
* Finalize which third-party tools are going to be used for the task.
** Test the performance of using KenLM in place of IRSTLM.
** Test the performance of using [https://github.com/clab/fast_align fast_align]/Poor Man's Alignment in place of the powerful Giza++.
* Write a script that takes care of building the third-party tools.
* If the user has already installed the third-party tools, verify the installation by checking that the required binaries exist in the user-specified directory.
'''Week-4'''
* Make the process more user-friendly by checking for and fixing common issues automatically. (Reference: http://wiki.apertium.org/wiki/Installation_troubleshooting)
* Print helpful messages that aid the language pair developer in fixing the issue when the script fails.
* Test the deliverable so far on different operating systems, versions of GCC, versions of Python, etc. to check compatibility.
'''Week-5'''
* Complete the following steps in the script (Parallel corpora):
** Preprocess corpora.
** Split the corpus provided by the language pair developer into a training and a test corpus. (Include a parameter for the training-to-test ratio which the language pair developer can modify; default to 80% training and 20% test.) Run training on the training part of the corpus.
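The split step, sketched below with a seeded shuffle so that repeated runs produce the same partition. The function name and signature are my own placeholders:

```python
import random

# Sketch of the train/test split step. The 0.8 default mirrors the
# proposed 80/20 split and is overridable by the pair developer.
def split_corpus(lines, train_ratio=0.8, seed=0):
    """Shuffle sentence pairs reproducibly and split into (train, test)."""
    rng = random.Random(seed)  # fixed seed => deterministic partition
    shuffled = list(lines)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```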
'''Week-6'''
* Complete the following steps in the script (Parallel corpora):
**Produce an .lrx file. (Include both approaches: maximum-likelihood extraction and maximum-entropy rule extraction.)
* Test the newly added functionalities using different corpora to ensure they are working properly.
'''Week-7 & 8'''
* Repeat the tasks from the past two weeks on non-parallel corpora.
* Complete the module for testing the .lrx file on the held-out test corpus. The module must edit the pipeline in the language pair's modes.xml and then run apertium-eval-translator to check the quality of translation.
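A hedged sketch of the modes.xml edit, assuming the usual layout of `<mode>`/`<pipeline>`/`<program>` elements. The anchor program (`lt-proc`), the compiled-rules file name, and the exact insertion point are illustrative; the real position of the lexical selection step depends on the pair's pipeline:

```python
import xml.etree.ElementTree as ET

# Sketch: insert an lrx-proc step into a mode's pipeline, right after
# the first lt-proc program. Assumes the usual modes.xml structure;
# the anchor and program name are illustrative, not definitive.
def insert_lrx_step(modes_text, mode_name, rules_file):
    """Return modified modes.xml text, or None if the mode was not found."""
    root = ET.fromstring(modes_text)
    for mode in root.iter("mode"):
        if mode.get("name") != mode_name:
            continue
        pipeline = mode.find("pipeline")
        for i, prog in enumerate(list(pipeline)):
            if prog.get("name", "").startswith("lt-proc"):
                step = ET.Element("program", {"name": "lrx-proc " + rules_file})
                pipeline.insert(i + 1, step)
                return ET.tostring(root, encoding="unicode")
    return None
```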
'''Week-9'''
* Won't be available as I am going on a vacation.
'''Week-10'''
* Create regression tests for the driver script by including a small corpus along with the script and checking whether the driver script returns the correct output after changes to any of its components (Apertium tools, third-party tools, or the script itself).
'''Week-11'''
* Find language pairs that don’t have many lexical selection rules and run the above script to extract rules for those language pairs.
'''Week-12'''
* Check whether the rules acquired in the above step improve quality and add them to the existing language pairs if they do. (Perform this step in collaboration with language pair maintainers.)

== Skills & Qualifications ==
I am a B.Tech Computer Science & Engineering student at IIIT Hyderabad. I am proficient in many languages including but not limited to:
* C/C++
* Python
* Bash
* Java
* Android (Java)
I have completed many projects and you can check them out on my [https://github.com/venkatp1997 Github Profile].
Last year, I contributed to OpenMRS (3 commits to the OpenMRS Radiology Module and 1 commit to the OpenMRS Reference Application, all in Java), but unfortunately the project I was applying for was dropped. I have also completed all the coding challenges required for this task. ([https://github.com/venkatp1997/User-friendly-lexical-selection-training Link]) I believe that I am proficient enough to complete my project successfully.

== Non-Summer-of-Code plans ==
None. Just a week of vacation as mentioned above.

Latest revision as of 10:37, 3 April 2017
