User:Venkat1997/proposal

Contact Information

Name: Venkat Parthasarathy

E-Mail: venkat.p1997@gmail.com

IRC: venkat

Why am I interested in machine translation?

I come from India, where there are 22 scheduled languages and almost every state speaks a different one. From a young age I have been interested in translation between languages, since it would make travelling between states in a country like India much easier. I was introduced to machine translation by an ML course at my college, and I was fascinated by how the complex process of translating between languages was neatly encapsulated in probabilistic models. Recently, I also read an article on how Google had vastly improved its translation quality by applying artificial neural networks to machine translation. Upon reading about Google's work, I became even more excited about this field and its far-reaching applications. I wanted to get involved in a project that would help me understand more about this subject, and I believe the project I have chosen for GSoC will help me achieve that goal, as I would be working directly on one of the most important aspects of Apertium's machine translation pipeline: generating lexical selection rules. As a computer science student, I would also find it a good exercise in programming and software engineering practices.

Why am I interested in Apertium?

Apertium has a helpful community and well-written documentation, which allow users to understand how machine translation is actually implemented in practice. I am particularly impressed by how promptly experienced mentors respond to any queries put forth by applicants. My experiences with Apertium so far have made me very enthusiastic about contributing to it for Summer of Code.

Which of the published tasks am I interested in? What do I plan to do?

I am interested in the task User-friendly lexical selection training. I plan to extend Nikita Medyankin's work on the driver script by refactoring his code and removing unnecessary scripts, adopting a more user-friendly YAML config file, making the installation of third-party tools easier (and possibly dropping some of the tools currently used), and providing regression tests for the driver script. Finally, I will test the work by running it on language pairs that do not have many rules and adding the extracted rules to those pairs if they improve translation quality.

Reasons why Google and Apertium should sponsor it

  • It makes the process of generating lexical selection rules much more user-friendly.
  • The current script is difficult to understand and modify. Once the project is complete, integrating further improvements to lexical selection rule generation into the script will also be easier.
  • Improvements to existing language pairs can be carried out more effectively.

Who in society will it benefit, and how?

  • Language pair maintainers, who will have an easier way of adding rules to their pairs.
  • Users who want to start a new language pair.

Plan

Bonding Period

  • Understand the current workflow for extracting lexical selection rules.
  • Understand Nikita Medyankin's existing code.
  • Try building a new language pair myself to understand more clearly how Apertium works.

Coding Phase

Week-1

  • Devise a config format that incorporates all the different options possible for lexical selection training.
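
Since the exact set of options is what this week is meant to pin down, the following is only a rough sketch of what such a YAML config might look like; every key name below is a placeholder, and PyYAML is assumed for parsing.

  # A rough sketch (not the final format) of the kind of YAML config the
  # driver script could accept, parsed here with PyYAML; every key is a
  # placeholder for an option the real format would need to cover.
  import textwrap
  import yaml

  EXAMPLE_CONFIG = textwrap.dedent("""\
      mode: parallel            # parallel | non-parallel
      source_lang: en
      target_lang: es
      corpus: corpus/europarl.en-es.txt
      train_test_ratio: 0.8     # share of the corpus used for training
      rule_extraction: maxent   # maxent | maximum-likelihood
      tools:
        giza: /opt/giza++/bin
        irstlm: /opt/irstlm
        moses: /opt/moses
  """)

  config = yaml.safe_load(EXAMPLE_CONFIG)
  print(config["mode"], config["train_test_ratio"])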

Week-2

  • Begin work on the driver script.
    • Complete the module for validating the config file.
  • Write unit tests to ensure that the validation module is working properly.
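
As a rough illustration of what I have in mind for the validation module and its unit tests, here is a minimal sketch; the required keys, function name, and error messages are placeholders, not the final design.

  # Minimal sketch of a config validation helper plus unit tests; the
  # required keys and messages are placeholders.
  import unittest

  REQUIRED_KEYS = {"mode", "source_lang", "target_lang", "corpus"}

  def validate_config(config):
      """Return a list of human-readable problems; an empty list means valid."""
      problems = ["missing required key: " + key
                  for key in sorted(REQUIRED_KEYS - config.keys())]
      if config.get("mode") not in ("parallel", "non-parallel"):
          problems.append("mode must be 'parallel' or 'non-parallel'")
      ratio = config.get("train_test_ratio", 0.8)
      if not 0 < ratio < 1:
          problems.append("train_test_ratio must be strictly between 0 and 1")
      return problems

  class ValidateConfigTest(unittest.TestCase):
      def test_valid_config_passes(self):
          config = {"mode": "parallel", "source_lang": "en",
                    "target_lang": "es", "corpus": "corpus.txt"}
          self.assertEqual(validate_config(config), [])

      def test_missing_corpus_is_reported(self):
          config = {"mode": "parallel", "source_lang": "en", "target_lang": "es"}
          self.assertIn("missing required key: corpus", validate_config(config))

  if __name__ == "__main__":
      unittest.main()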

Week-3

  • Write a script that takes care of building the third-party tools (IRSTLM, Giza++, Moses).
  • If the user has already installed the third-party tools, ensure that they are installed properly by checking that the required binaries exist in the user-specified directory.
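
A minimal sketch of that binary check follows; the binary names listed are the usual ones shipped with GIZA++ and Moses, but the exact set to require is something I would confirm during the project.

  # Check that the third-party binaries exist either in the user-specified
  # directory or on PATH; the expected binary names are an assumption.
  import os
  import shutil

  EXPECTED_BINARIES = {
      "giza": ["GIZA++", "snt2cooc.out", "mkcls"],
      "moses": ["moses"],
  }

  def missing_binaries(tool, directory=None):
      """Return the expected binaries of a tool that cannot be found."""
      missing = []
      for name in EXPECTED_BINARIES.get(tool, []):
          in_dir = os.path.join(directory, name) if directory else None
          if not ((in_dir and os.access(in_dir, os.X_OK)) or shutil.which(name)):
              missing.append(name)
      return missing

  print(missing_binaries("giza", "/opt/giza++/bin"))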

Week-4

  • Make the process more user-friendly by checking and fixing common issues automatically. (Reference: http://wiki.apertium.org/wiki/Installation_troubleshooting)
  • When the script fails, print helpful messages that aid the user in fixing the issue.
  • Test the deliverable so far on different operating systems, GCC versions, Python versions, etc. to check compatibility.
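
The "check and explain" pattern I intend to use might look roughly like the sketch below; the two checks and the version numbers shown are illustrative only, not the full list derived from the troubleshooting page.

  # Each check is paired with a hint that tells the user how to fix the
  # problem; the checks shown are illustrative placeholders.
  import shutil
  import sys

  CHECKS = [
      (lambda: sys.version_info >= (3, 4),
       "Python 3.4 or newer is needed; please upgrade your Python installation."),
      (lambda: shutil.which("lt-proc") is not None,
       "lttoolbox does not appear to be installed (lt-proc is not on PATH); "
       "see http://wiki.apertium.org/wiki/Installation_troubleshooting"),
  ]

  def run_checks():
      ok = True
      for check, hint in CHECKS:
          if not check():
              print("Problem found:", hint)
              ok = False
      return ok

  run_checks()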

Week-5

  • Complete the following steps in the script (Parallel corpora):
    • Preprocess corpora.
    • Split the corpus provided by the user into a training corpus and a test corpus. (Include a user-modifiable parameter for the training/test ratio, defaulting to 80% training and 20% test; see the sketch after this list.)
    • Run training on the training part of the corpus.
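
A minimal sketch of the split, assuming preprocessing has already produced one sentence per line; the default ratio would come from the config file.

  # Split preprocessed corpus lines into training and test portions; the
  # default ratio mirrors the 80/20 split proposed above.
  def split_corpus(lines, train_ratio=0.8):
      """Return (training, test) slices of the corpus lines."""
      cut = int(len(lines) * train_ratio)
      return lines[:cut], lines[cut:]

  sentences = ["sentence %d" % i for i in range(10)]   # stand-in for real corpus lines
  train, test = split_corpus(sentences)
  print(len(train), len(test))                         # 8 2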

Week-6

  • Complete the following steps in the script (Parallel corpora):
    • Produce an .lrx file. (Support both approaches: maximum-likelihood extraction and maximum-entropy rule extraction.)
  • Test the newly added functionalities using different corpora to ensure they are working properly.
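
How the driver might choose between the two extraction methods could look like the sketch below; the two *_extract functions merely stand in for the actual maximum-likelihood and maximum-entropy pipelines built on apertium-lex-tools and are left as placeholders.

  # Dispatch between the two rule extraction methods; the extraction
  # functions themselves are placeholders for the real pipelines.
  def maximum_likelihood_extract(training_corpus, output_lrx):
      raise NotImplementedError("placeholder for the maximum-likelihood pipeline")

  def maximum_entropy_extract(training_corpus, output_lrx):
      raise NotImplementedError("placeholder for the maximum-entropy pipeline")

  def extract_rules(config, training_corpus, output_lrx):
      method = config.get("rule_extraction", "maxent")
      if method == "maximum-likelihood":
          return maximum_likelihood_extract(training_corpus, output_lrx)
      if method == "maxent":
          return maximum_entropy_extract(training_corpus, output_lrx)
      raise ValueError("unknown rule_extraction method: %s" % method)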

Week-7 & 8

  • Repeat the tasks done in the past two weeks for non-parallel corpora.
  • Test the .lrx file on the held-out test corpus by editing the pipeline in the language pair's modes.xml and then using apertium-eval-translator to check the quality of translation.
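
A rough sketch of this evaluation step, assuming the usual command-line behaviour of apertium (-d pointing at the pair directory) and the -test/-ref options of apertium-eval-translator; the mode name and file paths would come from the config.

  # Translate the held-out test corpus with the edited pipeline mode and
  # score it against the reference with apertium-eval-translator.
  import subprocess

  def evaluate(pair_dir, mode, source_file, reference_file, output_file):
      with open(source_file, encoding="utf-8") as src, \
           open(output_file, "w", encoding="utf-8") as out:
          subprocess.run(["apertium", "-d", pair_dir, mode],
                         stdin=src, stdout=out, check=True)
      subprocess.run(["apertium-eval-translator",
                      "-test", output_file, "-ref", reference_file], check=True)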

Week-9

  • I won't be available, as I will be on vacation.

Week-10

  • Create regression tests for the driver script by bundling a small corpus with it and checking that the script still returns the correct output after changes to any of its components (Apertium tools, third-party tools, or the script itself).
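
A sketch of what such a regression test could look like; the driver script name, the config file, and the output/expected paths are hypothetical placeholders for the final layout.

  # Run the driver on a small bundled corpus and compare the produced rules
  # with a stored reference file; all names here are placeholders.
  import filecmp
  import subprocess
  import unittest

  class DriverRegressionTest(unittest.TestCase):
      def test_small_corpus_matches_reference(self):
          subprocess.run(["python3", "lexical_training.py", "tests/small-corpus.yaml"],
                         check=True)
          self.assertTrue(filecmp.cmp("tests/output/rules.lrx",
                                      "tests/expected/rules.lrx", shallow=False))

  if __name__ == "__main__":
      unittest.main()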

Week-11

  • Find language pairs that don’t have many lexical selection rules and run the above script to extract rules for those language pairs.
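
One possible way to spot such pairs is to count the <rule> elements in each pair's .lrx file; the directory layout assumed below (a folder of apertium-* checkouts) is only an assumption.

  # Count <rule> elements in each pair's .lrx file and list the pairs with
  # the fewest rules first.
  import glob
  import xml.etree.ElementTree as ET

  def rule_counts(pairs_dir):
      counts = {}
      for path in glob.glob("%s/apertium-*/*.lrx" % pairs_dir):
          counts[path] = len(ET.parse(path).findall(".//rule"))
      return sorted(counts.items(), key=lambda item: item[1])

  for path, count in rule_counts("pairs"):
      print(count, path)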

Week-12

  • Check whether the rules acquired in the above step improve quality and add them to the existing language pairs if they do. (Perform this step in collaboration with language pair maintainers.)

Skills & Qualifications

I am a B.Tech Computer Science & Engineering student at IIIT Hyderabad. I am comfortable with a number of languages and platforms, including but not limited to:

  • C/C++
  • Python
  • Bash
  • Java
  • Android

I have completed many projects, which you can check out on my GitHub profile. Last year, I contributed to OpenMRS (3 commits to the OpenMRS Radiology Module and 1 commit to the OpenMRS Reference Application, both in Java), but unfortunately the project I was applying for was dropped. I have also completed all the coding challenges required for this task. (Link) I believe that I am proficient enough to complete my project successfully.

Non-Summer-of-Code plans

None. Just a week of vacation as mentioned above.