User:Venkat1997/proposal


Contact Information

Name: Venkat Parthasarathy

E-Mail: venkat.p1997@gmail.com

IRC: venkat

Why am I interested in machine translation?

I come from India, where there are 22 scheduled languages and almost every state speaks a different language. From a young age, I was interested in how to translate between languages, because it would really help when travelling between states in a country like India. I was introduced to machine translation by an ML course at my college, and I was fascinated by how the complex process of translation between languages was neatly encapsulated in probabilistic models. Recently, I also read an article on how Google had vastly improved its translation quality by applying artificial neural networks to machine translation. Upon reading about Google's work, I became even more excited about this field and its far-reaching applications. I wanted to get involved in a project that would help me understand more about this subject. I believe that the project I have chosen for GSoC will help me achieve that goal, as I would be working directly on one of the most important aspects of Apertium's machine translation pipeline: generating lexical selection rules. As I am a computer science student, it would also be a good exercise in programming and software engineering practices.

Why is it that I am interested in Apertium?

Apertium is one of the few translation platforms that has both a helpful community and detailed documentation. It allows users to understand what is actually happening behind machine translation. For example, I spent some time on IRC discussing with Unhammer how to add more custom lexical rules and evaluate their performance. He promptly explained what was happening and also pointed me to some wiki pages for more details. I was able to understand, first-hand, how critical tasks are performed and how a language pair is actually created. Many experiences like these made Apertium a very likeable organization. Not only is it one of the most robust machine translation platforms, but its documentation and community also enable anyone to learn what machine translation is.

Which of the published tasks am I interested in? What do I plan to do?

I am interested in the task "User-friendly lexical selection training". I plan to extend Nikita Medyankin's work on the driver script by refactoring his code and removing unnecessary scripts, adopting a more user-friendly YAML config file, and making the installation of third-party tools easier (maybe even removing some of those that are currently used). I also plan to provide regression tests for the driver script. Finally, I will test the work by running it on some language pairs that don't have many rules, and add the generated rules to a pair if they improve its quality.

Reasons why Google and Apertium should sponsor it

  • Make the process of generating lexical rules a lot more user-friendly.
  • The current script is difficult to understand and modify. Once the project is complete, further improvements to the generation of lexical selection rules will also be easier to make.
  • Improvements to current language pairs can be made more effectively.