User:Vaydheesh/Proposal

From Apertium
Revision as of 13:17, 9 April 2019 by Vaydheesh (talk | contribs)

GSoC Proposal : Python API/library for Apertium

Basic Details

Name Lokendra Singh
Email Address lokendras1998@gmail.com
IRC Nick loke98
Country & TimeZone India (UTC + 5:30)
Link to GitHub https://github.com/vaydheesh


Why am I interested in Machine Translation?

The broader perspective:

I come from India, a diverse country where, as the saying goes, "Every two miles the water changes, every four miles the speech". I have encountered many dialects of Hindi, such as Shauraseni, Hindustani, Braj Bhasha, Haryanvi, Bundeli, Kannauji, Awadhi, Bagheli, Chhattisgarhi, and Bombay Hindi. With so much variation within a single language, linguistics has always fascinated me. Combining this with my passion for Python and my desire to contribute to the open-source community, Apertium is my choice for GSoC 2019.


Why is it that I am interested in Apertium?

During my projects on machine learning, I came across Natural Language Processing, which opened up the world of computational linguistics for me. While browsing the list of organisations, Apertium Machine Translation caught my eye. It offers a nice combination of coding challenges and linguistics. I have been using free software for the past few years, and now I want to start contributing to the community. Apertium seems to be the right choice for me.


Which of the Ideas List am I interested in?

Initially, I was confused between Unsupervised Learning and Python API, but I have decided upon the Python API/library for Apertium.


Why should Google and Apertium sponsor the project of Python API for Apertium?

Apertium is written in C++, which offers high performance and a high level of abstraction and is well standardized. However, it has a few shortcomings: it is not very beginner-friendly, and writing user interfaces in C++ is cumbersome. Python, on the other hand, is an interpreted, high-level language. A Python wrapper generated with SWIG, combined with Jupyter Notebooks, can provide flexibility and ease of installation, debugging, and testing.


How and who will benefit from this project?

The project would put many developers at ease. Python is a high-level language with many features that make it easy for developers to pick up. Many people like to use Jupyter Notebooks, and a Python module would grow the user community. The installation of Apertium can also be simplified by making it available on PyPI, which would open the Apertium library to the large user base on Microsoft Windows™. Hence I believe that a Python API for Apertium would help a large community of developers, linguists, computational linguists, and everyone keen on using the wide range of linguistic tools that we provide.


Coding Challenge

I worked on Coding Challenge 1: a working installation of Apertium via a setup.py file in a Windows environment. The challenge was really interesting to work on. Though it seemed easy at first, it had its own set of hidden difficulties. I had to get familiar with the Apertium Bash helper script and the underlying binaries it uses. I had to add the Apertium binaries to the process's PATH without permanently polluting the user's environment variables. Some tweaks to the existing code base were required to ensure that the Apertium Python module worked out of the box without creating issues for its users.
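The idea of extending PATH for the running process only can be sketched as follows. This is a minimal illustration, not the code from the pull request; the function name and the directory used in the usage note are hypothetical.

```python
import os


def add_to_process_path(directory):
    """Prepend `directory` to PATH for the current process only.

    Changes to os.environ affect this process and any child processes
    it starts afterwards; the user's persistent environment settings
    are left untouched.
    """
    directory = os.path.abspath(directory)
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        os.environ["PATH"] = os.pathsep.join([directory] + entries)
    return os.environ["PATH"]
```

Calling `add_to_process_path("hypothetical_apertium_bin")` before spawning a binary makes that directory searchable for the lifetime of the Python process; once the process exits, nothing remains of the change.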

While working on this coding challenge, I became familiar with the Apertium code base. To create the setup.py file, I had to understand the entire Apertium Python project to ensure that all the minor tweaks were compatible with the existing code and did not result in unexpected errors.

As of now, all the checks are passing, and the pull request is waiting to be merged by an organisation member. Link to Pull Request


Detailed project plan and workflow

1. Tools To Be Used

As suggested in the Ideas List, I plan to use SWIG (Simplified Wrapper and Interface Generator), an open-source tool that connects programs or libraries written in C or C++ with scripting languages, in this case Python. The current implementation calls the Apertium binaries as subprocesses, which adds overhead and slows down translation. SWIG can generate wrappers around the C++ files, producing modules that can be imported directly in Python. This should give us the speed of C++ together with the ease of use of Python.

[Figure: SWIG.png. Flowchart describing the process of generating the Python wrapper.]
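The cost of spawning a subprocess per call can be illustrated with a small timing sketch. It does not use Apertium: a trivial upper-casing task stands in for the real work, and all function names are hypothetical. The same task is run once in-process and once through a child Python interpreter.

```python
import subprocess
import sys
import time


def upper_in_process(text):
    """Stand-in for a call into an imported (wrapped) library."""
    return text.upper()


def upper_via_subprocess(text):
    """Stand-in for piping text through an external binary."""
    child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def average_seconds(func, text, repeats=5):
    """Average wall-clock time of func(text) over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(text)
    return (time.perf_counter() - start) / repeats
```

Comparing `average_seconds(upper_via_subprocess, "hello")` with `average_seconds(upper_in_process, "hello")` shows the per-call process start-up cost that an imported wrapper would avoid. The absolute numbers depend on the machine and say nothing about Apertium's own binaries.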

2. Timeline

Goals for the various phases:

PHASE OBJECTIVE OF PHASE
Community Bonding Period
  • Understand lttoolbox and the other dependencies of Apertium Python
Phase 1
  • Create SWIG Interface files and shared libraries from them for Morphological Analysis and Generation
Phase 2
  • Create SWIG Interface files and shared libraries from them for Performing Translations, along with a setup.py for various Linux Distributions
Phase 3
  • Publish the package on PyPI, with Jupyter Notebooks, and Documentation on Apertium Wiki


3. Biweekly Goals


WEEK AND DATE TASK EXPLANATION
Community Bonding Period
  • The current code has the overhead of calling the lttoolbox binaries. The process can be made faster by importing the C++ code as libraries in the Python module, reducing computation time. I plan to work with the lttoolbox maintainers to understand how lttoolbox works, so that generating the SWIG interface files goes smoothly.
  • Understand the various data types and arguments used in the code.
  • Read the SWIG documentation to handle the various cases that arise while generating interface files.
Week 1&2, 27 May to 9 June
  • Analysis, Generation and Translation are working well with the current implementation. I plan to implement each of these features individually.
  • Write the interface files for the C++ files required for analysis.
  • Generate shared libraries from the interface files.
  • Generate the Python module apertium.analysis from the interface files.
  • Write unit tests for the generated Python module.
Week 3&4, 10 June to 23 June
  • Repeat the above steps for the Morphological Generation, apertium.generation
Week 5&6, 24 June to 7 July
  • Having become familiar with the codebase, implement the same for Translation, apertium.translation.
  • Make a super wrapper for the Analyzer, Generator, and Translator, apertium.__init__.
Week 7&8, 8 July to 21 July
  • Write a Python script (for cross-platform support) to generate the shared libraries.
  • I plan to use g++ on GNU/Linux and MinGW on Windows to generate shared libraries and DLL files respectively.
  • Modify setup.py to make it compatible with various distros (Debian and Red Hat).
  • ./setup.py install will install the Apertium package according to the distro being used by the user.
Week 9&10, 22 July to 4 August
  • Prepare Jupyter Notebooks for users.
  • Publish the code on PyPI, to make it pip installable.
Week 11&12, 5 August to 18 August
  • Create documentation and tutorials for the prepared codebase on the Apertium wiki.
  • Add usage notes in Markdown and include them in README.md.
  • Collect feedback from alpha testing and make the necessary changes.
  • Fix errors reported by users.
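The unit-testing step planned for Weeks 1&2 could look roughly like the sketch below. Since the generated apertium.analysis module does not exist yet, the sketch tests a small hypothetical helper that splits analyser output in the Apertium stream format (`^surface/reading$`) into lexical units. The helper is a simplification that ignores escaped characters, and its name is an assumption.

```python
import re
import unittest

# One lexical unit in the stream format: ^surface/reading1/reading2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def parse_lexical_units(stream):
    """Return a list of (surface_form, [readings]) tuples.

    Simplified: escaped '^', '$' and '/' characters are not handled.
    """
    units = []
    for match in LEXICAL_UNIT.finditer(stream):
        surface, *readings = match.group(1).split("/")
        units.append((surface, readings))
    return units


class ParseLexicalUnitsTest(unittest.TestCase):
    def test_single_reading(self):
        self.assertEqual(
            parse_lexical_units("^cats/cat<n><pl>$"),
            [("cats", ["cat<n><pl>"])],
        )

    def test_multiple_units(self):
        units = parse_lexical_units("^a/a<det>$ ^b/b<n>/b<vblex>$")
        self.assertEqual(len(units), 2)
        self.assertEqual(units[1], ("b", ["b<n>", "b<vblex>"]))

    def test_empty_input(self):
        self.assertEqual(parse_lexical_units(""), [])
```

Saved as a module, the tests can be run with `python -m unittest`. Tests for the real wrapper would call the generated module instead of this stand-in.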

4. Monthly Deliverables

Deliverable EXPLANATION
Deliverable 1
  • Pythonic wrapper for both Morphological Analyzer and Morphological Generator
Deliverable 2
  • Pythonic script to automate the build process
  • Pythonic wrapper for performing Translation and a super wrapper for the Analyzer, Generator, and Translator.
Deliverable 3
  • Cross platform setup.py
  • Pip installable apertium-python
  • Documentation and tutorials on the Apertium wiki, usage notes in Markdown, and examples in the form of Jupyter Notebooks
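A build-automation script of the kind listed under Deliverable 2 might start from a helper like the one below. It is a sketch under stated assumptions: the function name and file names are hypothetical, only the compiler command is assembled (nothing is executed), and the include and library paths a real build needs are omitted. The `.dll` suffix follows the plan's mention of DLL files on Windows.

```python
import platform


def shared_library_command(sources, output_stem, system=None):
    """Return a g++ command (as a list) that builds a shared library.

    `system` defaults to the current platform; pass "Linux" or
    "Windows" explicitly to see the command for another platform.
    """
    system = system or platform.system()
    if system == "Windows":
        # MinGW's g++ accepts -shared and can emit a DLL.
        output = output_stem + ".dll"
        flags = ["-shared"]
    else:
        # -fPIC is conventionally needed for shared objects on GNU/Linux.
        output = output_stem + ".so"
        flags = ["-shared", "-fPIC"]
    return ["g++", *flags, *sources, "-o", output]
```

For example, `shared_library_command(["analysis_wrap.cxx"], "_analysis", system="Linux")` yields a list that could be passed to `subprocess.run` once the missing paths are added.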


Examinations

My theory exams should be over by the 4th week of May (25th May, 2019). My practical exams will be conducted in the following two weeks, i.e. 27th May, 2019 to 8th June, 2019. This might reduce my efficiency in the first two weeks of the internship. Hence I plan to start the initial work during the community bonding period, before the coding period begins (27th May, 2019). This should give me the head start required for timely submission of the project deliverables. I expect that the Morphological Analyzer, being the first component to be implemented, might take more time than the others. To stick to my timeline, I plan to work overtime, which will allow me to absorb unexpected delays due to my examinations.

About me: Education and Experience

I am a final-year student at Maharaja Agrasen Institute Of Technology, Delhi, India, pursuing a B.Tech in Mechanical and Automation Engineering. I have worked with C++ (competitive programming) and Python (machine learning and web scraping), and I have been using Arch Linux as my primary operating system for the past 4 years. With this experience, I am confident that I can build a decent cross-platform Pythonic API.


Non-Summer Of Code Plans

My college vacations fall during the months of Google Summer of Code, so I will be able to devote around 40 hours every week. I have no vacation plans.


Post GSoC Plans

1. Create SWIG wrapper for remaining lttoolbox files.

2. Convert the remaining codebase into python modules.

3. Work on the remaining portion and implement it in