Difference between revisions of "User:Arghya1998"

From Apertium
Revision as of 17:29, 13 March 2018

GSoC Proposal: Python API/library for Apertium

Basic Details

Name Arghya Bhattacharya
Email Address arghya.b@research.iit.ac.in
Alternate Email Address arghyatiger@gmail.com
IRC Nick arghya[m]
Mobile +91 9831325363
TimeZone UTC + 5:30
Link to GitHub https://github.com/arghyatiger
Link to GitLab https://gitlab.com/arghyatiger

Why am I interested in Machine Translation?

The broader perspective:

Being from a diverse country like India, with over 22 officially registered languages and over 1500 mother tongue languages (150 of them are sizeable), I’ve always been curious as to how languages serve as the basic unit of communication. During my childhood, I lived in various places in India, so I had the chance to interact closely with people of different linguistic backgrounds, and in the process I ended up learning quite a few languages, including Hindi, Bengali, English, Tamil and Oriya. The language diversity in my country is fascinating, but with it come a lot of problems in communication. I believe that efficient machine translation can help solve many of these problems, break the “language barrier” not just across the country but across the globe, and connect people better.


Academic Interests:

I am currently pursuing a dual degree (B.Tech in Computer Science + M.S. by Research in Computational Linguistics) at IIIT-Hyderabad, India. A good portion of our academic focus is on Machine Translation, and I find it a really interesting area to work on. Working with Apertium will help me nurture my Computational Linguistics skills and give me a chance to give back to the community with a solid contribution.


Why is it that I am interested in Apertium?

As a student with a primary academic focus on Computational Linguistics, I use Apertium as one of the important tools for my university assignments. The Apertium projects provide a nice blend of linguistic and coding tasks, which makes them interesting to me. Also, as part of my long-term goal of contributing to the community, I think contributions to Apertium would have a significant impact on the Computational Linguistics community around the globe, and that further motivates me to work for Apertium.


Which of the published tasks am I interested in?

To me, all the published tasks seem interesting, so it was difficult to choose only one. I have narrowed my choice down to the project I like the most: Python API/library for Apertium.


Why should Google and Apertium sponsor the project of Python API for Apertium?

The Apertium code base is primarily written in C++. C++ offers fairly high performance, supports low-level systems programming, is available almost everywhere and is reasonably well standardized. However, it has a few shortcomings as well: it is non-interactive, it imposes a compile/debug/nap cycle, and extending and modifying modules is difficult. Also, once the development of a module is done, improvements such as writing user interfaces and integrating with other systems become really cumbersome in C++. Python, on the other hand, offers an interpreted, high-level programming environment. Hence a Python wrapper can bring flexibility and interactivity to Apertium’s code base, along with other benefits such as ease of debugging, ease of testing and rapid prototyping.


How and Who will benefit from this project?

The project would put a lot of developers at ease, as Python is a high-level language with many features that make it easy for developers to grasp, and it would make Apertium easier to scale in the future. Many people also like to work with Jupyter notebooks and Python. Hence I believe that a Python API for Apertium would be helpful to a large community of developers, linguists, computational linguists and anyone keen on using a wide range of linguistic tools.


Detailed project plan and workflow

1. Detailed Project Goal:

The goal of the project is to create structured Python wrappers for the core modules of Apertium, namely:

a.) The modules should be importable from Python. The Pythonic usage would be as follows:

          from apertium.lttoolbox import transducer

b.) The modules should be nested

          apertium.lttoolbox.transducer

c.) The internal usage of the functions should be as follows:

          import apertium.transducer.internal
          t = apertium.transducer.internal.Transducer().insertSingleTransduction()
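The nested layout described in points a.) and b.) can be sketched with a plain Python package, independent of how the bindings end up being generated. The snippet below is only an illustration under stated assumptions: it builds a throwaway `apertium/lttoolbox/transducer.py` in a temporary directory, and the `Transducer` class inside it is a hypothetical placeholder, not the real wrapped C++ class.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a generated binding; the real module would wrap
# the C++ Transducer class from lttoolbox.
STUB_SOURCE = '''
class Transducer:
    """Placeholder for the wrapped C++ class (illustration only)."""

    def __init__(self):
        self.transitions = []
'''


def build_demo_package(root: Path) -> None:
    """Create the nested layout apertium/lttoolbox/transducer.py under root."""
    package_dir = root / "apertium" / "lttoolbox"
    package_dir.mkdir(parents=True)
    # An __init__.py in each directory makes it a regular importable package.
    (root / "apertium" / "__init__.py").write_text("")
    (package_dir / "__init__.py").write_text("")
    (package_dir / "transducer.py").write_text(STUB_SOURCE)


def import_demo_transducer():
    """Build the demo package in a temporary directory and import it."""
    root = Path(tempfile.mkdtemp())
    build_demo_package(root)
    # Put the temporary directory first so the demo package is found.
    sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    from apertium.lttoolbox import transducer
    return transducer


if __name__ == "__main__":
    module = import_demo_transducer()
    print(module.__name__)
```

With such a layout, both `from apertium.lttoolbox import transducer` and the fully qualified name `apertium.lttoolbox.transducer` resolve to the same module.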

2. Tool to be used:

For the project, I plan on using SWIG to bind the C++ code. SWIG is a software development tool that simplifies the task of interfacing different languages to C and C++ programs. It is a compiler that takes C/C++ declarations and creates the wrappers needed to access those declarations from other languages. The other options that I explored for the project are Pyrex, ctypes, SIP and Boost.Python. But for a project of this scale, SWIG seems to be the most convenient because of a number of features explained later in the proposal.

3. Timeline:

Goals for the various phases:

PHASE OBJECTIVE OF PHASE
Community Bonding Period
  • Good Understanding of all the modules, all the intricacies of binding each module and a detailed report of the modules
Phase 1
  • Binding/Testing the Lttoolbox Module
Phase 2
  • Binding/Testing the Apertium Module
Phase 3
  • Documentation of usage of the python modules and library organization of the modules made in previous phases


Week-Wise Goals:

NAME OF PHASE DURATION DETAILS OF PHASE TASKS
COMMUNITY BONDING PERIOD
  • START: April 23rd
  • END: May 13th
  • Playing around with the lttoolbox and apertium modules and using every function and understanding all the flags and arguments of the functions.
  • Reading up on the details of SWIG. Taking input from various Apertium users on what the ideal implementation would be for them.
WEEK 1
  • START: May 14th
  • END: May 20th
  • Setting up Distutils for the lttoolbox module and making the basic layout importable in Python.
WEEK 2
  • START: May 21st
  • END: May 27th
  • Making explicit declarations of the module's constants and enumerations in the SWIG interface
  • Testing all pointer-based data manipulation for errors (a common problem with SWIG bindings)
  • Looking for data members that need to be made read-only and making the necessary changes in the interface file
  • Identifying static class members. Python classes had no support for static methods, and no version of Python supports static member variables in a manner that SWIG can utilize. Therefore, SWIG generates wrappers that try to work around some of these issues, but the remaining issues have to be taken care of manually.
  • Resolving SWIG's namespace problem manually (it occurs if there are multiple namespaces)
WEEK 3
  • START: May 28th
  • END: June 3rd
  • Telling SWIG to create wrappers for each particular template instantiation. Hence all the templates have to be explicitly declared for the specific data types being manipulated in them.
  • C++ reference-counted objects: referencing and dereferencing of objects has to be handled carefully so that no errors occur; this is another place where SWIG isn’t smart enough.
  • Handling C++ overloaded functions: overloading support is not quite as flexible as in C++. Sometimes there are methods that SWIG can't disambiguate; if such errors appear, they have to be resolved manually in the interface file of the wrapper.
WEEK 4
  • START: June 4th
  • END: June 10th
  • Implementing director classes. By default, no mechanism exists to pass method calls down the inheritance chain from C++ to Python. In particular, if a C++ class has been extended in Python, these extensions will not be visible from C++ code, so virtual method calls from C++ are not able to reach the lowest implementation in the inheritance chain. SWIG implements a feature called directors, whose job is to route method calls correctly, either to C++ implementations higher in the inheritance chain or to Python implementations lower in the inheritance chain.
  • Writing C++ helper functions: sometimes the SWIG module misses bits of functionality because there is no easy way to construct and manipulate a suitable datatype; for those cases, C++ helper functions need to be written.
  • Writing high-level Python functions to provide a high-level Python interface built on top of the low-level helper functions.
  • Error handling: if C++ throws an error, it is better to convert it into a Python exception.
WEEK 5
  • START: June 11th
  • END: June 17th
  • Ref Week 1 (the same tasks, applied to the Apertium module)
WEEK 6
  • START: June 18th
  • END: June 24th
  • Ref Week 2 (the same tasks, applied to the Apertium module)
WEEK 7
  • START: June 25th
  • END: July 1st
  • Ref Week 3 (the same tasks, applied to the Apertium module)
WEEK 8
  • START: July 2nd
  • END: July 8th
  • Ref Week 4 (the same tasks, applied to the Apertium module)
WEEK 9
  • START: July 9th
  • END: July 15th
  • Testing the modules built and starting the documentation.
WEEK 10
  • START: July 16th
  • END: July 22nd
  • Finishing the documentation of the module and distributing it for beta testing
WEEK 11
  • START: July 23rd
  • END: July 29th
  • Collecting feedback from beta testing and implementing changes, if any.
WEEK 12
  • START: July 30th
  • END: August 5th
  • Making the super wrapper for the modules
  • Making the module pip installable
  • Updating the documentation
WEEK 13
  • START: August 6th
  • END: August 14th
  • Analysing the code and writing bug reports for any bugs found.
  • Writing the final documentation
  • Releasing the final module

Coding Challenge

1.) Make the Transducer model importable from Python
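One simple way to check whether this challenge has been met is a smoke test that asks Python whether a module can be found. The helper below is a minimal sketch using only the standard library; `is_importable` is a hypothetical name, and the module path passed to it in the usage example is taken from the project goal above rather than from an existing package.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the named module, False otherwise."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError);
        # a module without a usable spec raises ValueError.
        return False


if __name__ == "__main__":
    for name in ("json", "apertium.lttoolbox.transducer"):
        print(name, is_importable(name))
```

For a dotted name, `find_spec` imports the parent packages first, so this check also exercises the nested package structure.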


About me: Education and Experience

I am a sophomore at IIIT-Hyderabad, India, pursuing my dual degree in Computer Science and Computational Linguistics. I’ve worked closely with C++ and Python in a lot of projects, and I take a keen interest in machine learning as well. I love building fun applications. The details of my work experience can be found here.