Difference between revisions of "User:Techievena/Proposal"

Revision as of 17:37, 15 March 2018

Contact Information

Name: Abinash Senapati
IRC Nickname: Techievena
E-mail address: techievena@users.sourceforge.net
Website: https://techievena.github.io
Github: https://github.com/Techievena
Gitlab: https://gitlab.com/Techievena
Time Zone: UTC +5:30 (IST-India)
Location: Bhubaneswar, India or Bangalore, India
GSoC blog URL: https://techievena.github.io/categories/GSoC
School/Degree: B.Tech. in Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, India
Expected Graduation Year: 2019

Code Sample

I have been actively involved with open source organisations for more than a year now. During this time I have gained a lot of experience regarding the do’s and don’ts of open source communities, helped the open source communities at my university and at coala maintain integrity, and helped a number of new developers bootstrap their open source journey. I was also a Google Code-in mentor for the coala organisation during 2017-18.

I have previous experience of working with Apertium and have had my commits merged into the main Apertium repository. I have improved apertium-lint, a GSoC 2016 project of Apertium:

apertium-lint repository

I have also been involved with coala for more than a year now and am a member of the aspects team of coala. I am also a member of the open source development team of gitmate. All my contributions to open source projects are listed below:

Other significant contributions I would like to mention:

I review code daily and have helped with newcomer orientation by assisting newcomers whenever they get stuck. The link to my reviews:
- Github reviews
I have also opened a significant number of issues in some repositories:
- Github issues
- Gitlab issues
I have created a NuGet package for LLVM 3.4.1 for use by coala and am currently working to improve it as part of an issue. Here is the link for that package.

University projects/ Personal projects

Interest in machine translation

It is hard to regard language as merely a means of communication. The experiences, history and world views of its speakers have been soaked deeply into every language since its beginning. I come from a country that houses over 1600 languages, of which around 150 have a sizeable speaking population even today. Most of them are on the verge of extinction, and because their morphological rules are orderly, RBMT is a promising option for their restoration.

As I am taking the Formal Languages and Automata Theory course this semester at my university, I am familiar with Deterministic Finite Automata, Non-deterministic Finite Automata, Weighted Transducers, Finite State Machines and other concepts essential for taking up a project in machine translation.

Interest in Apertium project

Apertium is a rule-based machine translation platform. I began working on Apertium about a year ago to modify the apertium-lint tool to support multiple configuration files, as this was essential for writing the ApertiumlintBear module for coala.

Since then, I have wanted to explore the territory of linguistics and machine translation. Apertium is a long-established and popular project and is even used by Wikimedia for translation purposes. It is a very active open-source organisation with regular maintenance and development cycles, and therefore a great platform for newcomers to learn from the veterans through interaction and contribution.

Project Info

Proposal Title

Apertium: Extend lttoolbox to have the power of HFST
Lttoolbox is a toolbox in the Apertium project for lexical processing, morphological analysis and generation of words. With the current Apertium dictionary format it is quite hard to perform complex morphological transformations with lttoolbox. This project aims at extending lttoolbox with morphographemics and lexical weights in order to enable more complex translations with the transducer. By the end of this project, lttoolbox should be sophisticated enough to support most of the language pairs.

Proposal Abstract

The aim of this project is to implement support for morphographemics and weights in the lttoolbox transducer. The proposal focuses on extending lttoolbox to perform the complex morphological transformations currently done in HFST and on writing a module that translates the current HFST format to the new lttoolbox format.

Possible Mentor(s):- User:Mlforcada, User:TommiPirinen, User:Unhammer
Language(s) Used:- C++, Python, XSLT, XML

Proposal Description

Motivation / Benefit to the society

Currently, most language pairs in Apertium use lttoolbox because of its straightforward syntax and its useful features for validating lemmas, but it is not suited to agglutinative languages or to languages with complex and regular morphophonology, which rely on HFST instead. This gap makes things harder for new developers, who must learn to work with both platforms in order to improve or start a language pair in Apertium. Hence the need for a single module in Apertium that can handle all language pairs and make the project more developer friendly.

Reason for sponsorship

I am a volunteer and am working in my free time to improve Apertium. I am a student and I need to earn money during the summers one way or another. A sponsorship can reinforce my self-confidence and give me greater powers of resilience. The writing of a preprocessor for lttoolbox is a huge effort, as a big rework is required for this, but with high benefits in the area of machine translation. I would like to flip bits instead of burgers this year. Making a full job out of this means that I can focus all of my energy and dedication to this project, and this effort needs full time work to be successful.

Implementation

Currently, the basic workflow of morphological transformations in Apertium is:

Morphological analysis and generation in lttoolbox, which specify the mapping between the morphological and morphophonological forms of words, is a straightforward process. Lttoolbox merely constructs a tree of paradigm definitions and iterates through all its branches, joining all the possible lemmas and tags present in the dictionary to construct all the word-forms for a root word. There is no facility in lttoolbox for rule-based transformations that modify the morphemes when they are joined together, which makes it hard to construct a dictionary for languages with complex phonological rules.
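The expansion described above can be illustrated with a minimal Python sketch. The dictionary layout, the names `PARADIGMS`, `ENTRIES` and `expand`, and the output format (loosely modelled on what lt-expand prints) are assumptions made for illustration only; they are not the lttoolbox implementation.

```python
# Hypothetical toy dictionary: each paradigm lists (surface suffix, tags).
PARADIGMS = {
    "house__n": [("", "<n><sg>"), ("s", "<n><pl>")],
}

# Hypothetical entries: (stem, lemma, paradigm name).
ENTRIES = [
    ("house", "house", "house__n"),
    ("city", "city", "house__n"),
]


def expand(entries, paradigms):
    """Join every stem with every suffix of its paradigm, by plain concatenation."""
    forms = []
    for stem, lemma, paradigm in entries:
        for suffix, tags in paradigms[paradigm]:
            forms.append(f"{stem}{suffix}:{lemma}{tags}")
    return forms


if __name__ == "__main__":
    for form in expand(ENTRIES, PARADIGMS):
        print(form)
```

Because the morphemes are only concatenated, the plural of "city" comes out as "citys"; without a rule component, the dictionary writer would need a separate paradigm for every such alternation.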

The two-level rule formalism is based on the concept that a language has a set of possible character pairs, or correspondences, and that rules specify the contexts in which these correspondences appear. This comes in handy when lexicons define a morphophonemic level, in which word-forms are patterns built from regular letters and, possibly, morphophonemes whose realisation varies according to context rules.
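A heavily simplified Python sketch of one such context rule follows. It is not the twolc formalism or the hfst-twolc API: the rule tuple layout, the `>` morpheme boundary symbol and the function name `realise` are hypothetical, contexts are checked on the lexical side only, and a single lexical character is allowed to map to a surface string (`y` to `ie`) rather than to strict one-to-one character pairs.

```python
import re

BOUNDARY = ">"  # hypothetical morpheme boundary symbol, realised as nothing

# Hypothetical rule format: (lexical char, surface string, left context, right context).
# Reads as: y is realised as "ie" after a consonant and before boundary + s.
RULES = [
    ("y", "ie", r"[^aeiou]$", r"^>s"),
]


def realise(lexical, rules=RULES):
    """Map a lexical (morphophonemic) string to its surface form.

    Every position is checked against the original lexical string, so the
    rules act in parallel rather than feeding each other.
    """
    surface = []
    for i, char in enumerate(lexical):
        left, right = lexical[:i], lexical[i + 1:]
        for lex, surf, left_ctx, right_ctx in rules:
            if char == lex and re.search(left_ctx, left) and re.search(right_ctx, right):
                surface.append(surf)
                break
        else:
            # Default correspondence: identity, with the boundary deleted.
            surface.append("" if char == BOUNDARY else char)
    return "".join(surface)


if __name__ == "__main__":
    for word in ("city>s", "boy>s", "city"):
        print(word, "->", realise(word))
```

With this single rule, one noun paradigm can serve both "city" and "boy", because the alternation is decided by the context and not by the dictionary entry.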

The implementation of the proposal will go through the following steps:

Writing a converter from HFST format to lttoolbox format

There is a close resemblance between the ‘.dix’ format currently used in lttoolbox and the ‘.lexc’ files of the HFST format, because both describe how the morphemes of a language are joined together. They are essentially counterparts of each other and differ mainly in the terminology used. We need to parse the lexc files, store the entries in a Python dictionary and rewrite them with the appropriate XML tags to construct a corresponding dix file. This module can be modified further whenever we alter the lttoolbox format to support morphographemics and weights. I am already working on this module as part of my coding challenge.
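The parse-then-rewrite idea can be sketched roughly as below. This is not the lexc2dix module from the coding challenge: it assumes a tiny lexc subset (`LEXICON` headers, `upper:lower Continuation ;` entries, `!` comments, no escapes or multichar symbol declarations), maps every lexicon to a `pardef`, and leaves tags as raw strings instead of converting them to `<s n="..."/>` elements. The function names `parse_lexc` and `to_dix` are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_lexc(text):
    """Parse a tiny lexc subset into {lexicon: [(upper, lower, continuation)]}."""
    lexicons = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # '!' starts a lexc comment
        if not line:
            continue
        if line.startswith("LEXICON"):
            current = line.split()[1]
            lexicons[current] = []
        elif current is not None and line.endswith(";"):
            parts = line[:-1].split()
            if len(parts) == 1:  # bare continuation with an empty string pair
                pair, continuation = "", parts[0]
            else:
                pair, continuation = parts[0], parts[1]
            upper, sep, lower = pair.partition(":")
            if not sep:  # no colon: both sides are identical
                lower = upper
            lexicons[current].append((upper, lower, continuation))
    return lexicons


def to_dix(lexicons):
    """Emit a dix-like XML string with one pardef per lexicon (a simplification)."""
    root = ET.Element("dictionary")
    pardefs = ET.SubElement(root, "pardefs")
    for name, entries in lexicons.items():
        pardef = ET.SubElement(pardefs, "pardef", n=name)
        for upper, lower, continuation in entries:
            entry = ET.SubElement(pardef, "e")
            pair = ET.SubElement(entry, "p")
            ET.SubElement(pair, "l").text = lower  # surface side
            ET.SubElement(pair, "r").text = upper  # lexical side
            if continuation != "#":  # '#' marks the end of a word in lexc
                ET.SubElement(entry, "par", n=continuation)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = """
    LEXICON Root
    cat:cat N ;   ! noun stem
    LEXICON N
    +Sg: # ;
    +Pl:s # ;
    """
    print(to_dix(parse_lexc(sample)))
```

Keeping the intermediate dictionary separate from the XML writer means that only `to_dix` would need to change if the dix format later gains elements for rules or weights.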

Adding format for morphographemics

As lttoolbox currently doesn’t support any kind of rule-based character alternations, it is practically impossible to write dictionaries for languages with complex phonologies. Rules in HFST are declared in a separate text file, and hfst-twolc combines them with the lexicons to produce a lexical transducer. Twol compilation is quite a complex task in which small details can easily go wrong. Re-implementing the two-level rule formalism from scratch in lttoolbox is therefore neither efficient nor advisable. Instead, we can fork the essential features from the existing hfst-twolc to enable rule-based morphology in lttoolbox.

We also need to come up with a format for writing rules in lttoolbox that is easy to interpret and not misleading. This format will be decided after a thorough discussion with mentors and other team members.

Writing module for lttoolbox to support morphographemics

A new module in lttoolbox will be necessary for handling the new rule-based morphographemics and for combining the rules with the lexicons to generate results. This module will parse the rules declared in the dix file and process them using the twol formalism backend provided by the HFST library. Finally, it has to combine the results to generate a lexical transducer. Existing modules such as lt-expand and lt-trim will also need to be modified so that they generate proper results after the changes to lttoolbox.

Adding OpenFST as backend to support weights

OpenFst is a library for constructing, combining, optimizing, and searching weighted finite-state transducers. Weighted finite-state transducers are automata where each transition has an input label, an output label, and a weight. The weights can be used to represent the cost of taking a particular transition. Often a weighted transducer is used to represent a probabilistic model (e.g., an n-gram model, pronunciation model). FSTs can be optimized by determinization and minimization, models can be applied to hypothesis sets or cascaded by finite-state composition, and the best results can be selected by shortest-path algorithms. The OpenFst library is a C++ template library which can be incorporated easily by including the appropriate header files. Thus OpenFst can be used as a backend in lttoolbox to support weight-based analysis. The library provides many functions that implement transducer composition, determinization and minimization using weighted acceptor versions of these algorithms.
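To make the role of the weights concrete, here is a small pure-Python sketch of selecting the cheapest analysis by a shortest-path search, with weights added along a path as in the tropical semiring. It does not use OpenFst; the transducer layout, the names `ARCS`, `FINAL` and `best_path`, and the weights themselves are invented for illustration, and epsilon transitions are not handled.

```python
import heapq

# Hypothetical toy transducer: state -> list of (input, output, weight, next state).
ARCS = {
    0: [("a", "x", 0.5, 1), ("a", "y", 1.5, 1)],
    1: [("b", "z", 0.25, 2)],
}
FINAL = {2: 0.0}  # final state -> final weight
DONE = -1         # pseudo-state entered by paying a final weight


def best_path(word, arcs=ARCS, final=FINAL, start=0):
    """Return (total weight, output) of the cheapest accepting path, or None.

    Dijkstra-style search over (input position, state); assumes all
    weights are non-negative.
    """
    heap = [(0.0, 0, start, "")]  # (cost, input position, state, output so far)
    seen = set()
    while heap:
        cost, pos, state, out = heapq.heappop(heap)
        if state == DONE:
            return cost, out
        if (pos, state) in seen:
            continue
        seen.add((pos, state))
        if pos == len(word):
            if state in final:
                heapq.heappush(heap, (cost + final[state], pos, DONE, out))
            continue
        for sym_in, sym_out, weight, nxt in arcs.get(state, []):
            if sym_in == word[pos]:
                heapq.heappush(heap, (cost + weight, pos + 1, nxt, out + sym_out))
    return None


if __name__ == "__main__":
    print(best_path("ab"))
    print(best_path("b"))
```

The input "ab" has two possible outputs here, "xz" and "yz"; the search returns "xz" because its total weight is lower, which is the behaviour a weighted lttoolbox would rely on to rank competing analyses.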

Assigning weights using gold-standard corpora

Every finite state transducer must specify an initial state, the final weights, arc and epsilon counts per state, an FST type name, the FST's properties, how to copy, read and write the transducer, and the input and output symbol tables. To assign weights to states and arcs in the lttoolbox transducer, an unsupervised training method will be implemented. The weights will be calculated using a standard gold-tagged parallel corpus, set up fully in the direction we intend to train for and at least up to the pre-transfer stage in the opposite direction. A method similar to the estimation of rules in the lexical selection process will be used to train the algorithm to estimate weight values from the corpus. The training algorithm will be integrated with the filters present in the OpenFst library, which determine which matches are allowed to proceed in composition and handle correct epsilon matching.
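As a rough illustration of turning corpus counts into weights, the sketch below uses a plain relative-frequency estimate, with the weight of an analysis set to the negative log of its probability given the surface form. This is a simple stand-in chosen for clarity, not the training method the proposal will finally adopt; the function name `estimate_weights` and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def estimate_weights(tagged_corpus):
    """Estimate a weight for each analysis of each surface form.

    tagged_corpus: iterable of (surface form, gold analysis) pairs.
    Returns {surface: {analysis: -log P(analysis | surface)}}, so more
    frequent analyses receive lower (cheaper) weights.
    """
    counts = defaultdict(Counter)
    for surface, analysis in tagged_corpus:
        counts[surface][analysis] += 1

    weights = {}
    for surface, analyses in counts.items():
        total = sum(analyses.values())
        weights[surface] = {
            analysis: -math.log(count / total)
            for analysis, count in analyses.items()
        }
    return weights


if __name__ == "__main__":
    corpus = [("saw", "see<vblex><past>")] * 3 + [("saw", "saw<n><sg>")]
    for analysis, weight in estimate_weights(corpus)["saw"].items():
        print(f"{analysis}\t{weight:.3f}")
```

Negative log probabilities add along a path, so weights of this kind fit the shortest-path selection that weighted transducers use. Analyses never seen in the corpus get no weight at all in this sketch, so a real implementation would need some form of smoothing.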

Backporting

As of now there is no concrete prototype of what the new dix file and lttoolbox will look like after the completion of this project, but I expect the changes to the codebase and the format to be drastic once the new features are added. Lttoolbox is an integral part of Apertium and we cannot afford for this module to generate errors during morphological analysis and generation. Thus, we have to ensure that lttoolbox remains compatible with the older version of dix files.

Documentation

When working in a collaborative environment, maintaining proper documentation is very important. Users and new developers also rely on documentation, tutorials and the wiki to learn about changes in the codebase. Therefore proper documentation, including comments within the code, has to be written to help users and developers familiarize themselves with the new developments in lttoolbox and their usage.

Testing and CI builds

In software development and open source projects, automated testing is very important for maintaining a huge codebase. A 100% code coverage policy for tests should be followed to keep the code as error-free and efficient as possible. Apertium also uses Travis CI for continuous integration builds after its migration to the GitHub platform. Test modules allow maintainers to keep proper track of the existing codebase and prevent non-mergeable code from being merged into the main codebase, which prevents build errors. It is therefore necessary to write tests for every addition to the codebase.

Test modules have to be written for the new modifications in the lttoolbox transducer and the lexc2dix module, along with any necessary modifications to the existing Python test files.

Testing and documentation will always proceed side by side with the implementation of the other milestones.

Road to the future

Both the lexical weighting and the morphographemics implementation are complex processes and have a high chance of failing in their first iteration. For developers to benefit from the support for morphographemics and weights in lttoolbox, these implementations must be robust enough to handle all kinds of errors and fallbacks. The major stone left unturned after this project is the case-specific debugging and error-handling mechanism, which will take some time: it means iterating through the entire process again and fixing the issues that arise after test runs of the new lttoolbox.

Timeline

MILESTONES

Community Bonding Period (April 23 - May 13)

I will do extensive research on the architecture currently used for morphological analysis. I will spend most of the proposal review period and the community bonding period on improving and documenting the lexc2dix module and on reading materials for the project. As I am aiming for a full-fledged implementation of the proposal, I will go through spectei’s tutorial and the Two-level compiler book thoroughly during this period to gain a better understanding of twolc rules and of the implementation of weighted lexicons.

All possible changes to the proposal and their implementation details will be discussed with the mentors, and changes will be made accordingly. I will also work on existing issues to gain a stronger hold on the codebases of HFST and lttoolbox. Any modifications to the timeline will be made during this period.

Coding Phase I (May 14 - June 10)

Week 1 (May 14 - May 20)

In this week, most of the work will be forking the essential features from HFST to be used for handling the twol formalism in lttoolbox.
As HFST already has a working implementation of the two-level rule formalism, we can use it as a library and build the twol implementation in lttoolbox on top of it.

Week 2 (May 21 - May 27)

Now we need some kind of validation in lttoolbox for the new additions to the dix files, to prevent unwanted additions. We can make appropriate changes in apertium-lint and in the Apertium validation script to check the format of the dictionary.
Proper debugging and refinements will be done to have a well-functioning twol handling module in lttoolbox.
By the end of the second week, the refined version of the module for handling rules in the lttoolbox transducer will be available. A GSoC blog post will be written describing this module and how it is going to help.

Week 3 (May 28 - June 3)

Now that we have a module for handling rules in lttoolbox, we need to align it properly with the semantics of lttoolbox.
A proper parsing library has to be written to parse through the definition of the rules in the dix file and generate lexicons corresponding to them.

Week 4 (June 04 - June 10)

Any possible improvements and modifications to the rule parsing library will be done.
Tests and documentation for the changes in the code will be written during this period. A proper tutorial describing the lt-twol module will also be written.
This will basically be the buffer week for the first coding phase. Any minor issues will be fixed in this period to ensure the mergeability of the code into the core repository. As this is the last week of the first coding phase, I will blog about my GSoC experience too.

Coding Phase II (June 11 - July 8)

Week 5 (June 11 - June 17)
Week 6 (June 18 - June 24)
Week 7 (June 25 - July 1)
Week 8 (July 2 - July 8)

Coding Phase III (July 9 - August 14)

Week 9 (July 9 - July 15)
Week 10 (July 16 - July 22)
Week 11 (July 23 - July 29)
Week 12 (July 30 - August 5)
Week 13 (August 6 - August 14)

The GSoC timeline is all about commitment, and I understand that doing things properly is more important than doing more things.

Therefore I have not included support for weights throughout the pipeline (analysis, disambiguation, lexical transfer/selection, translation, chunking) in my GSoC timeline. The addition of weights will be mostly basic and will cover monolingual dictionaries only. After discussions with my mentors I will try to improve my timeline by including some more tasks or discarding some, to make the best of the GSoC period.

Coding Challenges

Install Apertium.
lexc2dix → hfst lexc/twolc to monodix++ converter.

Other Commitments

I consider GSoC a full-time job and do not have any other commitments as such. I only have my departmental summer course during this period, which won’t take much of my time. I am going to spend most of my time coding and reading the wiki and articles for my project. I am willing to contribute 40+ hours per week to my GSoC project and may extend that if required to fulfil the milestones of my timeline. In case of any unplanned or short-term commitments, I will ensure that my mentors are informed about my unavailability beforehand and will adjust my schedule to compensate. My university has end-semester examinations at the beginning of May and I will take a short vacation from 10th May to 15th May. So I will be a bit occupied during the Community Bonding Period, but I’ll ensure that it won’t affect my dedication to GSoC.
