Difference between revisions of "User:Techievena/Proposal"
Revision as of 16:27, 30 March 2018
- 1 Contact Information
- 2 Code Sample
- 3 Interest in machine translation
- 4 Interest in Apertium project
- 5 Project Info
- 6 Coding Challenges
- 7 Other Commitments
Name: Abinash Senapati
IRC Nickname: Techievena
E-mail address: firstname.lastname@example.org
Time Zone: UTC +5:30 (IST-India)
Location: Bhubaneswar, India or Bangalore, India
GSoC blog URL: https://techievena.github.io/categories/GSoC
School/Degree: B.Tech. in Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, India
Expected Graduation Year: 2019
I have been actively involved with open source organisations in one way or another for more than a year now. During this time I have learned a lot about the do’s and don’ts of open source communities. I have helped the open source community at my university conduct talks and workshops, helped the coala community maintain integrity, and helped a number of new developers bootstrap their open source journey. I was also a Google Code-in mentor for the coala organisation during 2017-18.
I have previous experience of working with Apertium, and my commits have been merged into the main Apertium repository. I have improved the apertium-lint tool, a GSoC 2016 project of Apertium:
I have also been involved with coala for more than a year now and am a member of its aspects team. I am also a member of the open source development team of gitmate. All my contributions to open source projects are listed below:
Other significant contributions I would like to mention:
- I have been a regular, daily reviewer and have helped with newcomer orientation by assisting newcomers when they get stuck. The link to reviews:
- I have also opened a significant number of issues in some repositories:
- I have created a NuGet package for LLVM 3.4.1 for use by coala and am currently working to improve it as part of an issue. Here is the link for that package.
University projects/ Personal projects
Interest in machine translation
It is hard to regard language as merely a method of communication. The experiences, history and world views of human beings have been soaked deeply into language since the beginning of the world. I come from a country that houses over 1600 languages, of which around 150 have a sizeable speaking population even today. Most of them are on the verge of extinction, and since they have orderly morphological rules, rule-based machine translation is the most promising and viable option for their restoration.
As I am taking the Formal Languages and Automata Theory course this semester at my university, I am familiar with deterministic finite automata, non-deterministic finite automata, weighted transducers, finite-state machines and other concepts essential for taking up a project in machine translation.
Interest in Apertium project
Apertium is a rule-based machine translation platform. I began working on Apertium about a year ago to modify the apertium-lint tool to support multiple configuration files, as this was essential for writing the ApertiumlintBear module for coala.
Since then, I have wanted to explore the territory of linguistics and machine translation. Apertium is a long-established and popular project and is even used by Wikimedia for translation purposes. It is a very active open-source organisation whose maintenance and development cycles are regular, and it is therefore a great platform for newcomers to learn from veterans through interaction and contribution.
Apertium: Extend lttoolbox to have the power of HFST
Lttoolbox is a toolbox in the Apertium project for lexical processing, morphological analysis and generation of words. With the current Apertium format it is quite hard to perform complex morphological transformations with lttoolbox. This project aims at extending the lttoolbox transducer with support for morphographemics and lexical weights in order to enable more complex translations. By the end of this project, lttoolbox will be sophisticated enough to support most of the language pairs.
The aim of this project is to implement the support for morphographemics and weights in the lttoolbox transducer. The proposal focuses on extending lttoolbox to perform the complex morphological transformations and weight based analyses currently done in HFST and writing a module that translates the current HFST format to the new lttoolbox format.
Motivation / Benefit to the society
Currently, most language pairs in Apertium use lttoolbox because of its straightforward syntax and its useful features for validating lemmas, but it is not suited to agglutinative languages or to languages with complex and regular morphophonology. This gap makes things harder for new developers, who must learn to work with both platforms in order to improve or start a language pair in Apertium. Hence the need for a single module in Apertium that serves all language pairs and makes the software more developer-friendly.
Reason for sponsorship
I am a volunteer and am working in my free time to improve Apertium. I am a student and I need to earn money during the summers one way or another. A sponsorship can reinforce my self-confidence and give me greater powers of resilience. The writing of a preprocessor for lttoolbox is a huge effort, as a big rework is required for this, but with high benefits in the area of machine translation. I would like to flip bits instead of burgers this year. Making a full job out of this means that I can focus all of my energy and dedication to this project, and this effort needs full time work to be successful.
Currently, the basic workflow of morphological transformations in Apertium pipeline is:
Morphological analysis and generation in lttoolbox, which specifies the mapping between the morphological and morphophonological forms of words, is a straightforward process. It merely constructs a tree of paradigms and iterates through all its branches, joining all the possible lemmas and tags present in the dictionary to construct all the word-forms for a root word. lttoolbox has no facility for rule-based transformations that modify morphemes when they are joined together, which makes it hard to construct a dictionary for languages with complex phonological rules.
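The concatenative expansion described here can be pictured with a minimal sketch. It is not lttoolbox's actual implementation: `PARADIGMS`, `ENTRIES` and `expand` are hypothetical names, and the data is a toy English fragment chosen only for illustration.

```python
# Toy model of paradigm expansion; all names and data are illustrative only.
# Each paradigm maps a surface suffix to the tags it contributes.
PARADIGMS = {
    "house__n": [("", "<n><sg>"), ("s", "<n><pl>")],
    "walk__vblex": [
        ("", "<vblex><inf>"),
        ("ed", "<vblex><past>"),
        ("ing", "<vblex><ger>"),
    ],
}

# Each dictionary entry pairs a stem (here identical to the lemma) with a paradigm.
ENTRIES = [("house", "house__n"), ("walk", "walk__vblex")]


def expand(entries, paradigms):
    """Yield (surface form, lexical form) pairs by plain concatenation."""
    for stem, par_name in entries:
        for suffix, tags in paradigms[par_name]:
            yield stem + suffix, stem + tags


if __name__ == "__main__":
    for surface, lexical in expand(ENTRIES, PARADIGMS):
        print(f"{surface}:{lexical}")
```

Because the suffix is only concatenated, an entry such as "city" with the suffix "s" would come out as "citys" in this model unless a separate paradigm is listed for it, which is the kind of gap rule-based morphographemics is meant to close.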
The two-level rule formalism is based on the concept that a language has a set of possible character pairs, or correspondences, and that rules specify the contexts in which these correspondences appear. It comes in handy where lexicons define a morphophonemic level, in which word-forms are patterns built from regular letters and possibly morphophonemes whose realisation depends on context rules.
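As a rough illustration of character correspondences constrained by context, the sketch below applies a handful of pair rules to a lexical string. It is a deliberate simplification and not hfst-twolc: contexts are plain strings checked on the lexical side only, the first matching rule wins, and the `Rule` type, the `>` boundary symbol and the example rules are all assumptions made for this example.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    lexical: str  # symbol on the lexical (morphophonemic) level
    surface: str  # symbol it corresponds to on the surface ("" means deleted)
    left: str     # required left context on the lexical level ("" means any)
    right: str    # required right context on the lexical level ("" means any)


# Hypothetical toy rules for English plural formation; ">" marks a morpheme boundary.
RULES: List[Rule] = [
    Rule("y", "i", "", ">s"),   # y:i before the boundary followed by s
    Rule(">", "e", "y", "s"),   # boundary realised as e between y and s
    Rule(">", "", "", ""),      # otherwise the boundary is deleted
]


def realise(lexical_form: str, rules: List[Rule]) -> str:
    """Map a lexical string to a surface string, one symbol pair at a time."""
    out = []
    for i, symbol in enumerate(lexical_form):
        left, right = lexical_form[:i], lexical_form[i + 1:]
        for rule in rules:
            if (symbol == rule.lexical
                    and left.endswith(rule.left)
                    and right.startswith(rule.right)):
                out.append(rule.surface)
                break
        else:
            out.append(symbol)  # default correspondence: the symbol maps to itself
    return "".join(out)


if __name__ == "__main__":
    for form in ("city>s", "cat>s", "city"):
        print(form, "->", realise(form, RULES))
```

With these toy rules, `city>s` is realised as `cities` while `cat>s` becomes `cats`, so one lexicon entry per stem suffices. Real two-level rules can also refer to the surface side of the context and are compiled into transducers rather than interpreted like this.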
The implementation of the proposal will go through the following steps:
- Writing a converter from HFST format to lttoolbox format
- Writing a module to support weights in lttoolbox
- Assigning weights using gold-standard corpora
- Adding format for morphographemics
- Writing module for lttoolbox to support morphographemics
- Testing and CI builds
There is a close resemblance between the ‘.dix’ format currently used in lttoolbox and the ‘.lexc’ file in the HFST format, because both describe how the morphemes of a language are joined together. They are essentially counterparts of each other and differ mainly in terminology. We need to parse the lexc files, store the entries in a Python dictionary, and rewrite them with the appropriate XML tags to construct a corresponding dix file. This module can be modified further once we alter the lttoolbox format to support morphographemics and weights. I am already working on this module as a part of my coding challenge.
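The parse-then-rewrite idea can be sketched as follows. This is not the lexc2dix code from the coding challenge; it is a minimal illustration that assumes a tiny subset of lexc (no multichar symbol declarations, no quoting or escapes) and maps every non-root LEXICON to a `<pardef>`. The function names `parse_lexc` and `to_dix` are hypothetical, and a real dix file would also need `<alphabet>` and `<sdefs>`.

```python
import xml.etree.ElementTree as ET

LEXC = """\
LEXICON Root
cat N ;
dog N ;

LEXICON N
+N+Sg:0 # ;
+N+Pl:s # ;
"""


def parse_lexc(text):
    """Return {lexicon name: [(upper, lower, continuation), ...]}."""
    lexicons, current = {}, None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # "!" starts a lexc comment
        if not line:
            continue
        if line.startswith("LEXICON"):
            current = line.split()[1]
            lexicons[current] = []
            continue
        body = line.rstrip(";").split()
        # An entry is either "pair Continuation ;" or a bare "Continuation ;".
        pair, cont = ("", body[0]) if len(body) == 1 else (body[0], body[1])
        upper, sep, lower = pair.partition(":")
        if not sep:
            lower = upper  # no colon: both sides are identical
        # "0" denotes the empty string in lexc.
        upper, lower = ("" if s == "0" else s for s in (upper, lower))
        lexicons[current].append((upper, lower, cont))
    return lexicons


def to_dix(lexicons, root="Root"):
    """Build a dix-like XML tree: the root lexicon becomes the main section."""
    dix = ET.Element("dictionary")
    pardefs = ET.SubElement(dix, "pardefs")
    section = ET.SubElement(dix, "section", id="main", type="standard")
    for name, entries in lexicons.items():
        parent = section if name == root else ET.SubElement(pardefs, "pardef", n=name)
        for upper, lower, cont in entries:
            e = ET.SubElement(parent, "e")
            p = ET.SubElement(e, "p")
            ET.SubElement(p, "l").text = lower  # surface side
            r = ET.SubElement(p, "r")           # lexical side
            stem, *tags = upper.split("+")
            r.text = stem
            for tag in tags:
                ET.SubElement(r, "s", n=tag.lower())
            if cont != "#":                     # "#" ends the word in lexc
                ET.SubElement(e, "par", n=cont)
    return dix


if __name__ == "__main__":
    print(ET.tostring(to_dix(parse_lexc(LEXC)), encoding="unicode"))
```

Keeping the parsed entries in a plain dictionary keeps the two halves independent, so the XML writer can change when the dix format gains weights or rules without touching the lexc parser.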
Weights in lttoolbox will be defined in the usual manner, with smaller weights indicating the better form to generate in case of free variation. In a weighted finite-state transducer each arc, and consequently each path, is assigned a weight. The weight of a path is usually the ⊗-combination of the weights of its arcs, and the weights of alternative paths are combined with ⊕, where ⊗ and ⊕ are the multiplication and addition operators of the weight semiring. In lexc files we define weights for specific arcs in a special section after the continuation class. We need a similar approach in the dix files for assigning weights to the arcs. The weights help determine the most desirable analyses for a morpheme. We therefore have to come up with a module in lttoolbox that selects and combines the paths with the least weights, as long as a valid transduction is possible, to produce the most appropriate forms after morphological analysis.
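To make the path-weight idea concrete, here is a small sketch that assumes the tropical semiring, where ⊗ is ordinary addition along a path and ⊕ is the minimum over alternative paths. The `Arc` and `Path` aliases, the function names and the example weights are hypothetical and only serve the illustration.

```python
from typing import List, Tuple

Arc = Tuple[str, float]  # (output label, arc weight)
Path = List[Arc]


def path_weight(path: Path) -> float:
    """Tropical "times": the weight of a path is the sum of its arc weights."""
    return sum(weight for _, weight in path)


def best_analysis(paths: List[Path]) -> Tuple[str, float]:
    """Tropical "plus": among alternative paths, keep the one with minimal weight."""
    best = min(paths, key=path_weight)
    return "".join(label for label, _ in best), path_weight(best)


if __name__ == "__main__":
    # Two competing analyses of the surface form "saw"; the weights are made up.
    paths = [
        [("saw", 0.0), ("<n>", 1.5), ("<sg>", 0.0)],
        [("see", 0.0), ("<vblex>", 0.5), ("<past>", 0.0)],
    ]
    print(best_analysis(paths))
```

In this sketch `min` raises `ValueError` when no path exists, which corresponds to the case where no valid transduction is possible and must be handled by the caller.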
As dictionary files in lttoolbox already have different sections, implementing weights per section is much easier than introducing arc weights in the transducer. In this way we can keep heuristics in sections with higher weights, so that they don't pollute the output of the "good" known analyses.
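One way to picture section-level weights is the sketch below, in which a lookup consults sections in order of increasing weight and stops at the first section that yields an analysis. The `SECTIONS` structure, the `guesser` section, the `<unk>` tag and the `analyse` function are hypothetical; the actual design would be settled with the mentors.

```python
# Hypothetical section table: a lower weight means a more trusted section.
SECTIONS = [
    {"id": "guesser", "weight": 10.0,
     "entries": {"cats": ["cats<unk>"], "blorps": ["blorp<n><pl>"]}},
    {"id": "main", "weight": 0.0,
     "entries": {"cats": ["cat<n><pl>"]}},
]


def analyse(surface, sections):
    """Return (analysis, weight) pairs from the lowest-weight section that matches."""
    for section in sorted(sections, key=lambda s: s["weight"]):
        found = section["entries"].get(surface)
        if found:
            return [(analysis, section["weight"]) for analysis in found]
    return []


if __name__ == "__main__":
    print(analyse("cats", SECTIONS))    # answered by the trusted "main" section
    print(analyse("blorps", SECTIONS))  # falls back to the heuristic section
```

The heuristic section is consulted only when the trusted one has nothing to offer, so its guesses never appear next to a known analysis.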
Every finite-state transducer must specify an initial state, the final weights, arc and epsilon counts per state, an FST type name, the FST's properties, how to copy, read and write the transducer, and the input and output symbol tables. To assign weights to states and arcs in the lttoolbox transducer, an unsupervised training method will be implemented. The weights will be calculated using standard gold-tagged parallel corpora, set up fully in the direction we intend to train for and at least up to the pre-transfer stage in the opposite direction. A method similar to the estimation of rules in the lexical selection process will be used to estimate the weight values from the corpora. The training algorithm will be integrated with the filters in the OpenFst library, which determine which matches are allowed to proceed in composition and handle correct epsilon matching.
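The proposal does not fix an estimation formula, so the following is only one plausible sketch: it turns relative frequencies from a tagged corpus into weights using the negative logarithm, so that more frequent analyses receive smaller weights. The function name `estimate_weights` and the toy corpus are assumptions, and no smoothing is applied.

```python
import math
from collections import Counter, defaultdict


def estimate_weights(tagged_tokens):
    """Map (surface, analysis) to -log of the analysis's relative frequency."""
    counts = defaultdict(Counter)
    for surface, analysis in tagged_tokens:
        counts[surface][analysis] += 1
    weights = {}
    for surface, analyses in counts.items():
        total = sum(analyses.values())
        for analysis, n in analyses.items():
            weights[(surface, analysis)] = -math.log(n / total)
    return weights


if __name__ == "__main__":
    # Toy "gold" corpus: the verb reading of "saw" is three times as frequent.
    corpus = [("saw", "see<vblex><past>")] * 3 + [("saw", "saw<n><sg>")]
    for key, weight in sorted(estimate_weights(corpus).items()):
        print(key, round(weight, 3))
```

Negative log probabilities add along a path, which fits the convention that smaller weights mark the preferred form. Analyses never seen in the corpus get no weight at all in this sketch, so a real implementation would need a default or smoothing.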
As lttoolbox currently doesn’t support any kind of rule-based character alternation, it is practically impossible to write dictionaries for languages with complex phonologies. Rules in HFST are declared in a separate text file, and hfst-twolc combines lexicons based on these rules to produce a lexical transducer. Twol compilation is quite a complex task in which small details can easily go wrong if not handled properly. Implementing the two-level rule formalism in lttoolbox from scratch is therefore neither efficient nor advisable. We can fork essential features from the existing hfst-twolc to enable rule-based morphology in lttoolbox.
We also need to come up with a format for writing rules in lttoolbox which includes deciding on the input symbols and conditions for deletion, insertion and symbol change. The format to be implemented must be easy to interpret and not misleading in any case. Therefore this format will be decided after a thorough discussion with mentors and other team members.
A new module in lttoolbox will be necessary for handling the new rule-based morphographemics and for combining the rules with the lexicons to generate results. This module will parse the rules declared in the dix file and process them with the help of the twol formalism backend provided by the HFST library. Finally, it has to combine the results to generate a lexical transducer.
We also need to modify existing lttoolbox modules such as lt-expand and lt-trim so that they produce proper results in sync with the new alterations in lttoolbox.
As of now there is no concrete prototype of what the new dix file and lttoolbox will look like after the completion of this project, but it is quite evident that changes in the codebase and the new format will be drastic after addition of the new features. Lttoolbox is an integral part of Apertium and we cannot afford to generate errors during any kind of morphological analysis and generation process using this module. For that we have to ensure that lttoolbox remains compatible with the older version of dix files. Thus, we must backport all the necessary changes to previous versions of lttoolbox for the smooth functioning of the lttoolbox module.
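One way to keep older dix files working is to make every new attribute optional with a neutral default. The sketch below assumes a hypothetical `w` attribute on `<e>` elements (the actual format is still to be decided) and reads both an old-style and a new-style fragment with the same code; `read_entries` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# An old-style entry without weights and a new-style entry with the assumed "w" attribute.
OLD_DIX = '<section id="main" type="standard"><e lm="cat"><i>cat</i><par n="N"/></e></section>'
NEW_DIX = '<section id="main" type="standard"><e lm="saw" w="1.5"><i>saw</i><par n="N"/></e></section>'


def read_entries(xml_text, default_weight=0.0):
    """Return (lemma, weight) pairs; entries without a weight get the default."""
    root = ET.fromstring(xml_text)
    return [(e.get("lm"), float(e.get("w", default_weight))) for e in root.iter("e")]


if __name__ == "__main__":
    print(read_entries(OLD_DIX))
    print(read_entries(NEW_DIX))
```

With a default of zero, an unweighted dictionary behaves exactly as before, because every path then has the same total weight.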
When working in a collaborative environment, maintaining proper documentation is very important. Users and new developers also rely on documentation, tutorials and the wiki to learn about changes in the codebase. Therefore proper documentation, including comments within the code, has to be written to help users and developers familiarize themselves with the new developments in lttoolbox and their usage.
The previously written wiki and documentation will also require a proper review, and maybe a rewrite in case of any significant modifications to the features.
In software development and open source projects, automated testing is very important for maintaining a large codebase. A 100% code coverage policy for tests should be followed to keep the code as error-free and efficient as possible. Apertium has also used Travis CI for continuous integration builds since its migration to GitHub. Test modules allow maintainers to keep track of the existing codebase and prevent non-mergeable code from being merged into the main codebase and causing build errors. It is therefore necessary to write tests for every addition to the codebase.
Test modules have to be written for the new modifications in the lttoolbox transducer and the lexc2dix module along with any possible modifications in the existing python test files.
Testing and documentation will always go on side by side along with the implementation of other milestones.
Road to the future
Lexical weighting and the implementation of morphographemics are both complex processes and are unlikely to achieve full coverage in the first iteration of their implementation. For developers to benefit from the support for morphographemics and weights in lttoolbox, these implementations must be robust enough to handle all kinds of errors and fallbacks. The major stone left unturned after this project is case-specific debugging and error handling, which will take some time spent reiterating through the entire process and fixing the issues that arise after test runs of the new lttoolbox.
Community Bonding Period (April 23 - May 13)
I will do extensive research on the architecture currently used for morphological analysis. I will spend most of the proposal review period and the community bonding period on improving and documenting the lexc2dix module and reading materials for the project. As I am aiming for a full-fledged implementation of the proposal, I will need to go through spectei’s tutorial and the Two-level compiler book thoroughly during this period to gain a better understanding of twolc rules and the implementation of weighted lexicons.
All the possible changes and modifications in proposal and their implementation details will be discussed with mentors, and changes will be made accordingly. I will also work on the existing issues to gain a stronger hold on the codebase of hfst and lttoolbox. All possible modifications in the timeline schedules will be made during this period.
Coding Phase I (May 14 - June 10)
- Week 1 (May 14 - May 20)
- This phase will be entirely centered around implementation of weights in Apertium. A new module will be introduced in lttoolbox for this purpose.
- This week’s task will mostly be setting up the basic framework for the module that will handle weighted lexicons and arcs. We also need to set up the function endpoints for composition, determinization, epsilon removal and other useful algorithms.
- Week 2 (May 21 - May 27)
- Up to now we mostly have a basic structure of the module for handling weights. We need to extend the capabilities of this module to select and combine the arcs and sections with minimum weights to generate the most desirable output string, as long as a valid transduction is possible.
- Details about the newly implemented module for handling weights will be blogged during this period. Documentation work also has to be done during this period to let Apertium developers know what this new module in lttoolbox is all about.
- Week 3 (May 28 - June 3)
- As we now have an initial, primitive implementation of a weighted transducer in hand, we need to extend it to cover all kinds of cases and fallbacks.
- We now need to make these tools robust by debugging and handling error cases.
- Entries with no weights will be handled appropriately and the output with the higher weights will be selected if lower weight combinations are invalid.
- Week 4 (June 4 - June 10)
- Proper tests and documentation will be written for the changes done till now to ensure mergeability of the code in the upstream branch of lttoolbox.
- As this is the buffer week, all the changes will be properly reviewed and minor fixes will be done in the code if required.
- The new weighted transducer will be the topic of discussion in the blog for the week.
Coding Phase II (June 11 - July 8)
- Week 5 (June 11 - June 17)
- Now that we have the necessary tools for acquiring and modifying weights and the syntax for weights, we must now pay attention to adding weighted arcs, sections and paradigms in lttoolbox.
- It is practically impossible to examine each entry manually and allocate weights. Thus we must have a training algorithm to allocate weights to each of these fragments of lttoolbox in an unsupervised manner.
- Week 6 (June 18 - June 24)
- In this week, most of the work will be forking the essential features from hfst to be used for handling twol formalism in the lttoolbox.
- As hfst already has a working implementation of two level rule formalism, we can use it as a library and build the twol implementation in lttoolbox on the top of it.
- The format of writing rules in the dictionary files and symbols as discussed in the community bonding period will be put on a test run to check its viability.
- Week 7 (June 25 - July 1)
- Now we need some kind of validation in lttoolbox for the new additions in the dix files, to prevent unwanted additions. We can make appropriate changes in the apertium validation script to check the format of dictionary.
- Proper debugging and refinements will be done to have a well functioning twol handling module in lttoolbox.
- By the end of this week, the second one spent on rules, the refined version of the module for handling rules in the lttoolbox transducer will be available. A GSoC blog post will describe this module and how it is going to help.
- Week 8 (July 2 - July 8)
- Tests and documentation for the changes in code will be written during this period. A tutorial describing the lt-twol module will also be written.
- This will basically be the buffer week for the second coding phase. Any minor issues will be fixed in this period to ensure mergeability of the code into the core repository. As this is the last week of the second coding phase, I will blog my GSoC experience too.
Coding Phase III (July 9 - August 14)
- Week 9 (July 9 - July 15)
- Now that we have a module for handling rules in lttoolbox, we need to align it properly with the semantics of lttoolbox.
- A proper parsing library has to be written to parse through the definition of the rules in the dictionary file and generate lexicons corresponding to them.
- Week 10 (July 16 - July 22)
- Any possible improvements and modifications to the rule parsing library will be done.
- We now need to extend the utility of weights to rule-based analysis. The training algorithm for weights will be extended to rules in the new lttoolbox. There will also be rules for allocating weights to entries in the monolingual dictionaries of Apertium.
- Proper test modules will be defined to reduce the likelihood of errors in the codebase.
- The GSoC blog will also be maintained, and all the exciting developments related to this project will be written down.
- Week 11 (July 23 - July 29)
- To inform all users and developers about the major changes, the changes have to be well documented. So this week will be all about maintaining existing documentation and writing new documentation that explains the changes in the codebase. There will also be a tutorial explaining how to write and modify the new dix files, i.e. the "Starting a language with lttoolbox" wiki page will be updated.
- All the updates required in the wiki will be done during this period, and all the previously written documentation will be properly reviewed.
- Week 12 (July 30 - August 5)
- All the work related to backporting and maintaining backwards compatibility will be completed in this week.
- There are high chances that these additions may lead to compilation errors in the Apertium pipeline. Any new chunk of code causing any kind of errors and conflict with the previous format of lttoolbox will be examined thoroughly and fixed.
- This week will basically be the pre-final buffer week. All the previously made commits will be merged into master branch. Any minor changes if required for debugging or enhancement, will be made during this period.
- The final GSoC blog will be written explaining all about my work and commits made throughout the GSoC time period.
- Week 13 (August 6 - August 14)
- This will be the final buffer week of my GSoC period. A proper analysis will be done on the merged commits and issues regarding extending and implementing new aspect classes will be opened up during this period.
- As the proper integration of weights has been done and the basic skeleton for the implementation of rules has been simplified by the end of this project, lttoolbox will be ready now for all the languages that are supported by Apertium.
GSoC timeline is all about commitment and I really understand that doing things properly is more important than doing more things.
Therefore I have not included support for weights throughout the pipeline (analysis, disambiguation, lexical transfer/selection, translation, chunking) in my GSoC timeline. The addition of weights will be mostly basic and limited to the monolingual dictionary formats only. After discussions with my mentors I will try to improve my timeline by including some more tasks or scrapping some, to make the most of the GSoC period.
- Install Apertium.
- lexc2dix → hfst lexc/twolc to monodix converter.
I consider GSoC as a full time job and do not have any other commitments as such. I only have my departmental summer course during this period which won’t take much of my time. I am going to spend most of my time coding and reading wiki and articles for my project. I am willing to contribute 40+ hours per week for my GSoC project and may extend that if required to fulfill the milestones of my timeline. In case of any unplanned schedules or short-term commitments, I will ensure that my mentors are informed about my unavailability beforehand and make suitable adjustments in my schedule to compensate for that. My university has end semester examinations in the beginning of the month of May and I will take a short vacation from 10th May to 15th May. So, I will be a bit occupied during the Community Bonding Period, but surely I’ll ensure that it won’t affect my dedication for GSoC.