User:Ergaurav3/GSOC Application1:Unify the metadix formats
- 1 Contact Information
- 2 Why is it you are interested in machine translation?
- 3 Why is it that you are interested in the Apertium project?
- 4 Which of the published tasks are you interested in? What do you plan to do?
- 5 Work plan
- 6 Work Timeline
- 7 List your skills and give evidence of your qualifications.
- 8 List non-Summer-of-Code plans you have for the Summer
Name: Gaurav Agrawal
SourceForge : ergaurav2
WebLink : http://web.iiit.ac.in/~gaurav.agrawal/
Why is it you are interested in machine translation?
Machine translation is the field that makes communication possible between different parts of the world. It is what allows people who do not share a language to communicate with and understand each other.
Human translators are not always available, and relying on them is often not feasible.
It is one of the most interesting emerging fields, with a lot of work to do and a lot to learn. It has an impact not just on one part of society but on the whole world.
Why is it that you are interested in the Apertium project?
As I am interested in machine translation, the best way to explore and learn more in this field is through open source. In open source, people from different regions contribute, which is the heart of a good machine translation system. Apertium is the best open source project I have found in the field: it is easy to understand and contribute to, and it has very active and supportive contributors.
Which of the published tasks are you interested in? What do you plan to do?
Unify the Metadix Formats
Why Google and Apertium should sponsor it
Improving the current language pairs and adding new ones is a continuous process in the Apertium project. The meta dictionaries vary between language pairs, and this variation grows as new language pairs are added. The variation is becoming more complex, and the different metadix formats largely depend on the developer of each language pair. So there is a need to remove this complexity and variation to keep things simple for the system.
How and whom it will benefit in society
This will make the language pairs that use meta dictionary formats easier to use, to understand and to contribute to. So it will help both the users of the Apertium project and its contributors. This will definitely help the whole community to grow.
What work have I already done?
I have been involved in Apertium for the last month and have had many discussions about the project and the coding challenge on the IRC channel and the mailing list.
Installing Apertium; joining the mailing list, IRC and SourceForge
Working with community members over the last month through IRC and the mailing list
Understanding the dictionary formats
Understanding the different types of meta dictionary formats
Role of the alternative/variant (alt) in the dictionary.
I have successfully completed the coding challenge, and it has been reviewed by the mentor.
It is available on GitHub. Link: Git Hub Repo Link
Findings about the topics in the project
Metadix files have to be pre-processed with XSLT files to convert them into the dictionary format. The language pairs en-ca, en-es, fr-es, oc-ca and oc-es contain metadix files. The metadix files are converted into the dictionary format using two XSLT files, buscPar.xsl and principal.xsl.
buscPar.xsl: It is used to do the sorting and to create an intermediate xsl file with the list of verbs that use metaparadigms; uniq is also run to delete the duplicates.
principal.xsl: This stylesheet contains the rules that convert the <prm> and <sa> tags to produce the dictionary format (.dix).
There are two types of meta dictionary (metadix) files:
One contains the <prm> tags and the other contains the <sa> tags
<sa/> tags: en-ca, en-es (en.metadix)
<prm/> tags: br-fr, fr-es, oc-ca, oc-es (fr.metadix, oc.metadix)
The tag <prm/> is a marker used to place a variable part of the text in the paradigm definition. The tag <sa/> is placed where an optional grammatical symbol (tag) should appear.
For the case of <sa/>, we have: <e lm="time"><i>time</i><par n="house__n" sa="unc"/></e>
In the pardef, <sa/> will be replaced by <s n="unc"/>.
For the case of <prm/>, if we have something like: <e lm="acuélher"><i>acu</i><par n="m/é[T]er__vblex" prm="lh"/></e>
In the pardef, <prm/> will be replaced by the string "lh".
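The two substitutions described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the logic of buscPar.xsl or principal.xsl: the function names are hypothetical, the pardef contents used in the demo are made up for the example, and dropping <sa/> when an entry gives no sa value is an assumption based on the symbol being optional.

```python
import copy
import xml.etree.ElementTree as ET


def _splice_text(parent, index, text):
    """Remove parent[index] and merge `text` plus the child's tail into the
    text that precedes it, so the surrounding characters stay in order."""
    child = parent[index]
    merged = text + (child.tail or "")
    if index == 0:
        parent.text = (parent.text or "") + merged
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + merged
    parent.remove(child)


def expand_pardef(pardef, prm="", sa=None):
    """Return a copy of `pardef` with <prm/> replaced by the string `prm`
    and <sa/> replaced by <s n="sa"/> (or dropped when `sa` is None,
    which is an assumption of this sketch)."""
    expanded = copy.deepcopy(pardef)
    for parent in list(expanded.iter()):
        i = 0
        while i < len(parent):
            child = parent[i]
            if child.tag == "prm":
                _splice_text(parent, i, prm)  # next child now sits at index i
            elif child.tag == "sa" and sa is None:
                _splice_text(parent, i, "")
            elif child.tag == "sa":
                symbol = ET.Element("s", {"n": sa})
                symbol.tail = child.tail
                parent[i] = symbol
                i += 1
            else:
                i += 1
    return expanded


if __name__ == "__main__":
    # Hypothetical pardef body, invented for the demo.
    noun = ET.fromstring(
        '<pardef n="house__n"><e><p><l/><r><s n="n"/><sa/><s n="sg"/></r></p></e></pardef>'
    )
    print(ET.tostring(expand_pardef(noun, sa="unc"), encoding="unicode"))
```

The copy is expanded rather than the original, because the same metaparadigm is reused by many entries with different prm or sa values.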
Alternative/Variant (“alt” or the “v” attribute):
I have considered three language pairs to look into the behaviour of the alt and v attributes: oc-ca, es-ca and es-pt
The alt.xsl is used for the conversion using the “alt” attribute.
The filter.xsl is used for the conversion using the “v” attribute.
For oc-ca there is no filter.xsl; everything is done with alt.xsl alone.
For es-pt there is no alt.xsl; everything is done with filter.xsl alone.
For es-ca we have both alt.xsl and filter.xsl.
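To make the idea of a common treatment concrete, here is a minimal Python sketch of variant selection that handles alt and v in the same way. It does not reproduce alt.xsl or filter.xsl; it assumes that an entry without either attribute is always kept and that an entry carrying one is kept only when its value matches the requested variant. The function name and the variant labels in the demo are hypothetical.

```python
import xml.etree.ElementTree as ET


def select_variant(dictionary, variant, attrs=("alt", "v")):
    """Keep <e> entries that are unmarked or marked with `variant`,
    then strip the variant attributes. Modifies `dictionary` in place."""
    for parent in list(dictionary.iter()):
        for entry in list(parent):
            if entry.tag != "e":
                continue
            marks = [entry.get(a) for a in attrs if entry.get(a) is not None]
            if marks and variant not in marks:
                parent.remove(entry)  # entry belongs to another variant
            else:
                for a in attrs:
                    entry.attrib.pop(a, None)
    return dictionary


if __name__ == "__main__":
    # "varA" and "varB" are made-up variant labels for the demo.
    root = ET.fromstring(
        '<dictionary><section id="main" type="standard">'
        '<e lm="a"><i>a</i></e>'
        '<e lm="b" alt="varA"><i>b</i></e>'
        '<e lm="c" v="varA"><i>c</i></e>'
        '<e lm="d" alt="varB"><i>d</i></e>'
        '</section></dictionary>'
    )
    select_variant(root, "varA")
    print(ET.tostring(root, encoding="unicode"))
```

Treating both attribute names through one parameter is the point of the sketch: once the dictionaries use a single attribute, the tuple shrinks to one name.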
The project mainly consists of two parts:
1. The meta dictionary formats
2. The conversion using the alt and v attributes for the variants.
Problems with the current scenario:
1. Meta dictionary formats: The XSLT and other pre-processing need to be applied manually to the meta dictionary files before the dictionary files are actually compiled. The XSLT differs between the two types, one with the prm tags and the other with the sa tags.
2. Variant (alt/v attributes) conversion:
The attributes alt and v serve the same purpose but are treated differently, through alt.xsl and filter.xsl respectively.
Part 1: Make the Metadix compatible
Phase 1: a) Generalize buscPar.xsl and principal.xsl for both types of meta dictionary format (one with prm and the other with sa)
b) Creating a tool in the form of shell scripts that converts a meta dictionary file into a dictionary file.
c) Creating a tool, using XSLT and shell scripts, that changes the meta dictionary files to make them compatible with each other under one defined format in the form of metadix.dtd, and also creating apertium-validate-metadix, similar to apertium-validate-dix
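d) Creating a regression testing tool with the help of shell scripts to check that there is no impact from the changes made to the pre-processing XSLT files and the meta dictionary files.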
Investigating the possibility of integration with lt-comp.
lt-comp can be updated to take meta dictionary files directly as input to compile, like lt-comp lr apertium-en-ca.en.metadix apertium-en-ca.en.acx. This will first convert the meta dictionary file apertium-en-ca.en.metadix into the apertium-en-ca.en.dix file and then compile it.
Creating a regression testing tool to check that there is no impact from the changes made to the pre-processing XSLT files and the meta dictionary files, and that the changes in lttoolbox have not affected the behaviour of normal language pairs that don't have meta dictionary files.
Part 2: Make the Variant (alt/v) common:
a) The behaviour of the attribute v can be merged with the attribute alt by updating the dictionary file (.dix).
b) Generalize alt.xsl and filter.xsl into a single alt.xsl for the treatment of variants.
c) Creating a regression testing tool with the help of shell scripts to check that there is no impact from the changes made to the XSLT files and the dictionary files.
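The core of such a regression check is comparing the .dix generated before a change with the one generated after it. The plan above proposes shell scripts; the following Python sketch only illustrates the comparison step. It assumes that whitespace between tags and attribute order are not significant in a .dix file, and the function name is hypothetical. It needs Python 3.8 or later for ET.canonicalize.

```python
import xml.etree.ElementTree as ET


def same_dix(old_xml, new_xml):
    """Return True when two dictionary documents are equivalent after
    canonicalization (attributes sorted, surrounding whitespace stripped)."""
    old = ET.canonicalize(old_xml, strip_text=True)
    new = ET.canonicalize(new_xml, strip_text=True)
    return old == new


if __name__ == "__main__":
    before = (
        '<dictionary><section id="main" type="standard">'
        '<e lm="time"><i>time</i><par n="house__n"/></e>'
        '</section></dictionary>'
    )
    # Same content, different indentation and attribute order.
    after = (
        '<dictionary>\n  <section type="standard" id="main">\n'
        '    <e lm="time"><i>time</i><par n="house__n"/></e>\n'
        '  </section>\n</dictionary>'
    )
    print(same_dix(before, after))
```

Comparing canonical forms rather than raw bytes keeps the check from failing on formatting differences introduced by a rewritten stylesheet.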
Analysis of the support needed for lttoolbox to be able to compile the meta dictionary:
1. Tags like <sa/> and <prm/> are not recognized by lttoolbox. --> We need to add parsing for them to the compiler (updating compiler.cc).
2. In the pardef, <sa/> will be replaced by <s n="unc"/>
The actual value of the sa attribute (<e lm="time"><i>time</i><par n="house__n" sa="unc"/></e>), with which we replace the sa tag to make the new s tag, is found later in the metadix document.
So we either need to parse twice or have this information beforehand.
--> We may introduce something like a pair at the beginning of the dictionary that holds the value with which the sa tag needs to be replaced.
<sapair> <lm>time</lm> <sa>unc</sa> </sapair>
3. We will store the sapair in a data structure such as a map.
4. In compiler.cc, we will add the treatment for <sa/>, similar to that of the s tag with which it is later replaced.
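The first-pass idea in points 2 and 3 can be sketched as follows. This is Python for illustration only and is not lttoolbox code (the real change would be in compiler.cc); the function names are hypothetical, and keying the map by lemma simply follows the sapair example above.

```python
import xml.etree.ElementTree as ET


def collect_sa_pairs(root):
    """First pass: map each entry's lemma to the sa value given on its <par>."""
    pairs = {}
    for entry in root.iter("e"):
        lemma = entry.get("lm")
        for par in entry.iter("par"):
            sa = par.get("sa")
            if lemma is not None and sa is not None:
                pairs[lemma] = sa
    return pairs


def build_sapair_elements(pairs):
    """Turn the map into <sapair> elements that could be written at the
    beginning of the dictionary."""
    elements = []
    for lemma, sa in pairs.items():
        node = ET.Element("sapair")
        ET.SubElement(node, "lm").text = lemma
        ET.SubElement(node, "sa").text = sa
        elements.append(node)
    return elements


if __name__ == "__main__":
    root = ET.fromstring(
        '<dictionary><section id="main" type="standard">'
        '<e lm="time"><i>time</i><par n="house__n" sa="unc"/></e>'
        '<e lm="house"><i>house</i><par n="house__n"/></e>'
        '</section></dictionary>'
    )
    for node in build_sapair_elements(collect_sa_pairs(root)):
        print(ET.tostring(node, encoding="unicode"))
```

With the map available before the paradigms are processed, a single pass over the rest of the document is enough.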
Community Bonding Period :
Create a wiki page about the project so all the information about the project progress can be there.
Familiarize myself more with the meta dictionary format, the behaviour of variants and lttoolbox.
Gather up the resources needed to start this project.
Investigating the pre-processing XSLT files used for both types of meta dictionary files.
Finding the points at which to generalize the pre-processing XSLT files.
Create generalized XSLT files that can be used to convert both types of meta dictionary files.
Making changes in the meta dictionary files to make them compatible.
Creating a tool apertium-validate-metadix to validate meta dictionary files.
Updating the Makefile with the new changes.
Creating a regression testing tool and performing regression testing for the changes made to the pre-processing XSLT files.
Deliverable #1: The generalized version of the XSLT files and the pre-processing for the meta dictionary formats
Investigating lt-comp for the possibility of integrating the compilation of meta dictionaries.
Investigating the impact of the changes in lt-comp on the current normal dictionary files.
Modifying the lt-comp tool to handle meta dictionaries.
Creating a regression testing tool for lt-comp's handling of meta dictionaries.
Deliverable #2: The final version of the pre-processing tool and lt-comp.
Investigating the behaviour of the attributes v and alt for the variants.
Merging alt.xsl and filter.xsl into alt.xsl
Updating the dictionaries that use the attribute v to use the attribute alt.
Updating the Makefile with the new changes.
Creating a regression testing tool and performing regression testing for the changes made to alt.xsl and the dictionary files.
List your skills and give evidence of your qualifications.
Tell us what is your current field of study,major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.
I am a Master's student in Computer Science and Engineering at IIIT Hyderabad, with an interest in NLP and machine translation. You can find my resume on my web page: Web Page Gaurav Agrawal
I work on the Linux operating system and have a good knowledge of the command line.
I have two years of industrial experience on the DALI project for Airbus, which involves processing large input XML files: treating them, performing transformations and validation, and generating the desired output XML file.
I have done a project creating an index of Wiki data and providing a search engine for it, using Java and XML parsing. Git Hub Link
I have a good knowledge of writing shell scripts, as I have taken the course Scripting and the Computing Environment. Git Hub Link
I have also worked on a Python project to create a placement portal. Git hub link: Git Hub Link
I also have a working knowledge of C/C++; you can see some of the work I did as part of the Operating System course. Github Link: Git Hub Link
List non-Summer-of-Code plans you have for the Summer
Especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project.
My summer holidays run from 12th May'14 to 27th July'14, which will be the main period of work on the project. I will be working full time on it and will be able to devote at least 40 hours a week, or more as the need arises.
My classes will start on 1st Aug'14 (only around 2-3 hours will be spent in classes on weekdays). As it will be the beginning of the new session, classes will have no impact on the project work, and we will be in the completion phase of the project. I can assure you that during this period too I will be able to devote at least around 40 hours a week to the project.