Difference between revisions of "User:Ergaurav3/GSOC Application1:Unify the metadix formats"


Revision as of 11:02, 20 March 2014

Contact Information

Name: Gaurav Agrawal

Email: ergaurav2@gmail.com

IRC: ergaurav2

GitHub: https://github.com/ergaurav2


Why is it you are interested in machine translation?

Machine translation is the field that actually makes communication possible among the different parts of the world. People from different regions can communicate with and understand each other because of it.

Human translators are not always feasible or available.

It is a most interesting and emerging field, with a lot of work to do and a lot to learn. It has an impact not just on one part of society but on the whole world.


Why is it that you are interested in the Apertium project?

Since I am interested in machine translation, open source is the best way for me to explore and learn more in this field. In open source, people from different regions contribute, which is at the heart of a good machine translation system. Apertium is the best open-source project I have found in the field: it is easy to understand and contribute to, and it has very active and supportive contributors.

Which of the published tasks are you interested in? What do you plan to do?

Title

Unify the Metadix Formats

Why Google and Apertium should sponsor it

Improving the current language pairs and adding new ones is a continuous process in the Apertium project. The meta dictionaries vary from one language pair to another, and this variation grows as new language pairs are added. It is becoming more complex, and the different metadix formats largely depend on the developer of each language pair. So there is a need to remove this complexity and variation to keep things simple for the system.

How and whom it will benefit in society

This will make the language pairs that use meta dictionary formats easier to use, understand, and contribute to. So it will help both the users of the Apertium project and its contributors. It will definitely help the whole community grow.

Work plan

What work have I already done?

I have been involved in Apertium for the last month and have had many discussions about the project and the coding challenge on the IRC channel and the mailing list.


Installation of Apertium; joining the mailing list, IRC, and SourceForge

Working with the community members over the last month through IRC and the mailing list

Understanding of the dictionary formats

Understanding of the different types of meta dictionary formats

Role of the alternative/variant (alt) in the dictionary

Coding Challenge

I have successfully completed the challenge, and it has been reviewed by the mentor.

It is available on GitHub. Link: [1]



Findings about the topics in the project

Metadictionary Formats

Metadix files must be pre-processed with XSLT files to convert them into the dictionary format. The language pairs en-ca, en-es, fr-es, oc-ca and oc-es contain metadix files. The conversion basically uses two XSLT files, buscPar.xsl and principal.xsl.

buscPar.xsl: It basically does the sorting and creates an intermediate xsl file with the list of verbs that use metaparadigms; uniq is then applied to delete the duplicates.

principal.xsl: This stylesheet contains the rules for converting the <par> and <sa> tags to produce the dictionary format (.dix).

There are two types of meta dictionary (metadix) files:

One contains the <prm/> tags and the other contains the <sa/> tags

   <sa/> tags: en-ca, en-es (en.metadix)
   <prm/> tags: br-fr, fr-es, oc-ca, oc-es (fr.metadix, oc.metadix)

The tag <prm/> is a marker that is used to place a variable text part in the paradigm definition. The tag <sa/> is placed where an optional grammatical symbol (tag) should appear.

For the case of <sa/>, we have: <e lm="time"><i>time</i><par n="house__n" sa="unc"/></e>

In the pardef, <sa/> will be replaced by <s n="unc"/>.

For the case of <prm/>, if we have something like: <e lm="acuélher"><i>acu</i><par n="m/é[T]er__vblex" prm="lh"/></e>

In the pardef, <prm/> will be replaced by the string "lh".
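The two substitutions described above can be sketched in Python. This is only an illustration of the idea, not the logic of buscPar.xsl or principal.xsl: the function names and the paradigm bodies in the usage part are hypothetical, and the sketch assumes that <sa/> is simply dropped when no sa value is supplied.

```python
import xml.etree.ElementTree as ET


def _splice_text(parent, child, text):
    """Remove `child` from `parent`, leaving `text` plus the child's tail in its place."""
    text = text + (child.tail or "")
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text
    parent.remove(child)


def specialize_pardef(pardef_xml, sa=None, prm=None):
    """Return the paradigm with every <sa/> and <prm/> placeholder resolved.

    <sa/>  becomes <s n="SA"/>, or is dropped when `sa` is None (assumption).
    <prm/> becomes the literal text `prm` (empty string when `prm` is None).
    """
    root = ET.fromstring(pardef_xml)
    # Snapshot parents and children first so the tree can be edited safely.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "sa":
                if sa is None:
                    _splice_text(parent, child, "")
                else:
                    child.tag = "s"
                    child.set("n", sa)
            elif child.tag == "prm":
                _splice_text(parent, child, prm or "")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical paradigm bodies, written only to exercise the function.
    noun = ('<pardef n="house__n"><e><p><l/><r><s n="n"/><sa/>'
            '<s n="sg"/></r></p></e></pardef>')
    verb = ('<pardef n="m/é[T]er__vblex"><e><p><l><prm/>er</l>'
            '<r><prm/>er<s n="vblex"/></r></p></e></pardef>')
    print(specialize_pardef(noun, sa="unc"))
    print(specialize_pardef(verb, prm="lh"))
```

The sketch works on one paradigm at a time; a full converter would also have to rewrite each <par> reference so that it points at the specialized paradigm.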

Alternative/Variant (“alt” or the “v” attribute):

I have considered three language pairs to look into the behaviour of the alt and v attributes: oc-ca, es-ca and es-pt.

The alt.xsl is used for the conversion using the “alt” attribute.

The filter.xsl is used for the conversion using the “v” attribute.

For oc-ca we don't have filter.xsl; everything is done with alt.xsl alone.

For es-pt we don't have alt.xsl; everything is done with filter.xsl alone.

For es-ca we have both alt.xsl and filter.xsl.
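Since both attributes select a variant, a single filter could in principle handle them together. The following Python sketch illustrates that idea only; it is not the logic of alt.xsl or filter.xsl. It assumes that an element is kept when it carries neither attribute or when its value equals the requested variant, that each attribute holds a single value, and the variant names "std" and "reg" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attributes treated as variant markers (assumption: both behave the same way).
VARIANT_ATTRS = ("alt", "v")


def variant_of(element):
    """Return the variant an element is restricted to, or None if unrestricted."""
    for name in VARIANT_ATTRS:
        if element.get(name) is not None:
            return element.get(name)
    return None


def select_variant(dix_xml, variant):
    """Drop every element marked for a variant other than `variant`."""
    root = ET.fromstring(dix_xml)
    # Snapshot parents and children so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            marked = variant_of(child)
            if marked is not None and marked != variant:
                # Note: the removed element's tail text is dropped with it.
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    dix = ('<section><e lm="a"/><e lm="b" alt="std"/>'
           '<e lm="c" alt="reg"/><e lm="d" v="reg"/></section>')
    print(select_variant(dix, "reg"))
```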



Project Understanding

The project mainly consists of two parts:

1. The meta dictionary formats

2. The conversion using the alt and v attributes for the variants

Problem with Current Scenario:

1. Meta dictionary formats: The XSLT and other pre-processing steps need to be applied manually to the meta dictionary files before the dictionary files are actually compiled. The XSLT is different for the two types, one with the prm tags and the other with the sa tags.

2. Variant (alt/v attributes) conversion:

The attributes alt and v serve the same purpose but are treated differently, by alt.xsl and filter.xsl respectively.

Solution/Approach:

Part 1: Make the Metadix compatible

Phase 1: a) Generalize buscPar.xsl and principal.xsl for both types of meta dictionary formats (one with prm and the other with sa)

b) Creating a tool in the form of shell scripts that converts a meta dictionary file into a dictionary file.

c) Creating a tool, with the help of XSLT and shell scripts, that changes the meta dictionary files to make them compatible with each other under a single format defined in a metadix.dtd, and also creating apertium-validate-metadix, similar to apertium-validate-dix

d) Creating a regression testing tool with the help of shell scripts to check that these changes have no unintended impact

Phase 2:

lt-comp can be updated to take meta dictionary files directly as input, like lt-comp lr apertium-en-ca.en.metadix apertium-en-ca.en.acx. This will first convert the meta dictionary file apertium-en-ca.en.metadix into apertium-en-ca.en.dix and then compile it.

Creating a regression testing tool to check that the changes made to the pre-processing XSLT files and the meta dictionary files have no impact, and that the changes in lttoolbox have not affected the behaviour of normal language pairs that don't have meta dictionary files.

Part 2: Make the Variant (alt/v) common:

a) The behaviour of the attribute v can be merged with the attribute alt by updating the dictionary file (.dix).

b) Generalize alt.xsl and filter.xsl into a single alt.xsl for the treatment of the variant part.

c) Creating a regression testing tool with the help of shell scripts to check that there is no impact from the changes made to the XSLT files and the dictionary files.
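The regression checks mentioned above amount to comparing the dictionary produced before a change with the one produced after it. The plan is to write the tool as shell scripts; the Python sketch below only illustrates one possible comparison, under the assumption that two outputs count as equivalent when they contain the same <e> entries regardless of order and whitespace. The function names are hypothetical, and ET.canonicalize requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def entry_signatures(dix_xml):
    """Count the canonical form of every <e> entry in a dictionary."""
    root = ET.fromstring(dix_xml)
    signatures = Counter()
    for entry in root.iter("e"):
        # Assumes the text following each entry is whitespace only.
        entry_xml = ET.tostring(entry, encoding="unicode")
        # C14N output sorts attributes; strip_text drops layout whitespace.
        signatures[ET.canonicalize(entry_xml, strip_text=True)] += 1
    return signatures


def regression_diff(before_xml, after_xml):
    """Return (missing, added): entries lost and entries gained by a change."""
    before = entry_signatures(before_xml)
    after = entry_signatures(after_xml)
    missing = sorted((before - after).elements())
    added = sorted((after - before).elements())
    return missing, added


if __name__ == "__main__":
    old = ('<dictionary><section>'
           '<e lm="time"><i>time</i><par n="house__n"/></e>'
           '</section></dictionary>')
    new = ('<dictionary>\n <section>\n'
           '  <e lm="time"><i>time</i><par n="house__n" /></e>\n'
           ' </section>\n</dictionary>')
    missing, added = regression_diff(old, new)
    print("missing:", missing)
    print("added:", added)
```

Comparing entries as a multiset keeps the check insensitive to the re-sorting done during pre-processing while still catching dropped or duplicated entries.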

Work Timeline

Community Bonding Period :

Create a wiki page about the project so that all the information about the project's progress is available there.

Become more familiar with the meta dictionary formats, the behaviour of variants, and lttoolbox.

Gather the resources needed to start this project.

Week 1-2: Understand the pre-processing XSLT files used for both types of meta dictionary files and find the points where they can be generalized.

Create the generalized XSLT files that can be used to convert both types of meta dictionary files.

Week 3-4: Make changes in the meta dictionary files to make them compatible, and create the tool apertium-validate-metadix to validate meta dictionary files.

Update the Makefile with the new changes.

Deliverable # 1: The generalized version of the XSLT files and the pre-processing for the meta dictionary formats

Week 5-7: Understand lt-comp in order to update it for the compilation of meta dictionaries.

Understand the impact of the changes in lt-comp on the current normal dictionary files.

Modify the lt-comp tool to handle meta dictionaries.

Week 8: Create the regression testing tool and perform regression testing for the changes made to the pre-processing XSLT files, the meta dictionary files, and lt-comp.

It will involve testing on both the meta dictionary and the dictionary formats.

Deliverable # 2: The final version of the pre-processing tool and the lt-comp.

Week 9-10: Understand the behaviour of the attribute v and the attribute alt for the variants.

Update the dictionaries that use the attribute v to use the attribute alt.

Merge alt.xsl and filter.xsl into alt.xsl.

Update the Makefile with the new changes.

Week 11:

Create the regression testing tool and perform regression testing for the changes made to alt.xsl and the dictionary files.

Week 12: Final documentation.


List your skills and give evidence of your qualifications.

Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.

I am a Master's student in Computer Science and Engineering at IIIT Hyderabad, with an interest in NLP and machine translation. I work on the Linux operating system and have a good knowledge of its commands.

I have two years of industrial experience working on the DALI project for Airbus, which involves processing large input XML files: treating them, performing transformations and validation, and generating the desired output XML files.

I have done a project on creating an index of Wiki data and providing a search engine for it, using Java and XML parsing. GitHub link: [2]

I have a good knowledge of writing shell scripts, as I have taken the course Scripting and the Computing Environment. GitHub link: [3]

I have also worked on a Python project for creating a placement portal. GitHub link: [4]

I also have a working knowledge of C/C++; you can see some of the work I did as part of the Operating System course. GitHub link: [5]

List non-Summer-of-Code plans you have for the Summer

Especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. We would like to be sure you have at least 30 free hours a week to develop for our project.

My summer holidays run from 12th May '14 to 27th July '14, which will be the main period of work on the project, so I will be able to devote at least 40 hours a week to the project, or more as the need arises.

My classes will start on 1st Aug '14, but as it will be the beginning of the session, classes will have no impact on the project work. By that time we will also be in the completion phase of the project, so I can assure you that during this period too I will be able to devote at least around 40 hours a week to the project.

Comments/Suggestion/Feedback