User:Ergaurav3/GSOC Application2:Plain-text formats for Apertium data

Contact Information

Name: Gaurav Agrawal

Email: ergaurav2@gmail.com

IRC: ergaurav2

GitHub: https://github.com/ergaurav2


Why is it you are interested in machine translation?

Machine translation makes communication possible between different parts of the world. It is what allows people who share no common language to communicate and understand each other.

Relying on human translators is often not feasible, and they are not always available.

It is an interesting, emerging field with a lot of work to do and a lot to learn. It has an impact not just on one part of society but on the whole world.


Why is it that you are interested in the Apertium project?

Since I am interested in machine translation, open source is the best way for me to explore and learn more in this field. In open source, people from different regions contribute, and that diversity is the heart of a good machine translation system. Apertium is the best open-source project I have found in this field: it is easy to understand and contribute to, and it has very active and supportive contributors.

Which of the published tasks are you interested in? What do you plan to do?

Title

   Plain-text formats for Apertium data

Why Google and Apertium should sponsor it

Improving existing language pairs and adding new ones is a continuous process in the Apertium project. Dictionaries and transfer rules are currently written in XML, which some developers find difficult and which can make development slow for them. There is therefore a need to remove this complexity by providing a simple plain-text format for writing dictionaries and transfer rules.

How and whom it will benefit in society

This will make developing transfer rules and dictionaries easier for many developers, and will make the data easier to understand and contribute to. It will therefore help both the users of the Apertium project and its contributors, and it will definitely help the whole community grow.

Work plan

What work have I already done?

I have been involved with Apertium for the last month and have had many discussions about the project and the coding challenge on the IRC channel and the mailing list.

I have installed Apertium and joined the mailing list, the IRC channel and SourceForge.

I have been working with community members over the last month through IRC and the mailing list.

I have read the research papers on InterNostrum and MorphTrans [1][2] and understand the format.

Coding Challenge

I have successfully completed the challenge, and it has been reviewed by the mentor, Mikel.

It is available on GitHub. Link: GitHub Link

As part of the Coding Challenge [3], I wrote a parser that converts a *.mode shell-script fragment into a modes.xml file. In addition to the Coding Challenge, I also wrote a parser that converts the modes.xml file back into the *.mode fragment.
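
To illustrate the kind of conversion involved, here is a minimal sketch in Java of the *.mode to modes.xml direction. It is not the code submitted for the challenge. The class name ModeFragmentConverter and its methods are hypothetical, and the sketch assumes that the first token of each pipeline stage is the program and that every later token is a data file, so flags and shell variables are not treated specially.

```java
/** Sketch only: turns one pipeline line of a *.mode fragment into a <mode> element. */
final class ModeFragmentConverter {

    /** Escapes the characters that are special inside an XML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /**
     * Converts a pipeline such as "lt-proc a.bin | apertium-transfer b.t1x b.bin"
     * into a <mode> element with one <program> per stage.
     * Assumption: token 0 of a stage is the program, the rest are files.
     */
    static String toModeXml(String modeName, String pipeline) {
        StringBuilder xml = new StringBuilder();
        xml.append("<mode name=\"").append(escape(modeName)).append("\">\n");
        xml.append("  <pipeline>\n");
        for (String stage : pipeline.trim().split("\\s*\\|\\s*")) {
            if (stage.trim().isEmpty()) {
                continue; // ignore empty stages, e.g. an empty input line
            }
            String[] tokens = stage.trim().split("\\s+");
            xml.append("    <program name=\"").append(escape(tokens[0])).append("\">\n");
            for (int i = 1; i < tokens.length; i++) {
                xml.append("      <file name=\"").append(escape(tokens[i])).append("\"/>\n");
            }
            xml.append("    </program>\n");
        }
        xml.append("  </pipeline>\n");
        xml.append("</mode>\n");
        return xml.toString();
    }
}
```

Calling ModeFragmentConverter.toModeXml("es-ca", "lt-proc es-ca.automorf.bin | apertium-transfer a.t1x a.t1x.bin") would produce a mode with two program elements. A real converter would also need to handle flags, shell variables such as $1, and the enclosing modes element.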

Mikel suggested that it would be good to look at the coding challenge for the project Unify the Metadix Formats, and I have already attempted that coding challenge as well. It is available on GitHub. GitHub Link


Project Understanding

The project mainly consists of two parts:

1. Round-trip conversion of the transfer rules.

2. Round-trip conversion of the dictionary files.

Problem with Current Scenario:

Currently, both the transfer rules and the dictionary files are written in XML.

Many developers are comfortable with these XML formats, but some find it easier to write the data in plain-text formats.

Solution/Approach:

Part 1: Conversion of the Transfer Rules

A MorphTrans-style text format will be specified for the transfer rule XML files.

A tool will be developed in Java to convert MorphTrans-style text-format files into transfer rule XML files.

Are there any things not covered by the MorphTrans format that need taking care of? --Mlforcada (talk) 06:42, 21 March 2014 (UTC)

A tool will be developed using XSLT to convert the current transfer rule XML files into MorphTrans-style text-format rule files.

Round-trip check tool: we will convert from text to XML and then from XML back to text, or vice versa, depending on which form was modified. The check will also validate the XML, to confirm that no error is introduced when converting from one form to the other.

It should be a round-trip check rather than regression testing. --Mlforcada (talk) 06:44, 21 March 2014 (UTC)
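
The round-trip check could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class RoundTripCheck and its method names are hypothetical, the two converters are passed in as plain functions because the real tools do not exist yet, and the XML check here only tests well-formedness, which is weaker than validating against the Apertium DTDs.

```java
import java.io.StringReader;
import java.util.function.UnaryOperator;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch only: checks that text -> XML -> text gives back the original text. */
final class RoundTripCheck {

    /** Returns true if the string parses as well-formed XML (no DTD validation). */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            return false; // any parse or configuration failure counts as a failed check
        }
    }

    /** Collapses whitespace so that purely cosmetic differences are ignored. */
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    /**
     * Converts text to XML, checks the XML, converts back, and compares the
     * result with the original text after whitespace normalization.
     */
    static boolean textSurvivesRoundTrip(String text,
                                         UnaryOperator<String> textToXml,
                                         UnaryOperator<String> xmlToText) {
        String xml = textToXml.apply(text);
        if (!isWellFormed(xml)) {
            return false;
        }
        return normalize(xmlToText.apply(xml)).equals(normalize(text));
    }
}
```

The same function covers the opposite direction if the roles of the two converters are swapped and the well-formedness check is applied to the starting XML instead. Whether whitespace should be normalized before comparing is a design choice that would depend on the final text format.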


Part 2: Conversion of the Dictionary

A MorphTrans-style text format will be specified for the dictionary XML files.

Morphtrans is a format for structural transfer rules, not for dictionaries. Connect this to the references given above --Mlforcada (talk) 06:42, 21 March 2014 (UTC)

A tool will be developed in Java to convert MorphTrans-style text-format files into dictionary XML files.

A tool will be developed using XSLT to convert the current dictionary XML files into MorphTrans-style text-format dictionary files.

Do you think a single XSLT stylesheet will do or will multi-pass be necessary as with Metadix? --Mlforcada (talk) 06:42, 21 March 2014 (UTC)

Round-trip check tool: we will convert from text to XML and then from XML back to text, or vice versa, depending on which form was modified. The check will also validate the XML, to confirm that no error is introduced when converting from one form to the other.

It should be a round-trip check rather than regression testing. --Mlforcada (talk) 06:44, 21 March 2014 (UTC)
As a cherry on the cake, is there any way you could, in an application, keep both formats updated when someone is editing the simpler format? --Mlforcada (talk) 06:44, 21 March 2014 (UTC)

Yes, we can update the Makefile so that if a user has modified either format, the other format is also updated during make. However, make will need to be run before compiling the dictionary for the changes to be taken into account.
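
The decision that make would take here, namely which of the two files is stale, can be sketched in Java as below. The class FormatSync and the Action names are hypothetical, and the sketch assumes that the file modified most recently is the one the user edited.

```java
import java.io.File;

/** Sketch only: decides which format must be regenerated, based on timestamps. */
final class FormatSync {

    enum Action { NONE, REGENERATE_XML, REGENERATE_TEXT }

    /**
     * Compares last-modified times in milliseconds. A value of 0 stands for a
     * missing file, which is what File.lastModified() returns in that case.
     */
    static Action decide(long textModified, long xmlModified) {
        if (textModified > xmlModified) {
            return Action.REGENERATE_XML;  // the text file is newer, so the XML is stale
        }
        if (xmlModified > textModified) {
            return Action.REGENERATE_TEXT; // the XML file is newer, so the text is stale
        }
        return Action.NONE;                // same timestamp: nothing to do
    }

    /** Convenience overload that reads the timestamps from the files themselves. */
    static Action decide(File textFile, File xmlFile) {
        return decide(textFile.lastModified(), xmlFile.lastModified());
    }
}
```

One caveat of this design: after a regeneration the generated file is newer than its source, so a real tool would probably need to reset the timestamps (for example with File.setLastModified) to avoid converting back on the next run.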

Work Timeline

Community Bonding Period:

Create a wiki page about the project so that all information about its progress is in one place.

Investigate the dictionary and transfer rule files in more detail.

Gather the resources needed to start this project.

Week 1:

Finalize the MorphTrans-style text-format for the transfer rules.


Week 2:

Develop the tool in Java to convert MorphTrans-style text-format files into transfer rule XML.

Week 3:

Develop the tool using XSLT to convert transfer rule XML files into MorphTrans-style text-format rule files.

Week 4:

Create the regression testing tool and perform regression testing for both directions of conversion.

Deliverable #1: The final tool for two-way conversion of the transfer rule format

Week 5:

Finalize the MorphTrans-style text-format for the Dictionary Files.


Week 6:

Develop the tool in Java to convert MorphTrans-style text-format files into dictionary XML.

Week 7:

Develop the tool using XSLT to convert dictionary XML files into MorphTrans-style text-format dictionary files.

Week 8:

Create the regression testing tool and perform regression testing for both directions of conversion.

Deliverable #2: The final tool for two-way conversion of the dictionary file format


Week 9:

Finalize the MorphTrans-style text format for the meta dictionary files.

Update the tool that was created for dictionary files so that it also converts MorphTrans-style text-format files into meta dictionary XML.

Week 10:

Update the tool that was created for dictionary files so that it also converts meta dictionary XML files into MorphTrans-style text-format files.

Week 11:

Create the regression testing tool and perform regression testing for both directions of conversion.

Week 12:

Final Documentation



List your skills and give evidence of your qualifications.

Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.

I am a Master's student in Computer Science and Engineering at IIIT Hyderabad, with an interest in NLP and machine translation. You can find my resume on my web page. Web Resume Link

I work on the Linux operating system and have a good knowledge of its commands.

I have two years of industrial experience on the DALI project for Airbus, which involves processing large input XML files: treating them, transforming and validating them, and generating the desired output XML files.

I have done a project that builds an index of Wiki data and provides a search engine over it, using Java and XML parsing. Git Hub Link

I have a good knowledge of writing shell scripts, as I have taken the course Scripting and the Computing Environment. Git Hub Link

I have also worked on a Python project to create a placement portal. Git hub link: Git Hub Link

I also have a working knowledge of C/C++; you can see some of the work I did as part of the Operating System course. Github Link: Git Hub Link

List non-Summer-of-Code plans you have for the Summer

Especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. We would like to be sure you have at least 30 free hours a week to develop for our project.

My summer holidays run from 12th May '14 to 27th July '14, which will be the main period for working on the project. During this time I will be able to devote at least 40 hours a week to the project, or more if the need arises.

My classes will start on 1st Aug '14. Since that will be the beginning of the session, classes will have no impact on the project work, and by then the project will be in its completion phase. I can assure you that during this period too I will be able to devote around 40 hours a week to the project.