User:Ergaurav3/GSOC Application2:Plain-text formats for Apertium data
Contents
- 1 Contact Information
- 2 Why is it you are interested in machine translation?
- 3 Why is it that you are interested in the Apertium project?
- 4 Which of the published tasks are you interested in? What do you plan to do?
- 5 Work plan
- 6 Work Timeline
- 7 List your skills and give evidence of your qualifications.
- 8 List non-Summer-of-Code plans you have for the Summer
Contact Information
Name: Gaurav Agrawal
Email: ergaurav2@gmail.com
IRC: ergaurav2
GitHub: https://github.com/ergaurav2
Why is it you are interested in machine translation?
Machine translation makes communication possible between different parts of the world. It is a large part of why people who share no common language are able to understand each other.
Human translators are not always available, and relying on them is often not feasible.
It is an interesting, emerging field with a lot of work to do and a lot to learn. Its impact is not limited to one part of society; it reaches the whole world.
Why is it that you are interested in the Apertium project?
Since I am interested in machine translation, open source is the best way for me to explore and learn more in this field. In open source, people from different regions contribute, and that diversity is at the heart of a good machine translation system. Apertium is the best open-source project I have found in the field: it is easy to understand and contribute to, and it has very active and supportive contributors.
Which of the published tasks are you interested in? What do you plan to do?
Title
Plain-text formats for Apertium data
Why Google and Apertium should sponsor it
Improving existing language pairs and adding new ones is a continuous process in the Apertium project. Dictionaries and transfer rules are currently written in XML, which some developers find difficult, making development slower and more time-consuming for them. There is therefore a need to remove this complexity by providing a simple plain-text format for writing dictionaries and transfer rules.
How and who it will benefit in society
This will make developing transfer rules and dictionaries easier for many developers, and will make the data easier to understand and contribute to. It will therefore help both the users of the Apertium project and its contributors, and it will definitely help the whole community grow.
Work plan
What work have I already done?
I have been involved with Apertium for the past month and have had many discussions about the project and the coding challenge on the IRC channel and the mailing list.
Installed Apertium and joined the mailing list, the IRC channel, and SourceForge.
Worked with community members over the past month through IRC and the mailing list.
I have read the research papers on InterNostrum and MorphTrans [1] [2] and understand the formats.
Coding Challenge
I have successfully completed the coding challenge, and it has been reviewed by the mentor, Mikel.
It is available on GitHub. Link: GitHub Link
As part of the coding challenge [3], I wrote a parser that converts a *.mode shell-script fragment into a modes.xml file. In addition to the coding challenge, I also wrote a parser that converts a modes.xml file back into *.mode fragments.
As Mikel suggested, it is also worth looking at the coding challenge for the project Unify the Metadix Formats, and I have already attempted that coding challenge as well. It is available on GitHub. GitHub Link
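The *.mode to modes.xml direction described above can be illustrated with a minimal sketch. It is not the parser submitted for the coding challenge (that one is in the linked repository). It is written in Python for brevity, and it rests on simplifying assumptions that are labelled in the comments: stages are separated by a plain `|`, the first token of each stage is the program, and any argument containing a slash is a data file. The function name `mode_to_xml` is hypothetical.

```python
import os
import shlex
import xml.etree.ElementTree as ET


def mode_to_xml(mode_name, fragment):
    """Convert one *.mode pipeline fragment into a <modes> element tree."""
    modes = ET.Element("modes")
    mode = ET.SubElement(modes, "mode", {"name": mode_name, "install": "yes"})
    pipeline = ET.SubElement(mode, "pipeline")

    # Assumption: no stage contains a quoted "|" character.
    for stage in fragment.split("|"):
        tokens = shlex.split(stage)
        if not tokens:
            continue  # skip empty stages such as a trailing pipe
        # Assumption: the first token is the program, with or without a path.
        name_parts = [os.path.basename(tokens[0])]
        files = []
        for token in tokens[1:]:
            # Assumption: an argument containing a slash is a data file;
            # everything else (flags, $1, $2) stays in the program name.
            if "/" in token:
                files.append(os.path.basename(token))
            else:
                name_parts.append(token)
        program = ET.SubElement(pipeline, "program", {"name": " ".join(name_parts)})
        for file_name in files:
            ET.SubElement(program, "file", {"name": file_name})
    return modes
```

Calling `ET.tostring(mode_to_xml("en-es", fragment), encoding="unicode")` gives the XML as a string. A real converter would also have to handle the cases this sketch ignores, such as quoting and arguments that are files but contain no slash.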
Project Understanding
The project mainly consists of two parts:
1. Round-trip conversion of the transfer rules.
2. Round-trip conversion of the dictionary files.
Problem with Current Scenario:
Currently, both the transfer rules and the dictionary files are written in XML.
Many developers are comfortable with these XML formats, but some find it easier to write the data in plain-text formats.
Solution/Approach:
Part 1: Conversion of the Transfer Rules
A MorphTrans-style text format will be specified for the transfer rule XML files.
We will use the research paper available here as the basis for it.
A tool will be developed in Java to convert MorphTrans-style text-format files into transfer rule XML files.
I will analyze MorphTrans further to work out the details of this format.
A tool will be developed using XSLT to convert the current transfer rule XML files into MorphTrans-style text-format rule files.
Regression testing tool: a round-trip checker will convert text to XML and then XML back to text, or vice versa, depending on which form was modified. It will also validate the XML to confirm that no errors were introduced during the conversion from one form to the other.
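The round-trip check described above could look roughly like the following sketch. It is written in Python for brevity, whereas the proposed tools use Java and XSLT. The names `round_trip_check`, `toy_to_xml`, and `toy_to_text` are hypothetical, and the toy converters use an invented one-rule-per-line format that only stands in for the real converters. The check shown here tests well-formedness only; validation against the Apertium DTD would be a further step.

```python
import xml.etree.ElementTree as ET


def normalize(text):
    """Drop blank lines and trailing spaces so cosmetic differences are ignored."""
    return [line.rstrip() for line in text.splitlines() if line.strip()]


def round_trip_check(text, to_xml, to_text):
    """Return (ok, message) for the conversion text -> XML -> text."""
    xml_string = to_xml(text)
    try:
        ET.fromstring(xml_string)  # well-formedness only, not DTD validation
    except ET.ParseError as error:
        return False, "generated XML is not well-formed: %s" % error
    regenerated = to_text(xml_string)
    if normalize(regenerated) != normalize(text):
        return False, "text changed after the round trip"
    return True, "round trip preserved the text"


# Toy converters (hypothetical), standing in for the real Java and XSLT tools.
def toy_to_xml(text):
    root = ET.Element("rules")
    for line in normalize(text):
        ET.SubElement(root, "rule").text = line
    return ET.tostring(root, encoding="unicode")


def toy_to_text(xml_string):
    return "\n".join(rule.text for rule in ET.fromstring(xml_string))
```

The same function can check the other direction (XML to text and back) by swapping the two converters and comparing parsed XML instead of normalized lines.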
Part 2: Conversion of the Dictionary
A MorphTrans-style morphological dictionary text format will be specified for the dictionary XML files.
We will use the research paper available here as the basis for it.
A tool will be developed in Java to convert morphological dictionary text-format files into dictionary XML files.
A tool will be developed using XSLT to convert the current dictionary XML files into morphological dictionary text-format files.
The Metadix format is also a form of XML, so converting a metadictionary into a text format needs only one XSLT. The text format for the metadictionary will, however, differ from the text format for the dictionary.
To convert the metadictionary text format into the dictionary XML format, we will first convert it into the metadictionary XML format and then apply the existing pre-processing to obtain the dictionary XML format.
Regression testing tool: a round-trip checker will convert text to XML and then XML back to text, or vice versa, depending on which form was modified. It will also validate the XML to confirm that no errors were introduced during the conversion from one form to the other.
The Makefile can be updated so that if a user modifies either format, running make also updates the other format. Note that make must be run before compiling the dictionary for the changes to take effect.
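To make the dictionary conversion in Part 2 more concrete, here is a minimal sketch of both directions. The text format has not been specified yet, so the `lemma;stem;paradigm` line format used here is invented purely for illustration and is not the MorphTrans-style format the project would define. The sketch is in Python for brevity (the proposed tools use Java and XSLT), the function names are hypothetical, and it covers only the simplest kind of monolingual entry.

```python
import xml.etree.ElementTree as ET


def entries_to_section(text):
    """Build a <section> of <e> entries from 'lemma;stem;paradigm' lines."""
    section = ET.Element("section", {"id": "main", "type": "standard"})
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no entry
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError("line %d: expected lemma;stem;paradigm" % number)
        lemma, stem, paradigm = fields
        entry = ET.SubElement(section, "e", {"lm": lemma})
        ET.SubElement(entry, "i").text = stem
        ET.SubElement(entry, "par", {"n": paradigm})
    return section


def section_to_entries(section):
    """Inverse direction: turn <e> entries back into text lines."""
    lines = []
    for entry in section.findall("e"):
        stem = entry.findtext("i", default="")
        paradigm = entry.find("par").get("n")
        lines.append(";".join([entry.get("lm"), stem, paradigm]))
    return "\n".join(lines)
```

Reporting the line number in the error message is a deliberate choice: one advantage of a text format is that contributors edit it by hand, so the converter should point them to the exact line that failed.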
Work Timeline
Community Bonding Period:
Create a wiki page for the project so that all information about its progress is kept in one place.
Investigate the dictionary and transfer rule files in more depth.
Gather the resources needed to start the project.
Week 1:
Finalize the MorphTrans-style text-format for the transfer rules.
Week 2:
Develop the tool in Java for the conversion of the MorphTrans-style text-format files into the transfer rule XML.
Week 3:
Develop the tool using XSLT to convert transfer rule XML files into MorphTrans-style text-format rule files.
Update the Makefile so that both formats are kept up to date.
Week 4:
Create the regression testing tool (round-trip checker) and validate the conversion in both directions.
Deliverable #1: The final tool for two-way conversion of the transfer rule format
Week 5:
Finalize the Morphological dictionary text-format for the Dictionary Files.
Week 6:
Develop the tool in Java for the conversion of the Morphological dictionary text-format files into the dictionary XML.
Week 7:
Develop the tool using XSLT to convert dictionary XML files into morphological dictionary text-format files.
Update the Makefile so that both formats are kept up to date.
Week 8:
Create the regression testing tool (round-trip checker) and validate the conversion in both directions.
Deliverable #2: The final tool for two-way conversion of the dictionary file format
Week 9:
Finalize the morphological dictionary text format for the metadictionary files.
Update the tool created for the dictionary files so that it also converts morphological dictionary text-format files into metadictionary XML.
Week 10:
Update the tool created for the dictionary files so that it also converts dictionary XML files into morphological dictionary text-format files.
Update the Makefile so that both formats are kept up to date.
Week 11:
Create the regression testing tool (round-trip checker) and validate the conversion in both directions.
Week 12:
Final Documentation
List your skills and give evidence of your qualifications.
Tell us what your current field of study, major, etc. is. Convince us that you can do the work. In particular, we would like to know whether you have programmed before in open-source projects.
I am a Master's student in Computer Science and Engineering at IIIT, Hyderabad, with an interest in NLP and machine translation. You can find my resume on my web page. Web Resume Link
I work on the Linux operating system and have a good knowledge of the command line.
I have two years of industrial experience on the DALI project for Airbus, which involved processing large input XML files: treating them, transforming them, validating them, and generating the desired output XML files.
I have done a project on indexing Wiki data and providing a search engine for it, using Java and XML parsing. Git Hub Link
I have a good knowledge of writing shell scripts, as I have taken the course Scripting and the Computing Environment. Git Hub Link
I have also worked on a Python project for creating a placement portal. GitHub link: Git Hub Link
I also have a working knowledge of C/C++. You can see some of the work I did as part of the Operating System course. GitHub link: Git Hub Link
List non-Summer-of-Code plans you have for the Summer
Especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. We would like to be sure you have at least 30 free hours a week to develop for our project.
My summer holidays run from 12 May 2014 to 27 July 2014, which will be the main period for working on the project. I will be able to devote at least 40 hours a week to the project during that time, or more as the need arises.
My classes start on 1 August 2014, but since that is the beginning of the session, classes will have no impact on the project work. By then the project will also be in its completion phase, so I can assure you that during this period too I will be able to devote about 40 hours a week to the project.