Difference between revisions of "User:Ergaurav3/GSOC Application2:Plain-text formats for Apertium data"
(18 intermediate revisions by 2 users not shown) | |||
Line 8: | Line 8: | ||
'''GitHub:''' https://github.com/ergaurav2 |
'''GitHub:''' https://github.com/ergaurav2 |
||
'''SourceForge''' : ergaurav2 |
|||
'''WebLink''' : http://web.iiit.ac.in/~gaurav.agrawal/ |
|||
---- |
---- |
||
Line 39: | Line 43: | ||
The improvement in the current language pair and the addition of the new language pair is a continuous process in the Apertium project. |
The improvement in the current language pair and the addition of the new language pair is a continuous process in the Apertium project. |
||
The current way of writing the dictionary and the |
The current way of writing the dictionary and the transfer rule is in the format of xml which may be difficult for some developers and make the development difficult and time consuming for them. |
||
So, there is need of removing these complexity and providing simple way which will be in text format to write the dictionary and the transfer rules. |
So, there is need of removing these complexity and providing simple way which will be in text format to write the dictionary and the transfer rules. |
||
Line 70: | Line 74: | ||
Same is available on the github. Link: [https://github.com/ergaurav2/apertium-plain-text-format-coding-challenge GitHub Link] |
Same is available on the github. Link: [https://github.com/ergaurav2/apertium-plain-text-format-coding-challenge GitHub Link] |
||
As a part of Coding Challenge [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Plain-text_formats_for_Apertium_data] I have write the parser to convert a *.mode shell-script fragment into a modes.xml file. To the addition to the Coding Challenge I have also write a parser to convert the modes.xml file into the *.mode frgament. |
|||
As suggested by the Mikel, It is good to see the Coding Challenge for the Project Unify the Metadix Formats, I have already attempted that coding challenge. Available on the Github. [https://github.com/ergaurav2/apertium-unify-metadix-coding-challenge GitHub Link] |
As suggested by the Mikel, It is good to see the Coding Challenge for the Project Unify the Metadix Formats, I have already attempted that coding challenge. Available on the Github. [https://github.com/ergaurav2/apertium-unify-metadix-coding-challenge GitHub Link] |
||
Line 79: | Line 85: | ||
The project is majorly consist of two parts: |
The project is majorly consist of two parts: |
||
1. The conversion of the the transfer rules |
1. The round conversion of the the transfer rules. |
||
2. The conversion of the dictionary files. |
2. The round conversion of the dictionary files. |
||
==== Problem with Current Scenario: ==== |
==== Problem with Current Scenario: ==== |
||
Line 91: | Line 98: | ||
'''Part 1: Conversion of the Transfer Rules''' |
'''Part 1: Conversion of the Transfer Rules''' |
||
|||
A MorphTrans-style text-format will be specified for the Transfer rules XML files. |
A MorphTrans-style text-format will be specified for the Transfer rules XML files. |
||
We will be using the research paper available [http://www.internostrum.com/docum/morphtrans.ps here] for the same. |
|||
A tool will be developed in the Java for the conversion of the MorphTrans-style text-format files into the transfer rule XML Files. |
A tool will be developed in the Java for the conversion of the MorphTrans-style text-format files into the transfer rule XML Files. |
||
::: Are there any things not covered by the MorphTrans format that need taking care of? --[[User:Mlforcada|Mlforcada]] ([[User talk:Mlforcada|talk]]) 06:42, 21 March 2014 (UTC) |
|||
<I will anlayze the MorphTrans Further and try to figure out the same |
|||
A tool will be developed using the xslts to convert the current transfer rule XML files into the MorphTrans-style text-format rule files. |
A tool will be developed using the xslts to convert the current transfer rule XML files into the MorphTrans-style text-format rule files. |
||
Round trip check tool in we will convert from text to xml and then xml to back text or vice-versa depending on which form we have modified including also the validation of xml to validate there is no error during the conversion from one form to another. |
|||
::: It should be a round-trip check rather than regression testing. --[[User:Mlforcada|Mlforcada]] ([[User talk:Mlforcada|talk]]) 06:44, 21 March 2014 (UTC) |
|||
'''Part 2: Conversion of the Dictionary''' |
'''Part 2: Conversion of the Dictionary''' |
||
Morphological dictionary text-format will be specified for the Dictionary XML files. |
|||
We will use the research paper available [http://www.sepln.org/revistaSEPLN/revista/25/25-Pag93.pdf here] for the same. |
|||
::: Morphtrans is a format for structural transfer rules, not for dictionaries. Connect this to the references given above --[[User:Mlforcada|Mlforcada]] ([[User talk:Mlforcada|talk]]) 06:42, 21 March 2014 (UTC) |
|||
::: Do you think a single XSLT stylesheet will do or will multi-pass be necessary as with Metadix? --[[User:Mlforcada|Mlforcada]] ([[User talk:Mlforcada|talk]]) 06:42, 21 March 2014 (UTC) |
|||
For the Metadix format it is also a form of XML, so to convert a Metadictionary in a text format, we will actually need only one XSLT, Yes the text format for the Meta dictionary will be different from the Text format for the Dictionary. |
|||
To convert the Meta dictionary Text Format into the Dictionary XML format, we will need to first convert it into the Meta dictionary XML format and then applying the existing pre-processing to convert it into the Dictionary XML format. |
|||
Round trip check tool in we will convert from text to xml and then xml to back text or vice-versa depending on which form we have modified including also the validation of xml to validate there is no error during the conversion from one form to another. |
|||
::: It should be a round-trip check rather than regression testing. --[[User:Mlforcada|Mlforcada]] ([[User talk:Mlforcada|talk]]) 06:44, 21 March 2014 (UTC) |
|||
::: As a cherry on the cake, is there any way you could, in an application, keep both formats updated when someone is editing the simpler format? --[[User:Mlforcada|Mlforcada]] ([[User talk:Mlforcada|talk]]) 06:44, 21 March 2014 (UTC) |
|||
Yes, we can update the Makefile so that if we a user have modified any of the formats during make the other format also get updated. But yes, you will need to call the make before compiling the dictionary to have the changes considered. |
|||
Regression Testing tool to validate there is no error during the conversion from one form to another. |
|||
== Work Timeline == |
== Work Timeline == |
||
Line 133: | Line 164: | ||
Develop the tool using the xslts to convert transfer rule XML files into the MorphTrans-style text-format rule file. |
Develop the tool using the xslts to convert transfer rule XML files into the MorphTrans-style text-format rule file. |
||
Updating the make file so both the formats is updated. |
|||
'''Week 4:''' |
'''Week 4:''' |
||
Creating |
Creating Round trip checker and perform the validation for the both way of conversion. |
||
'''''Deliverable # 1''' : The final tool for the both way conversion of the format of Transfer Rules'' |
'''''Deliverable # 1''' : The final tool for the both way conversion of the format of Transfer Rules'' |
||
Line 142: | Line 175: | ||
'''Week 5:''' |
'''Week 5:''' |
||
Finalize the |
Finalize the Morphological dictionary text-format for the Dictionary Files. |
||
'''Week 6:''' |
'''Week 6:''' |
||
Develop the tool in Java for the conversion of the |
Develop the tool in Java for the conversion of the Morphological dictionary text-format files into the dictionary XML. |
||
'''Week 7:''' |
'''Week 7:''' |
||
Develop the tool using the xslts to convert dictonary XML files into the |
Develop the tool using the xslts to convert dictonary XML files into the Morphological dictionary text-format rule file. |
||
Updating the make file so both the formats is updated |
|||
'''Week 8:''' |
'''Week 8:''' |
||
Creating |
Creating Round trip checker and perform the validation for the both way of conversion. |
||
'''''Deliverable # 2''' : The final tool for the both way conversion of the format of Dictionary Files'' |
'''''Deliverable # 2''' : The final tool for the both way conversion of the format of Dictionary Files'' |
||
Line 162: | Line 197: | ||
'''Week 9:''' |
'''Week 9:''' |
||
Finalize the |
Finalize the Morphological dictionary text-format for the meta Dictionary Files. |
||
Updating the tool that was creating for Dictionary Files to also provide the conversion of the |
Updating the tool that was creating for Dictionary Files to also provide the conversion of the Morphological dictionary text-format files into the meta dictionary XML. |
||
'''Week 10:''' |
'''Week 10:''' |
||
Updating the tool that was creating for Dictionary Files to also convert also dictonary XML files into the |
Updating the tool that was creating for Dictionary Files to also convert also dictonary XML files into the Morphological dictionary text-format rule file. |
||
Updating the make file so both the formats is updated |
|||
'''Week 11:''' |
'''Week 11:''' |
||
Creating |
Creating Round trip checker and perform the validation for the both way of conversion. |
||
'''Week 12:''' |
'''Week 12:''' |
||
Line 184: | Line 221: | ||
Tell us what is your current field of study,major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects. |
Tell us what is your current field of study,major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects. |
||
I am a student of Master in Computer Science and Engineering with the interest in the NLP and the machine translation from IIIT, Hyderabad. You can find my resume on web page |
I am a student of Master in Computer Science and Engineering with the interest in the NLP and the machine translation from IIIT, Hyderabad. You can find my resume on web page. [http://web.iiit.ac.in/~gaurav.agrawal/ Web Resume Link] |
||
I work on the Linux operating system and have a good knowledge of command. |
I work on the Linux operating system and have a good knowledge of command. |
||
Line 197: | Line 234: | ||
I have also the working knowledge of the C/C++, you can see some of work done by me as a part of course Operating System.Github Link: [https://github.com/ergaurav2/OS Git Hub Link] |
I have also the working knowledge of the C/C++, you can see some of work done by me as a part of course Operating System.Github Link: [https://github.com/ergaurav2/OS Git Hub Link] |
||
== List non-Summer-of-Code plans you have for the Summer== |
|||
== List non-Summer-of-Code plans you have for the Summer== |
== List non-Summer-of-Code plans you have for the Summer== |
||
Line 202: | Line 241: | ||
Especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project. |
Especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project. |
||
I have my summer holidays from the 12th May'14 to the 27th July'14 which will main period to work on the project so I will be able to devoted alteast 40 hours to the project or more depending on the need arises with the time. |
I have my summer holidays from the 12th May'14 to the 27th July'14 which will main period to work on the project so I will working full time on the project and will be able to devoted alteast 40 hours to the project or more depending on the need arises with the time. |
||
My classes will start from the 1st Aug'14 |
My classes will start from the 1st Aug'14 (only around 2-3 hours will be spend in classes during weekdays) and as it will be the beginning of the new session, there will be no impact on the project work due to classes and we will be in the completing phase of the project. I can assure you during this period also, I will be able to devote atleast around 40 hours towards the project. |
||
[[Category:GSoC 2014 Student proposals|Ergaurav3]] |
|||
== Comments/Suggestion/Feedback == |
Latest revision as of 14:22, 21 March 2014
Contact Information
Name: Gaurav Agrawal
Email: ergaurav2@gmail.com
IRC: ergaurav2
GitHub: https://github.com/ergaurav2
SourceForge : ergaurav2
WebLink : http://web.iiit.ac.in/~gaurav.agrawal/
Why is it you are interested in machine translation?
Machine translation is the field that actually makes communication possible among the different parts of the world. It is what allows people in different parts of the world to communicate with and understand each other.
Human translators are not always available, and relying on them is often not feasible.
It is an interesting, emerging field with a lot of work to do and a lot to learn. It has an impact not just on one part of society but on the whole world.
Why is it that you are interested in the Apertium project?
Since I am interested in machine translation, the best way for me to explore and learn more about the field is through open source. In open source, people from different regions contribute, which is at the heart of a good machine translation system. Apertium is the best open-source project I have found in this field: it is easy to understand and contribute to, and it has very active and supportive contributors.
Which of the published tasks are you interested in? What do you plan to do?
Title
Plain-text formats for Apertium data
Why Google and Apertium should sponsor it
Improving existing language pairs and adding new ones is a continuous process in the Apertium project. Dictionaries and transfer rules are currently written in XML, which some developers find difficult and which makes development slow for them. There is therefore a need to remove this complexity by providing a simple plain-text format for writing dictionaries and transfer rules.
How and who it will benefit in society
This will make developing transfer rules and dictionaries easier for many developers, and make the data easier to understand and contribute to. It will therefore help both the users of the Apertium project and its contributors, and it will definitely help the whole community grow.
Work plan
What work I have already done?
I have been involved in Apertium for the last month and have had many discussions about the project and the coding challenge on the IRC channel and the mailing list.
I have installed Apertium and joined the mailing list, the IRC channel, and SourceForge.
I have been working with the community members over the last month through IRC and the mailing list.
I have read the research papers on InterNostrum and MorphTrans [1] [2] and understood the formats.
Coding Challenge
I have successfully completed the coding challenge, and it has been reviewed by the mentor, Mikel.
The code is available on GitHub: https://github.com/ergaurav2/apertium-plain-text-format-coding-challenge
As part of the coding challenge [3], I have written a parser that converts a *.mode shell-script fragment into a modes.xml file. In addition to the challenge, I have also written a parser that converts a modes.xml file back into a *.mode fragment.
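To illustrate the *.mode to modes.xml direction, here is a minimal sketch in Python. It is not the code from the coding-challenge repository. It assumes that a *.mode fragment is a single shell pipeline, that each stage starts with the program name, and that any argument containing a "/" is a data file whose base name belongs in a `<file>` element; the function name `mode_to_xml` is made up for this example.

```python
import shlex
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def mode_to_xml(mode_name: str, pipeline: str) -> ET.Element:
    """Build a <mode> element from one shell pipeline string (simplified)."""
    mode = ET.Element("mode", name=mode_name, install="yes")
    pipeline_el = ET.SubElement(mode, "pipeline")
    # Naive split: a "|" inside quotes is not handled in this sketch.
    for stage in pipeline.split("|"):
        tokens = shlex.split(stage)
        if not tokens:
            continue
        command, args = tokens[0], tokens[1:]
        # Assumption: arguments containing "/" are data files; flags and
        # placeholders such as $1 or $2 stay part of the program name.
        name_parts = [command] + [a for a in args if "/" not in a]
        files = [a for a in args if "/" in a]
        program = ET.SubElement(pipeline_el, "program", name=" ".join(name_parts))
        for path in files:
            ET.SubElement(program, "file", name=PurePosixPath(path).name)
    return mode


if __name__ == "__main__":
    fragment = (
        "lt-proc /usr/share/apertium/apertium-es-ca/es-ca.automorf.bin | "
        "apertium-tagger -g $2 /usr/share/apertium/apertium-es-ca/es-ca.prob"
    )
    print(ET.tostring(mode_to_xml("es-ca", fragment), encoding="unicode"))
```

The reverse direction would walk the `<program>` and `<file>` elements and join them back into a pipeline, which needs the installation prefix as an extra input because the sketch above discards it.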
As Mikel suggested that it would be good to look at the coding challenge for the project "Unify the Metadix formats", I have also attempted that coding challenge. It is available on GitHub: https://github.com/ergaurav2/apertium-unify-metadix-coding-challenge
Project Understanding
The project consists mainly of two parts:
1. Round-trip conversion of the transfer rules.
2. Round-trip conversion of the dictionary files.
Problem with Current Scenario:
Currently, both the transfer rules and the dictionary files are in XML.
Many developers are comfortable with these XML formats, but some find it easier to write the data in text formats.
Solution/Approach:
Part 1: Conversion of the Transfer Rules
A MorphTrans-style text format will be specified for the transfer rule XML files.
We will use the research paper on MorphTrans (http://www.internostrum.com/docum/morphtrans.ps) for this.
A tool will be developed in Java to convert MorphTrans-style text-format files into transfer rule XML files.
I will analyze the MorphTrans format further to identify anything it does not cover that needs taking care of.
A tool will be developed using XSLT to convert the current transfer rule XML files into MorphTrans-style text-format rule files.
A round-trip check tool will convert from text to XML and then from XML back to text, or vice versa, depending on which form was modified. It will also validate the XML, to confirm that no error is introduced during the conversion from one form to the other.
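The round-trip idea can be sketched as follows, in Python for brevity (the proposal itself plans Java and XSLT). The converters `toy_text_to_xml` and `toy_xml_to_text` are hypothetical stand-ins using an invented "key = value" format, not MorphTrans; only the shape of the check is the point. The validation step here tests well-formedness only, since checking against a DTD would need a library outside the Python standard library.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def round_trip_ok(source: str,
                  forward: Callable[[str], str],
                  backward: Callable[[str], str]) -> bool:
    """Return True if backward(forward(source)) reproduces source."""
    restored = backward(forward(source))
    # Compare ignoring leading/trailing whitespace only.
    return restored.strip() == source.strip()


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML (no DTD validation)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical toy converters, used only to exercise the check.
def toy_text_to_xml(text: str) -> str:
    root = ET.Element("entries")
    for line in text.strip().splitlines():
        key, value = (part.strip() for part in line.split("=", 1))
        ET.SubElement(root, "entry", key=key).text = value
    return ET.tostring(root, encoding="unicode")


def toy_xml_to_text(xml_text: str) -> str:
    root = ET.fromstring(xml_text)
    return "\n".join(f"{e.get('key')} = {e.text}" for e in root)


if __name__ == "__main__":
    text = "a = 1\nb = 2"
    xml_form = toy_text_to_xml(text)
    print(is_well_formed(xml_form))
    print(round_trip_ok(text, toy_text_to_xml, toy_xml_to_text))
    print(round_trip_ok(xml_form, toy_xml_to_text, toy_text_to_xml))
```

Passing the two converters as parameters lets the same check run in either direction, which matches the "text to XML to text, or vice versa" wording above.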
Part 2: Conversion of the Dictionary
A morphological dictionary text format will be specified for the dictionary XML files.
We will use the research paper available at http://www.sepln.org/revistaSEPLN/revista/25/25-Pag93.pdf for this.
A tool will be developed in Java to convert morphological dictionary text-format files into dictionary XML files.
A tool will be developed using XSLT to convert the current dictionary XML files into morphological dictionary text-format files.
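As a rough illustration of the text-to-XML direction, the sketch below (in Python, not the planned Java tool) turns lines of an invented format, `lemma ; stem ; paradigm`, into `<e>` entries inside a `<section>` element. The line format is purely hypothetical and is not the format from the paper cited above; a real dictionary file also contains an alphabet, symbol definitions and paradigm definitions, which this sketch leaves out.

```python
import xml.etree.ElementTree as ET


def entries_to_section(text: str) -> ET.Element:
    """Convert 'lemma ; stem ; paradigm' lines into a <section> of <e> entries."""
    section = ET.Element("section", id="main", type="standard")
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop '#' comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 'lemma ; stem ; paradigm'")
        lemma, stem, paradigm = fields
        entry = ET.SubElement(section, "e", lm=lemma)
        ET.SubElement(entry, "i").text = stem       # invariant part of the word
        ET.SubElement(entry, "par", n=paradigm)     # reference to a paradigm
    return section


if __name__ == "__main__":
    sample = "house ; house ; house__n\nwolf ; wol ; wol/f__n\n"
    print(ET.tostring(entries_to_section(sample), encoding="unicode"))
```

Reporting the line number on malformed input matters for a hand-edited text format, since the author never sees the generated XML.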
The Metadix format is also a form of XML, so converting a metadictionary into a text format will need only one XSLT stylesheet. The text format for the metadictionary will, however, be different from the text format for the dictionary.
To convert the metadictionary text format into the dictionary XML format, we will first convert it into the metadictionary XML format and then apply the existing pre-processing to convert that into the dictionary XML format.
As with the transfer rules, a round-trip check tool will convert from text to XML and then from XML back to text, or vice versa, depending on which form was modified. It will also validate the XML, to confirm that no error is introduced during the conversion from one form to the other.
We can also update the Makefile so that, if a user has modified either format, running make updates the other format as well. The user will, however, need to run make before compiling the dictionary for the changes to take effect.
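Two files that are each generated from the other form a circular dependency, so something has to decide which side is out of date. The sketch below shows one possible decision rule in Python, based on modification times in the way make compares timestamps. It is an assumption about how the Makefile step could work, not part of the proposal, and the function name `stale_side` is invented.

```python
from pathlib import Path
from typing import Optional


def stale_side(text_path: Path, xml_path: Path) -> Optional[str]:
    """Return "xml" or "text" for the file to regenerate, or None if in sync."""
    text_exists, xml_exists = text_path.exists(), xml_path.exists()
    if not text_exists and not xml_exists:
        raise FileNotFoundError("neither the text nor the XML file exists")
    if not xml_exists:
        return "xml"
    if not text_exists:
        return "text"
    text_mtime = text_path.stat().st_mtime
    xml_mtime = xml_path.stat().st_mtime
    if text_mtime > xml_mtime:
        return "xml"    # text was edited last, so the XML is out of date
    if xml_mtime > text_mtime:
        return "text"   # XML was edited last, so the text is out of date
    return None
```

A rule like this cannot detect the case where both files were edited since the last conversion; the older edit would be silently overwritten, so a real tool might refuse to proceed in that situation.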
Work Timeline
Community Bonding Period:
Create a wiki page about the project, so that all information about its progress is kept there.
Investigate the dictionary and transfer rule files further.
Gather the resources needed to start the project.
Week 1:
Finalize the MorphTrans-style text-format for the transfer rules.
Week 2:
Develop the tool in Java for the conversion of the MorphTrans-style text-format files into the transfer rule XML.
Week 3:
Develop the tool using XSLT to convert transfer rule XML files into MorphTrans-style text-format rule files.
Update the Makefile so that both formats are kept up to date.
Week 4:
Create the round-trip checker and validate both directions of conversion.
Deliverable #1: The final tool for two-way conversion of the transfer rule format
Week 5:
Finalize the Morphological dictionary text-format for the Dictionary Files.
Week 6:
Develop the tool in Java for the conversion of the Morphological dictionary text-format files into the dictionary XML.
Week 7:
Develop the tool using XSLT to convert dictionary XML files into morphological dictionary text-format files.
Update the Makefile so that both formats are kept up to date.
Week 8:
Create the round-trip checker and validate both directions of conversion.
Deliverable #2: The final tool for two-way conversion of the dictionary file format
Week 9:
Finalize the Morphological dictionary text-format for the meta Dictionary Files.
Update the tool that was created for the dictionary files so that it also converts morphological dictionary text-format files into meta dictionary XML.
Week 10:
Update the tool that was created for the dictionary files so that it also converts dictionary XML files into morphological dictionary text-format files.
Update the Makefile so that both formats are kept up to date.
Week 11:
Create the round-trip checker and validate both directions of conversion.
Week 12:
Final Documentation
List your skills and give evidence of your qualifications.
Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.
I am a Master's student in Computer Science and Engineering at IIIT, Hyderabad, with an interest in NLP and machine translation. My resume is available on my web page: http://web.iiit.ac.in/~gaurav.agrawal/
I work on the Linux operating system and have a good knowledge of the command line.
I have two years of industrial experience on the DALI project for Airbus, which involves processing large input XML files: treating them, performing transformations and validation, and generating the desired output XML files.
I have done a project on indexing Wiki data and providing a search engine for it, using Java and XML parsing. GitHub link
I have a good knowledge of writing shell scripts, as I have taken the course Scripting and the Computing Environment. GitHub link
I have also worked on a Python project to create a placement portal. GitHub link
I also have a working knowledge of C/C++; some of my work from the Operating Systems course is available on GitHub: https://github.com/ergaurav2/OS
List non-Summer-of-Code plans you have for the Summer
Especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. We would like to be sure you have at least 30 free hours a week to develop for our project.
My summer holidays run from 12 May 2014 to 27 July 2014, which will be the main period for working on the project. During this time I will be working on the project full time and will be able to devote at least 40 hours a week to it, or more as the need arises.
My classes will start on 1 August 2014 (only around 2-3 hours will be spent in classes on weekdays). As it will be the beginning of a new session, classes will have no impact on the project work, and the project will be in its completion phase by then. I can assure you that during this period too I will be able to devote at least around 40 hours a week to the project.