User:Ergaurav3/GSOC Application1:Unify the metadix formats
Contents
- 1 Contact Information
- 2 Why is it you are interested in machine translation?
- 3 Why is it that you are interested in the Apertium project?
- 4 Which of the published tasks are you interested in? What do you plan to do?
- 5 Title
- 6 Why Google and Apertium should sponsor it
- 7 How and who it will benefit in society
- 8 Work plan
Contact Information
Name: Gaurav Agrawal
Email: ergaurav2@gmail.com
IRC: ergaurav2
GitHub: https://github.com/ergaurav2
Why is it you are interested in machine translation?
Machine translation makes communication possible among different parts of the world. It is what lets people who do not share a language communicate and understand each other.
Human translators are not always available, and relying on them is often not feasible.
It is an interesting and emerging field with a lot of work to do and a lot to learn. It has an impact not only on one part of society but on the whole world.
Why is it that you are interested in the Apertium project?
Since I am interested in machine translation, open source is the best way for me to explore and learn more about the field. In open source, people from different regions contribute, which is at the heart of a good machine translation system. Apertium is the best open-source project I have found in this field: it is easy to understand and contribute to, and it has very active and supportive contributors.
Which of the published tasks are you interested in? What do you plan to do?
Title
Unify the Metadix Formats
Why Google and Apertium should sponsor it
Improving existing language pairs and adding new ones is a continuous process in the Apertium project. The meta dictionaries (metadix) vary from one language pair to another, and this variation grows as new language pairs are added. It is also becoming more complex, and the metadix format used tends to depend on the developer of the language pair. There is therefore a need to remove this complexity and variation to keep things simple for the system.
How and who it will benefit in society
This will make language pairs that use the meta dictionary formats easier to use, understand, and contribute to. It will therefore help both the users of the Apertium project and its contributors. This will definitely help the whole community grow.
Work plan
What work have I already done?
I have been involved in Apertium for the last month and have had many discussions about the project and the coding challenge on the IRC channel and the mailing list.
Installed Apertium and joined the mailing list, IRC, and SourceForge
Worked with community members over the last month through IRC and the mailing list
Understanding of the dictionary formats
Understanding of the different types of metadictionary formats
Role of the alternative/variant (alt) in the dictionary.
Coding Challenge:
I have successfully completed the coding challenge, and it has been reviewed by the mentor. It is available on GitHub.
Findings on the project topics:
Metadictionary formats: The metadix files must be pre-processed with XSLT stylesheets to convert them into the dictionary format.
The language pairs en-ca, en-es, fr-es, oc-ca, and oc-es contain metadix files.
The metadix files are converted into the dictionary format using essentially two XSLT files, buscPar.xsl and principal.xsl.
buscPar.xsl: This is used for sorting and for creating an intermediate XSL file with the list of verbs that use metaparadigms; uniq is also applied to delete duplicates.
principal.xsl: This stylesheet contains the rules for converting the <par> and <sa> tags to produce the dictionary format (.dix).
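To make the "collect, sort, uniq" idea behind buscPar.xsl concrete, here is a minimal Python sketch. It is not the stylesheet itself and does not reproduce its output format. It assumes that an entry "uses a metaparadigm" when its <par> element carries a prm attribute; the function name metaparadigm_uses and the sample entries are hypothetical.

```python
import xml.etree.ElementTree as ET


def metaparadigm_uses(metadix_xml):
    """Return the sorted, de-duplicated (paradigm name, prm value) pairs.

    Assumption: a metaparadigm use is a <par> element with a prm attribute.
    """
    root = ET.fromstring(metadix_xml)
    pairs = {
        (par.get("n"), par.get("prm"))
        for par in root.iter("par")
        if par.get("prm") is not None
    }
    # The set removes duplicates (the "uniq" step); sorted() orders the result.
    return sorted(pairs)


# Hypothetical metadix fragment, for illustration only.
SAMPLE_METADIX = """
<dictionary>
  <section id="main" type="standard">
    <e lm="verb_one"><i>one</i><par n="demo__vblex" prm="lh"/></e>
    <e lm="verb_two"><i>two</i><par n="demo__vblex" prm="lh"/></e>
    <e lm="time"><i>time</i><par n="house__n"/></e>
    <e lm="verb_three"><i>three</i><par n="demo__vblex" prm="t"/></e>
  </section>
</dictionary>
"""

print(metaparadigm_uses(SAMPLE_METADIX))
# [('demo__vblex', 'lh'), ('demo__vblex', 't')]
```

Each pair in the result is one concrete paradigm that a later step would need to generate from the metaparadigm.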
There are two types of meta dictionary (metadix) files:
One contains <prm/> tags and the other contains <sa/> tags.
<sa/> tags: en-ca, en-es (en.metadix)
<prm/> tags: br-fr, fr-es, oc-ca, oc-es (fr.metadix, oc.metadix)
The <prm/> tag is a marker used to place a variable part of the text in the paradigm definition. The <sa/> tag is placed where an optional grammatical symbol (tag) should appear.
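The two substitutions can be sketched in a few lines of Python. This is only an illustration of the idea, not what principal.xsl actually does: the function name expand_pardef, the helper _splice_text, and the contents of both sample paradigms are hypothetical, and the sketch assumes that an <sa/> marker with no value supplied is simply dropped.

```python
import copy
import xml.etree.ElementTree as ET


def _splice_text(parent, child, text):
    """Remove child from parent, leaving text (plus the child's tail) in its place."""
    text = text + (child.tail or "")
    siblings = list(parent)
    index = siblings.index(child)
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = siblings[index - 1]
        previous.tail = (previous.tail or "") + text
    parent.remove(child)


def expand_pardef(pardef, sa=None, prm=""):
    """Return a copy of pardef with its <sa/> and <prm/> markers resolved."""
    result = copy.deepcopy(pardef)
    for parent in list(result.iter()):
        for child in list(parent):
            if child.tag == "prm":
                # <prm/> becomes plain text.
                _splice_text(parent, child, prm)
            elif child.tag == "sa":
                if sa is None:
                    # Assumption: no value given, so the optional symbol is omitted.
                    _splice_text(parent, child, "")
                else:
                    # <sa/> becomes <s n="..."/> in the same position.
                    child.tag = "s"
                    child.attrib.clear()
                    child.set("n", sa)
    return result


# Hypothetical paradigm bodies, for illustration only.
HOUSE_N = ET.fromstring(
    '<pardef n="house__n">'
    '<e><p><l/><r><s n="n"/><sa/><s n="sg"/></r></p></e>'
    '</pardef>'
)
DEMO_VBLEX = ET.fromstring(
    '<pardef n="demo__vblex">'
    '<e><p><l>e<prm/>ir</l><r>e<prm/>er<s n="vblex"/></r></p></e>'
    '</pardef>'
)

with_symbol = expand_pardef(HOUSE_N, sa="unc")
print([s.get("n") for s in with_symbol.find("e/p/r")])  # ['n', 'unc', 'sg']

with_text = expand_pardef(DEMO_VBLEX, prm="lh")
print(with_text.find("e/p/l").text)  # elhir
```

The original paradigm is deep-copied, so one metaparadigm can be expanded several times with different values.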
For the case of <sa/>, we have: <e lm="time"><i>time</i><par n="house__n" sa="unc"/></e>
In the pardef, <sa/> will be replaced by <s n="unc"/>.
For the case <prm/>, if we have something like:
<e lm="acuélher"><i>acu</i><par n="m/é[T]er__vblex" prm="lh"/></e>
In the pardef, <prm/> will be replaced by the string "lh".
Alternative/Variant (“alt” or the “v” attribute):
I looked at three language pairs to study the behaviour of the alt and v attributes: oc-ca, es-ca, and es-pt.
alt.xsl is used for conversion based on the “alt” attribute.
filter.xsl is used for conversion based on the “v” attribute.
oc-ca has no filter.xsl; everything is done with alt.xsl alone.
es-pt has no alt.xsl; everything is done with filter.xsl alone.
es-ca has both alt.xsl and filter.xsl.
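As a rough illustration of what selecting a variant by attribute could look like, here is a minimal Python sketch. It is not a port of alt.xsl or filter.xsl. It assumes that an element is kept when it has no such attribute or when the attribute equals the requested value, and that it is dropped otherwise; the function name select_variant, the sample entries, and the values variant_a and variant_b are hypothetical.

```python
import xml.etree.ElementTree as ET


def select_variant(dix_xml, attr, value):
    """Drop every element whose attr is set to something other than value.

    Assumption: elements without the attribute are shared by all variants.
    """
    root = ET.fromstring(dix_xml)
    for parent in list(root.iter()):
        for child in list(parent):
            marked = child.get(attr)
            if marked is not None and marked != value:
                parent.remove(child)
    return root


# Hypothetical dictionary fragment, for illustration only.
SAMPLE_DIX = (
    '<dictionary><section id="main" type="standard">'
    '<e lm="shared"><i>shared</i></e>'
    '<e lm="only_a" alt="variant_a"><i>only_a</i></e>'
    '<e lm="only_b" alt="variant_b"><i>only_b</i></e>'
    '</section></dictionary>'
)

kept = select_variant(SAMPLE_DIX, "alt", "variant_a")
print([e.get("lm") for e in kept.iter("e")])  # ['shared', 'only_a']
```

The same function covers both attributes by passing "alt" or "v" as attr, which suggests that a unified format might need only one selection mechanism; whether the two stylesheets really behave identically would have to be checked against their source.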