User:Dpenas/GSOC 2013 Application - Plain-text formats for application data

Contact information

Name: Darío Penas Sabín

E-mail address: dario.penas[at]udc.es

More private contact information can be provided to the mentor.

Why is it that you are interested in machine translation?

I’m a computer engineering student and programming is one of the activities I have always enjoyed the most. Last year I was involved in a natural language processing project that used tools such as NLTK and, even though it’s a really demanding area of study, I had a lot of fun learning new things.

Besides, I’m from Galicia, an autonomous community in the northwest of Spain with its own language, so I understand the problems and restrictions that less-spoken languages face.

Why is it that you are interested in the Apertium project?

I’m really amazed that an open source project like Apertium has put so much effort into providing its services for as many languages as it can, including those with a low number of speakers. It’s inspiring to know that there will always be a group of people working hard on “smaller” languages when big companies and even institutions don’t care. This, together with the reasons I gave above, made Apertium an attractive option for taking part in Google Summer of Code.

Also, I’ve been an active free software user for over five years and I would love to participate in an open source project like this one and make my contribution to the community.

Which of the published tasks are you interested in? What do you plan to do?

I plan to work on the proposed idea “Plain-text formats for Apertium data”.

Work that I have already done

I’ve installed and used Apertium to get familiar with how it works. I’ve also joined the IRC channel and subscribed to the mailing list to get to know the mentors and the community better. In addition, I’ve read the papers on InterNostrum [1] and MorphTrans [2], learned to use XSLT, and written some small examples to practise it.

During this process I had some questions about how the system works, so I contacted the mentor Mikel Forcada, who provided me with useful information.

Work plan

- Previous weeks: I'll use these weeks to learn more about the project and to research the best way of extending the current MorphTrans format so that it also covers the .t2x and .t3x files. I'll discuss this with the mentor as well as with the whole community, since it's a general matter that affects the whole project.

- Week 1-3: Since this will probably be the most difficult part, I'll spend around 3 weeks implementing the design we have agreed on and documenting everything. By the end of the 3rd week I should have a first version of the compiler that converts the MorphTrans-style .mt1, .mt2 and .mt3 files into XML.

- Week 3-4: Finishing the previous task and beginning the compiler for the reverse direction: taking the XML and generating the MorphTrans-style input using XSLT.

- Deliverable #1: The XML to MorphTrans compiler and a beta version of the reverse one.

- Week 5: Finishing the MorphTrans to XML compiler and its documentation.

- Week 6: Reading about and discussing once again the best way to specify the alphabet for InterNostrum and, by the end of the week, starting work on the compiler that will convert the .dix files into InterNostrum-style format.

- Week 7-8: Finishing that part and starting the reverse direction: converting InterNostrum-style format into .dix.

- Deliverable #2: The completed first compiler (from .dix into InterNostrum-style format) and a "working beta" of the other one.

- Week 9-10: Correcting the errors in the "beta" compiler and finishing it by the end of week 10.

- Week 11-12: Finishing the documentation, final testing, etc.

- Extra tasks:

If I'm able to finish the project sooner than expected, I would be glad to start working on these other tasks:

- Migrating the project from SVN to Git. Git is a really powerful tool that would let the community keep the project better organized, resolve conflicts faster and more easily, and maintain different development branches. I would also provide the necessary documentation for different audiences (beginner, intermediate and advanced), depending on their current knowledge of programming, SVN and Git, so that the change would be smoother.

- During development I will be working with much of the existing code. I could add comments to it and update its documentation wherever I find discrepancies.

- I could extend or update the Spanish-Galician language pair, whose latest update was on 10 October 2012, or the Galician-English one.
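As an illustration of the kind of round trip planned for Deliverable #2, here is a minimal sketch in Python. It rests on stated assumptions: the one-line `surface:lemma<tag>...` input format and the function names `text_to_entry` and `entry_to_text` are hypothetical and are not the actual InterNostrum format or the final design, and only the simplest kind of .dix entry (`<e><p><l>...</l><r>...<s n="..."/></r></p></e>`) is handled.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical plain-text entry (NOT the real InterNostrum format):
#   surface:lemma<tag1><tag2>...
LINE_RE = re.compile(
    r"^(?P<surface>[^:<>]+):(?P<lemma>[^:<>]+)(?P<tags>(?:<[^<>]+>)*)$"
)


def text_to_entry(line):
    """Build a simple .dix-style <e> element from one plain-text line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a valid plain-text entry: %r" % line)
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    left = ET.SubElement(pair, "l")
    left.text = match.group("surface")
    right = ET.SubElement(pair, "r")
    right.text = match.group("lemma")
    # Each <tag> becomes an <s n="tag"/> child of <r>, in order.
    for tag in re.findall(r"<([^<>]+)>", match.group("tags")):
        ET.SubElement(right, "s", n=tag)
    return entry


def entry_to_text(entry):
    """Reverse direction: turn a simple <e> element back into one line."""
    left = entry.find("p/l")
    right = entry.find("p/r")
    tags = "".join("<%s>" % s.get("n") for s in right.findall("s"))
    return "%s:%s%s" % (left.text, right.text, tags)


if __name__ == "__main__":
    element = text_to_entry("casas:casa<n><f><pl>")
    print(ET.tostring(element, encoding="unicode"))
    print(entry_to_text(element))
```

Checking that `entry_to_text(text_to_entry(line)) == line` for every line of a dictionary would be a cheap regression test for both compilers; real .dix files also contain paradigms, identity entries and multiword units, which this sketch deliberately ignores.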

Skills and qualifications

I’m in the 4th year of a computer engineering degree. I’m comfortable programming in C, Java and Python, as well as using Bison and Flex. I’ve also done some work in Pascal, Fortran, OCaml, MATLAB and Coq. I took a natural language processing course this year, and together with some friends I developed a piece of software that, given a simple question, looks it up on Google, evaluates the web results, and obtains the (more or less) correct answer(s) [3]. We've talked about continuing to develop the idea in the future to obtain better results.

I've also worked with compilers, and I've written some programs using Flex/Bison that can be found here [4] and here [5].

Non-Summer-of-Code plans

Google Summer of Code would be my main plan for the whole summer. My university exams finish on the 7th of June, so I might lose some time during the first week while getting to know the mentor, reading the documentation, etc. However, I'll be able to dedicate around 30 to 40 hours that week to the project.