User:Mikel/GSoC 2012 Application

Revision as of 19:15, 1 April 2012 by Mikel

Contact information

Name: Mikel Artetxe

E-mail address: artetxem@gmail.com

Other contact information would be provided privately to the mentor.

Why is it that you are interested in machine translation?

Computing and languages are two of my passions, and machine translation is where both fields meet.

In particular, I am a native speaker of Basque, a minority language isolate of unknown origin, and I have always been fascinated by the peculiarities that set it apart from the languages surrounding it. At the same time, I am aware of the challenge that preserving minority languages of this kind represents in an increasingly globalized world. As a technology enthusiast, I believe that machine translation can play an important role in that, providing a way to break down these barriers so that any language can have its place in the world.

Why is it that you are interested in the Apertium project?

First of all, I admire the emphasis that the Apertium project puts on minor languages since, as I said before, I think that adapting these languages to our times is the key to guaranteeing their preservation, and machine translation is an important field for that. In this respect, Apertium was the first software to provide a Basque-Spanish and Basque-English translation system, so it is also a project close to me in that sense.

At the same time, I support free software, although I have never actively taken part in an open source project. Therefore, I consider this to be an excellent opportunity to become a part of this beautiful world by contributing to something that I have benefited from as a user.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in the task named “Make lttoolbox-java embeddable”.

Currently, lttoolbox-java is only usable from the command line, and it relies on external resources of the language pair to be translated (which must be downloaded and compiled by the user separately). The aim of this task would be to overcome this so that we could have self-contained JAR files to translate a language pair that could easily be integrated in larger Java projects.

Specifically, I would first adapt lttoolbox-java so that it can directly use embedded resources (note that, for the coding challenge, the solution that I adopted was to copy them to a temporary folder, but dealing with them directly is a more reasonable long-term approach), and work on an easy-to-use, easy-to-maintain solution to offer this functionality to external programs and end users. Later, I would work on embeddability for different platforms, namely mobile platforms (paying special attention to Android), servlets and applets. A short two-week interlude is also planned to work on the embeddability of the C++ version, focusing on iOS.
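As a rough illustration of what "using embedded resources directly" could mean in plain Java, the following sketch reads a file packaged inside the JAR through the class loader instead of through the file system. The class name `EmbeddedResourceReader` and its methods are hypothetical and are not part of lttoolbox-java; the sketch only assumes the standard `ClassLoader.getResourceAsStream` call.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of lttoolbox-java: reads a resource that is
// packaged on the classpath (for example inside the JAR) without touching disk.
public class EmbeddedResourceReader {

    // Reads the classpath resource at the given path (relative to the
    // classpath root) fully into memory.
    public static byte[] readResource(String path) throws IOException {
        try (InputStream in =
                 EmbeddedResourceReader.class.getClassLoader().getResourceAsStream(path)) {
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + path);
            }
            return readAll(in);
        }
    }

    // Drains an input stream into a byte array.
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Code that currently opens files by name would need to accept a stream or a byte array instead, which is presumably where most of the adaptation effort would go.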

Why should Google and Apertium sponsor it? How and who will it benefit in society?

I think that this project would notably help Apertium get closer to the general public. Currently, making Apertium work on a machine is a pretty complicated task for a newcomer, especially under Windows. In this respect, providing a JAR file that would just work by double-clicking on it with the only prerequisite of having a JVM installed would mean a much friendlier way to get into it. Additionally, it would open the doors to integrating Apertium into different platforms and devices such as the increasingly popular smartphones and tablets in which the availability of an offline translator would be awesome.

Work that I have already done

I have solved the coding challenge proposed at the Apertium wiki, which consisted of developing a self-contained JAR executable to translate a specific language pair (Basque-English was my choice). The solution adopted basically consisted of copying the embedded resources of the language pair into a temporary directory at runtime. The executable JAR file can be downloaded from here and the source from here.
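The temporary-directory approach can be sketched roughly as follows. This is not the actual coding challenge solution; the class `TempResourceExtractor` and the file names used with it are hypothetical, and the sketch assumes Java 7 or later for `java.nio.file`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper illustrating the "copy to a temporary directory" idea.
public class TempResourceExtractor {

    // Copies the given stream into tempDir under fileName and returns the
    // resulting path. The file is marked for deletion when the JVM exits.
    public static Path extract(InputStream resource, Path tempDir, String fileName)
            throws IOException {
        Path target = tempDir.resolve(fileName);
        Files.copy(resource, target, StandardCopyOption.REPLACE_EXISTING);
        target.toFile().deleteOnExit();
        return target;
    }

    // Looks up a resource on the classpath and extracts it into tempDir,
    // keeping only the last segment of the resource path as the file name.
    public static Path extractFromClasspath(String resourcePath, Path tempDir)
            throws IOException {
        try (InputStream in =
                 TempResourceExtractor.class.getClassLoader().getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + resourcePath);
            }
            String fileName = resourcePath.substring(resourcePath.lastIndexOf('/') + 1);
            return extract(in, tempDir, fileName);
        }
    }
}
```

The extracted paths can then be handed to code that expects ordinary files, at the cost of extra disk I/O on every start.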

I also worked on a prototype iOS app prior to that, since my first idea was to work on bringing Apertium to iOS (afterwards I realized that the task named “Apertium on your mobile” was mainly focusing on Android and moved to this other task that better suited my skills). Regarding this, I successfully compiled Apertium for iOS solving all of its dependencies, and saw the challenges that the development of such a task would entail. In general terms, the prototype app worked correctly, although something was amiss at the transfer stage which led to empty translations.

All of this allowed me to familiarize myself with Apertium and lttoolbox-java and to see the challenges that the task for which I am applying would involve. At the same time, I contacted my would-be mentor, Jacob Nordfalk, and we talked about which approach to follow and decided on a work plan.

Work plan

- Week 1-3: Adapt lttoolbox-java so that it can directly work with embedded files without the need to copy them to a temporary directory as in the solution proposed for the coding challenge.

- Week 3-4: Make an API class that would easily allow the translation of an embedded language pair. Work on a demo JAR executable usable from the command line that would make use of it with a specific language pair. Time permitting, work on an API class that would allow access to the intermediary translation stages. At the same time, decide, in conjunction with the Apertium organization, on how to organize easily downloadable precompiled language pairs for users (SourceForge, website…) and easy maintenance for developers (makefile, script…).

- Deliverable #1: The above mentioned JAR executable.

- Week 5: Implement what has been decided in week 4 with the Apertium community.

- Week 6: Make a small user-oriented GUI application for translation (something similar to apertium-tolk). The idea is that any user could easily use it, with the only prerequisite being an installed JVM. It would consist of an applet or, preferably, a Java Web Start application.

- Week 7-8: Work on the embeddability of the C++ version of Apertium and lttoolbox, focusing on iOS.

- Deliverable #2: The application developed in week 6 and what has been produced regarding the iOS port (code, documentation and, hopefully, a working app).

- Week 9-11: Work on mobile embeddability (paying special attention to Android) as well as on servlets and applets. If the work regarding iOS has been fruitful, we could consider taking some more time on it during these weeks should it be worth it.

- Week 12: Suggested "pencils down" date: write documentation, test everything, etc.


Extra tasks (these wouldn't be planned, but I could be working on them if I finish before expected):

- Integration of the JAR files with apertium-viewer.

- Investigate the possibility of reducing loading time with memory-mapping techniques.
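For the memory-mapping idea, a minimal sketch using the standard `FileChannel.map` call is shown below. The class name `MappedFileLoader` is hypothetical, and whether this would actually reduce loading time for language pair data is exactly what would need to be investigated. Note also that this call maps a regular file, so it would not apply as-is to a resource that is still packed inside a JAR.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: maps a whole file into memory in read-only mode.
public class MappedFileLoader {

    // Returns a read-only mapping of the file. The mapping stays valid after
    // the channel is closed; pages are loaded lazily by the operating system.
    public static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```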

Skills and qualifications

I am a second year undergraduate in Computer Engineering at the University of the Basque Country.

Java is the language with which I learned to program before entering university, as well as the main programming language used in my college courses, so it is the language that I feel most comfortable with. Moreover, I developed a Java program for language learning for Morris Academy as a personal project together with Mikel Morris (author, among others, of the Morris Magnum English-Basque dictionary, the largest bilingual dictionary in Basque to date, as well as the most complete online dictionary available on the Basque Government website at http://www.euskadi.net/morris/). Since last summer, we have been working in our spare time to create an iOS app that adds many other features to it as part of the publishing arm of Morris Academy. Thanks to this, I have learnt about mobile development in general and about iOS in particular.

Apart from the above-mentioned Java and Objective-C programming languages, I have learnt Ada and C at university. Additionally, I have been studying C++ on my own. More closely related to machine translation, I have been taking a course on automata theory and finite state machines. During the course, I worked on a project to extend JFLAP, basically adding the capability of automatically generating automata that correspond to the result of operations between the languages denoted by other existing automata.

Unfortunately, I have never worked on an open source project, except for the above-mentioned extension of JFLAP (but even on that occasion, I worked on my own and not together with a community, since the purpose of the work was merely academic). In fact, this is one of my motivations for applying to GSoC, since I am really interested in open source development and this provides an excellent opportunity to get to know this world.

Non-Summer-of-Code plans

Google Summer of Code would be my main plan for this summer. However, I might be a bit busy during the first two weeks of the program (and, especially, during the first one) because I will be taking some exams at university. In any case, I still expect to have a minimum of 30-40 hours for GSoC during these two weeks.