User:Kanmuri/GSoC 2010 Application/Java Runtime Port
Name: Stephen Tigner
E-mail address: stephen.tigner@gmail.com
Other information that may be useful to contact you: Kanmuri on Freenode IRC (Will provide other contact information directly and privately to project mentors. I don't wish to post such publicly, however.)
Why is it you are interested in machine translation?
The promise of machine translation (along with the internet) is that eventually all the world's information and knowledge will be available to everyone, regardless of location, nationality, or language. I want to help fulfill that promise.
Breaking down the language barrier so that people can communicate across it also interests me. I'm certain there are people throughout the world who share my interests, but we may never be able to connect with each other because of that barrier.
Why is it that you are interested in the Apertium project?
I'm trying to decide which field of computer science I want to pursue further. One field that has piqued my interest is computational linguistics, but I know next to nothing about linguistics. I hope to learn by working with Apertium, which should help me find out whether this is really the path I want to pursue.
Also, as I said above, Apertium is working toward fulfilling the promise of machine translation.
Qualifications and previous experience
I have my BS in CS (I did my final project in Java), and am currently studying abroad in Japan for a year, returning home (to the US) in late May. My programming experience has been mostly in-class work and small utilities or scripts I've written for my own use.
One exception was a "MUCK emulation" layer for a MOO. Basically it was creating a set of utilities that were written from scratch to emulate the operation as well as look and feel of some of the most common commands and utilities on MUCKs. This involved digging into the MOO core, understanding how it worked, and extending it as needed to support the MUCK style of operation, while not breaking the existing MOO-style utilities for users who wanted to use them.
(If you don't know what I'm talking about, that's fine. They're basically text-based shared user environments. MOO is an object-oriented language for creating these shared environments. The term refers to both the language and the environment itself. MUDs, MUCKs, MUSHes, MOOs, etc., are the predecessors of today's MMOs, and many still see somewhat active use even today.)
As far as languages relevant to this project go, I do have experience in both Java and C++.
My final project for my CS degree was written in Java. It was a simulator of an automated sports field painter (configured for a soccer field), with an implementation of a text-based communications protocol that would have been used for the real device if we had been able to build a prototype.
For C++, I took courses that used the language and focused a lot on the differences and gotchas for those used to Java. I also used C++ for the programming projects in the OpenGL graphics classes I took as electives.
As for open-source contributions, I haven't really contributed to open-source projects before, unfortunately.
Which of the published tasks are you interested in? What do you plan to do?
Java port of Apertium runtime
I'm interested in this one for several reasons.
1) This would make it much easier to run on non-*nix platforms (like Windows).
2) Having it in Java means it could potentially be used as a library for applications in the mobile or embedded space on devices that support Java.
3) Java is the language I feel I am the most comfortable with right now.
4) There are many others like myself who are more comfortable with, and more likely to succeed at, experimenting with Java code than C++. Thus this would also allow for and encourage more people to contribute to improving the actual Apertium engine at a low level.
Work plan:
Over spring break, I started reading the Apertium documentation on the wiki, and completed a code challenge posted on the GSoC wiki page that used dixtools as a library to implement a tool that checks for invalid stems. I won't have time for much additional coding until school gets out, however. (It can currently be found at: http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-dixtools/stem-checker/ Sorry if it got stuck somewhere weird. ^^; )
Preliminary Schedule (Week of [Monday]): This schedule is based on feedback from mentors about what is expected for the project. I have taken the time estimates given to me and roughly doubled or tripled them.
Community Bonding Period
HMM/tagger: Understanding the principles of HMMs in tagging, and understanding how the tagger is coded.
May
May 24th: Not available (will be returning from Japan, settling back in, and then attending an event)
May 31st: Tagger: Debugging the C++ code with a C++ debugger and the Java code with a Java debugger, making sure the Java code does the same thing line for line. / Fixing errors in the original ported draft
June
June 7th: Continuation of the previous week.
June 14th: Continuation of the previous two weeks
June 21st: Regression testing/QA of tagger, making sure it's 100% compatible (testing Java version and C++ version on a range of .prob files on a corpus)
June 28th: Deformatter / reformatter for text
July
July 5th: Porting pretransfer.
July 12th: Mid-term evaluation, Porting interchunk
July 19th: Porting postchunk
July 26th: Regression testing/QA of pretransfer, interchunk, postchunk, making sure it's 100% compatible (testing Java version and C++ version on a range of inputs)
August
Aug 2nd: Continue regression testing
Aug 9th: Continue regression testing; clean up code for final release
Aug 16th: Everything finished by this date ("pencils down" date). Final evaluation.
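The regression-testing weeks above amount to running the C++ and Java versions on the same input and checking that the outputs are identical. A minimal sketch of that comparison step follows. The class name OutputComparer and its method are hypothetical, not part of Apertium, and it assumes each version's output has already been saved to a text file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical helper (not part of Apertium): compares two saved outputs line by line. */
public class OutputComparer {

    /**
     * Returns the zero-based index of the first line that differs,
     * or -1 if both outputs are identical. If one output is a strict
     * prefix of the other, the index of the first extra line is returned.
     */
    public static int firstMismatch(List<String> expected, List<String> actual) {
        int shared = Math.min(expected.size(), actual.size());
        for (int i = 0; i < shared; i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return i;
            }
        }
        return expected.size() == actual.size() ? -1 : shared;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java OutputComparer <cpp-output> <java-output>");
            return;
        }
        // Assumption: both files are UTF-8 text, one output line per line.
        List<String> expected = Files.readAllLines(Paths.get(args[0]));
        List<String> actual = Files.readAllLines(Paths.get(args[1]));
        int line = firstMismatch(expected, actual);
        System.out.println(line < 0
                ? "Outputs match"
                : "First difference at line " + (line + 1));
    }
}
```

Reporting only the first differing line keeps the check cheap on a large corpus; a real test harness would probably also print the two differing lines.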
Extra tasks at the end if there is time:
- Deformatter for HTML
- Optimization
- Generating a custom prepared .jar package per mode, so a language pair can be 'shipped' as a JAR file with everything included
- Parsing .modes files and doing the piping in Java instead of as a bash script
- Finding parts of pipe process that can be done in-process in Java, and doing in-process instead of spawning a subprocess
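As a rough illustration of the last two extra tasks, the sketch below splits a shell-style pipeline into its commands and runs each one as an in-process Java stage instead of a subprocess. Everything here is an assumption made for illustration: the class InProcessPipeline, the idea of modelling a stage as a String-to-String function, and the simplification that commands are separated by a plain `|` with no quoting. It does not reflect the actual .modes format or any existing Apertium code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: run a shell-style pipeline as in-process Java stages. */
public class InProcessPipeline {

    /** Splits "a -x | b | c" into ["a -x", "b", "c"]. Quoted '|' is NOT handled. */
    public static List<String> splitCommands(String pipeline) {
        List<String> commands = new ArrayList<>();
        for (String part : pipeline.split("\\|")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                commands.add(trimmed);
            }
        }
        return commands;
    }

    /**
     * Feeds the input through each stage in order. A stage is looked up by
     * the program name, i.e. the first word of the command; its arguments
     * are ignored in this sketch.
     */
    public static String run(String pipeline,
                             Map<String, UnaryOperator<String>> stages,
                             String input) {
        String text = input;
        for (String command : splitCommands(pipeline)) {
            String program = command.split("\\s+")[0];
            UnaryOperator<String> stage = stages.get(program);
            if (stage == null) {
                // A real implementation could fall back to spawning a subprocess here.
                throw new IllegalArgumentException("No in-process stage for: " + program);
            }
            text = stage.apply(text);
        }
        return text;
    }
}
```

Passing whole strings between stages is the simplest possible design; streaming between stages with Reader and Writer would be closer to how a pipe behaves, at the cost of more code.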