User:Cassarani/GSoC2010


Why is it you are interested in machine translation?

I've been interested in machine translation ever since I first approached programming and computer science. I'm a native speaker of Italian and English, and in the past I've often volunteered as a translator between the two languages, most notably for the open source desktop environment KDE (www.kde.org) and the online game Popmundo (www.popmundo.com). Being a computer scientist at heart, I've always dreamed of applying my linguistic knowledge to develop a valid alternative to the products I had the chance to try as a teenager (e.g. the infamous AltaVista BabelFish). In my first year at University, I started reading up on NLP and MT, mainly by picking random books from the University library. I found "An Introduction to Machine Translation" (1992) by Hutchins and Somers particularly insightful, if slightly out of fashion. Wanting to know more about the linguistic background to MT, I read "Chomsky's Universal Grammar: An Introduction" (2007) by Cook and Newson, which definitely shed light on the theory behind modern-day MT.

Why is it that you are interested in the Apertium project?

I am interested in the Apertium project because I think open source projects can take full advantage of the combined linguistic knowledge of developers from all around the world. Whilst some open source projects may suffer from "design by committee" problems, I reckon that this is not the case for any projects in the field of NLP, where the expertise of bi- or multilingual contributors will tend to outperform commercial products that are naturally limited in the extent of their knowledge base. Another reason why I would love to contribute to Apertium is that I've been meaning to work in the field of NLP for a really long time, and this seems like the perfect project for me to start making a real difference and learn more about the subject matter.

Which of the published tasks are you interested in?

I am interested in writing a format filter to extend the range of file formats supported by Apertium. My work would focus on adding filters for files marked up with LaTeX or MediaWiki. The latter could potentially be adapted to also support the Markdown formatting language (www.daringfireball.net/projects/markdown/), which is used by many wiki platforms other than MediaWiki.

What do you plan to do?

Project title: "Format filters for LaTeX and MediaWiki"

Work plan

Community bonding period: I will familiarise myself with Apertium and the superblanks system. In particular, I will look at how the pre-existing de- and reformatters work and why the approach taken with apertium-desmediawiki failed. At the same time, I will be looking at the implementation of pre-existing LaTeX and MediaWiki parsers from other open source projects. By the start of the coding phase, I will have identified what kind of approach can be adapted to the way Apertium handles format filters, including most of the format and/or replacement rules specific to the two markup languages, and the most likely priority thereof.

Week 1: Start work on apertium-deslatex, create the relevant files and specify the appropriate options; get familiar with the rules-based system and make sure I understand how everything works; write a few simple rules and test them on very basic input.

Week 2: Write the format rules for apertium-deslatex.

Week 3: Write the replacement rules for apertium-deslatex in order of priority.

Week 4: Conduct extensive testing of apertium-deslatex on both artificial and real-world input. Whilst I plan to test my work incrementally to make sure my approach isn't flawed, this will highlight any potential bugs that I may have previously overlooked.

Deliverable #1: A working implementation of apertium-deslatex.

Weeks 5 to 8: Follow the same schedule as weeks 1-4 for apertium-desmediawiki2 (temporary name, to be replaced by one agreed with my mentor). Learn from my experience writing apertium-deslatex and potentially change the order in which I write the format filter.

Deliverable #2: A working implementation of apertium-desmediawiki2.

Weeks 9 and 10: Go through the list of known bugs in apertium-deslatex and apertium-desmediawiki2 and fix them.

Week 11: Write any documentation that my mentor requires. If there is enough time left, look at how to adapt apertium-desmediawiki2 to deal with Markdown input.

Week 12: General clean-up and final evaluation.

I believe Google and Apertium should sponsor my project proposal because it would enable a future integration with products that support such commonly used markup languages as LaTeX and MediaWiki. In particular, the latter would make it possible to aid translators of Wikipedia (or indeed any other multilingual MediaWiki-based websites) by integrating Apertium into their editing workflow. In general, my work would allow the project to be adopted by users of new, previously unsupported platforms.


In the proposal, list your skills and give evidence of your qualifications. Tell us what your current field of study, major, etc. is.

I am currently (April 2010) in my first year of a four-year Computing degree at Imperial College London, UK. I have been programming since the age of 12 (I am now 20 years old), and I have experience with C/C++, Objective-C, Java, PHP, Haskell, Ruby and Python.

Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.

When I was 16, I contributed for about a year to the software development and (en-it) translation of the open source project KDE, which gave me a really good taste of what it means to work with open source software. It was a really good experience and I got to know many interesting people that I am still friends with, but I have not been able to work on any other open source projects since, as I've always had summer jobs and therefore no spare time that I could devote to them. Google Summer of Code would allow me to make open source development my summer job, and it would ideally give me the chance to enter a community with the long-term goal of becoming a regular contributor to the project. In particular, having completed my format filter project, I would love to apply my bilingualism to improving the en-it module next, hopefully making the final adjustments that would allow it to be released.

Please list any non-Summer-of-Code plans you have for the Summer, especially employment and class-taking. Be specific about schedules and time commitments. We would like to be sure you have at least 30 free hours a week to develop for our project.

My University-related obligations end on 18th June, but all my exams are well before the start of GSoC, and I only have a handful of lectures in the Summer term (May and June). I am moving house in the first week of July, but I otherwise have no commitments that will prevent me from finding at least 30 hours a week to devote to the project.