User:Jmcejuela/GSoC11Application


I am a Master's student in Computer Science at the Technical University of Munich (TUM), currently in my fourth and last semester and about to start my Master's thesis. As I announced on the mailing list, my intention is to combine my thesis and the GSoC project into a single endeavor (if that is possible on both my university's side and Google's). I want to do both, but they overlap entirely in the European/German academic calendar and each requires full-time commitment, so it would be very difficult to do them independently.

I have a solid background in transducers and their mathematical foundations, and working extensively on them is my strongest motivation for this project. I come from a more training/learning-oriented world, whereas Apertium is rule-based; in addition, my thesis should expand the work of the GSoC project to meet the higher effort and academic requirements of a Master's thesis (exactly 6 months at TUM). For these reasons, my project expands and elaborates on an idea discussed with Jimregan: the use of transducers as a replacement for flag diacritics, as used in HFST, including a part on automatic topology learning to generate such transducers. Furthermore, I suggest an idea of my own, which mostly involves topology learning and weight training using one of the corpora listed on your corpora page, the Southeast European Times, which I find particularly interesting because of its aligned structure across multiple languages.

If you accept my proposal (or one of them), the organization of such a combined thesis/project would probably be as follows: Hasan Ibne Akram would be my official thesis advisor at TUM, while one of you would be my official mentor for the GSoC project. Please tell me if one of you would also like to be my official thesis advisor; we would have to discuss such an arrangement.

  • Name: Juan Miguel Cejuela
  • Email: juanmi@jmcejuela.com
  • Citizenship: Spanish, European Union
  • Location: Munich, Germany
  • Position: MSc Computer Science student at Technical University of Munich.
  • irc, skype, twitter, ...: jmcejuela


Why is it you are interested in machine translation?

As my background & skills show (see below), my work and research so far lead me directly to this. Although I have not yet worked directly in machine translation, I have had a strong interest in it for many years, and now I would love to invest the effort and time of my Master's thesis to finally get my hands dirty with it. I am well acquainted with many tools that are used in machine translation, including transducers, automata, HMMs, grammatical parsers, programming language parsers, text mining, stemmers, string edit distance algorithms, fuzzy logic...

Besides, I am an avid language learner myself and currently speak Spanish, English, and German (apart from programming languages, of course). I find languages fascinating because they frame communication and make it possible: between humans, between computers, and maybe one day between humans and computers. Also, to draw an analogy with computer science, although all languages are Turing complete, in practice different languages convey the same ideas in very different ways, and some languages are better suited to particular concepts. Furthermore, understanding and translating languages well plays a crucial role in the development of this already globalized world.

I want to gain a better understanding of languages in general: how they work, how machines can process or even understand them, and finally how human-like machine translation and natural language processing can become possible. Moreover, I want to continue working on transducers; through my recent work with them I have gained a certain degree of expertise, and I would like to apply them in real applications.


Why is it that you are interested in the Apertium project?

I have only recently come to know the Apertium project and I am still studying it, but my first impressions are very good, because:

  • It's a (medium-size) open source project, and that means open discussions, critical review and thinking by the community, and contributions that reach others beyond this project. I'm sure I will learn a lot.
  • The wiki has so far seemed just great to me. Well documented.
  • I've seen there are people here from several different backgrounds and cultures/countries. It seems like a lot of fun.
  • Technically, I like Apertium's interest in less popular languages, for which less research is available. That means more fun. And yes, easier to publish ;)
  • Being Spanish, it interests me a lot that the project is funded by the Spanish government and is Spain-based.


My only objection is that you don't use statistical learning. Very bad ;) In any case, I have worked little with rule-based systems before, so this project will let me learn and understand that approach. And maybe discover you're right :S Or partially.


Which of the published tasks are you interested in? What do you plan to do?

TODO: Since I am obliged to expand the GSoC project for my thesis, I will try to delimit the two officially as far as I can and present both here.

Include a proposal, including

   * a title,
   * reasons why Google and Apertium should sponsor it,
   * a description of how and who it will benefit in society,
   * and a detailed work plan (including, if possible, a brief schedule with milestones and deliverables).

Include time needed to think, to program, to document and to disseminate.


Background & Skills

As I listed above, I have rich experience with many staple tools of machine translation: transducers, automata, HMMs, grammatical parsers, programming language parsers, text mining, stemmers, string edit distance algorithms, fuzzy logic...

Specifically for transducers, I recently worked in a seminar on EM Training for Weighted Transducers, and I am about to publish a paper describing my novel conversion of that algorithm to log space so that it can be run on a machine in practice. This is not trivial, since the algorithm involves both sums and vector operations.
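To illustrate why sums are the awkward part in log space, here is a minimal Python sketch. It is not the algorithm from the paper mentioned above; it only shows the standard log-sum-exp trick, and the helper names `log_add`, `log_sum`, and `log_dot` are hypothetical names chosen for this example. Products of probabilities become plain additions of logs, but a sum of probabilities has to be computed without ever converting back to (possibly underflowing) linear values.

```python
import math

# Hypothetical helpers for illustration only; not taken from the paper.

def log_add(a, b):
    """Return log(exp(a) + exp(b)) while staying in log space."""
    if a == -math.inf:  # log(0) + x = x
        return b
    if b == -math.inf:
        return a
    hi, lo = (a, b) if a >= b else (b, a)
    # Factor out the larger term so exp() only sees a value <= 0.
    return hi + math.log1p(math.exp(lo - hi))

def log_sum(values):
    """Log of a sum of probabilities given as logs."""
    total = -math.inf  # log(0), the neutral element of log_add
    for v in values:
        total = log_add(total, v)
    return total

def log_dot(log_u, log_v):
    """Log of the dot product of two vectors given element-wise as logs."""
    if len(log_u) != len(log_v):
        raise ValueError("vectors must have the same length")
    # A product in linear space is an addition in log space.
    return log_sum(x + y for x, y in zip(log_u, log_v))
```

For example, `log_sum([-1000.0, -1000.0])` stays finite, whereas the naive `math.log(math.exp(-1000.0) + math.exp(-1000.0))` fails because `math.exp(-1000.0)` underflows to zero.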

As for open source projects, my biggest contributions so far are:

  • CL-HMM: an HMM library in Common Lisp that I wrote from scratch as my bachelor thesis at Aarhus Universitet. The library was going to be used in BioLisp at Berkeley, but unfortunately the group ceased operation in 2009.
  • Small contributions to the Anki project, with some plugins, and some scripts for the sister AnkiDroid project.


For more information, please see my CV/Résumé.


Other Commitments

I have no other important commitments for the following 6 to 7 months, and I will focus entirely on my thesis/project.