User:Jmcejuela/GSoC11Application

From Apertium

I am a Master Computer Science student at the Technical University of Munich (TUM), currently in my fourth and final semester and about to start my Master Thesis. As I announced on the mailing list, my intention is to combine my thesis and the GSoC project into the same endeavor (this is possible both on my university's side and on Google's). I desire such a combination because I want to do both, but given their complete overlap and the European/German academic calendar, it would be very difficult to do them independently, as both require full-time commitment.

I have a solid background in transducers and their mathematical foundations, and working extensively on transducers is my strongest motivation for this project. Since I come from a more statistical training/learning world while Apertium is rule-based, and since my thesis must expand the GSoC project to meet a master thesis's higher effort and academic requirements (exactly 6 months at TUM), I expand and elaborate on an idea discussed with Jimregan: using transducers as a replacement for flag diacritics, as used in HFST, together with an automatic topology learning component to generate such transducers.

If you accepted my proposal, the organization of such a combined thesis/project would probably be as follows. Unfortunately, the TUM advisor I had previously considered cannot supervise this thesis due to lack of time. That means that, unless I can find another advisor at TUM, and if you are willing, I would need both a mentor for the GSoC project and an official advisor for my thesis.


I will be happy to discuss the details and hear your opinion on the organization of this thesis/project.


  • Name: Juan Miguel Cejuela
  • Email: juanmi@jmcejuela.com
  • Citizenship: Spanish, European Union
  • Location: Munich, Germany
  • Position: MSc Computer Science student at Technical University of Munich.
  • irc, skype, twitter, ...: jmcejuela


Why is it you are interested in machine translation?[edit]

As my background & skills show (see below), I have followed a line of work and research that leads directly to this. Although I have not yet worked directly in machine translation, I have had a strong desire for it for many years, and now I would love to invest the effort and time of my master thesis to finally get my hands dirty with it. I am well acquainted with many tools used in machine translation, including transducers, automata, HMMs, grammatical parsers, programming-language parsers, text mining, stemmers, string edit distance algorithms, fuzzy logic...

Besides, I am an avid language learner myself and currently speak Spanish, English, and German (apart from programming languages, of course). I find languages fascinating because they frame and make possible communication: between humans, between computers, and maybe one day between humans and computers. Also, although (to borrow an analogy from computer science) all natural languages are in a sense equally expressive, in practice conveying the same ideas differs greatly between languages, and some languages are better suited to particular concepts. Furthermore, the proper understanding and translation of languages plays a crucial role in the development of this already globalized world.

I want to gain a better understanding of languages in general: how they work, how machines can process or even understand them, and finally how human-like machine translation and natural language processing might become possible. Moreover, I want to continue working on transducers; through my recent work with them I have gained a certain degree of expertise, and I would like to apply them in real applications.


Why is it that you are interested in the Apertium project?[edit]

I learned about the Apertium project only recently and am still studying it, but my first impressions are very good. Because:

  • It's a (medium-sized) open source project, which means open discussions, critical community review, and contributions that reach beyond this project. I'm sure I will learn a lot.
  • So far the wiki seems great to me: well documented.
  • I've seen there are people here from several different backgrounds and cultures/countries. It seems like a lot of fun.
  • Technically, I like Apertium's interest in less popular languages for which less research is available. That means more fun. And yes, easier publishing ;)
  • Being Spanish, I find it very appealing that the project is funded by the Spanish government and is based in Spain.


My only concern is that you don't use much statistical learning (though Jimmy O'Regan recently pointed me to Felipe Sánchez Martínez's statistical machine translation work). I've worked mostly with statistical methods and little with rule-based systems; I hope to gain a better understanding of this approach.


Which of the published tasks are you interested in? What do you plan to do?[edit]

Note that, since I am obliged to expand the GSoC project for my master thesis, I try to delimit precisely what I will do for each.


Project Name: Transducers as flag diacritics and their topology learning


This proposal stems from (1) the project suggested by Jimmy O'Regan in an IRC conversation as a better and desired alternative to the listed GSoC 2011 idea Flag diacritics in lttoolbox, using second-level transducers instead, and (2) my own expanded proposal to study and use topology-learned transducers to serve as such second-level transducers.


Short Description[edit]

The implementation of a module to handle languages with infix inflection (and possibly other forms) by stopping useless and invalid continuation computations according to defined constraints, as flag diacritics do (for example in the HFST platform), but using instead the novel approach of a second (or n-th) level of transducers. This approach is desirable because transducers may provide greater power, expressiveness, and flexibility.


Description[edit]

The HFST platform uses flag diacritics to remove and stop the computation of illegal compounds, thus providing better handling of languages with infix inflection. The Apertium project aims to handle such languages appropriately and, as documented in the original idea, planned to use flag diacritics as well.

Jimmy O'Regan suggests, however, that a better approach would be to use a second level of cascaded transducers to process the same continuations and, according to the transducers' decision, either reject the input (pruning the states of the otherwise continued computation and finally emitting an epsilon symbol, as flag diacritics defined by constraint rules do) or accept the input and output the transducer's computation.

The state-pruning module is already implemented in lttoolbox-java by Jacob Nordfalk, so my work will consist of (1) the design and implementation of such a second-level layer of transducers (by nature cascadable to n levels, if considered sensible), (2) the consequent changes to the FST compilation code and the pipeline for this other level of transducers, (3) verification and validation of the implementation on a sample language, within a limited scope, and (4) documentation of the development.
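To make the intended behavior of this second-level layer concrete, here is a minimal sketch in Java (the project's main language): a constraint acceptor that either passes a morphological analysis through or signals that the computation should be pruned, which is the role flag diacritics play in HFST. All class and method names here are illustrative assumptions of mine, not part of the lttoolbox-java API.

```java
import java.util.*;

/**
 * Illustrative sketch of a second-level "constraint transducer": a finite-state
 * acceptor over analysis symbols that either lets an analysis through or
 * rejects it, standing in for flag-diacritic constraint checks. Names are
 * hypothetical, not lttoolbox-java API.
 */
class ConstraintFilter {
    // transitions.get(state).get(symbol) -> next state
    private final Map<Integer, Map<Character, Integer>> transitions = new HashMap<>();
    private final Set<Integer> accepting = new HashSet<>();

    void addTransition(int from, char symbol, int to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(symbol, to);
    }

    void setAccepting(int state) { accepting.add(state); }

    /**
     * Returns the analysis unchanged if the constraint accepts it, or
     * Optional.empty() to signal that this continuation should be pruned.
     */
    Optional<String> filter(String analysis) {
        int state = 0; // state 0 is the start state
        for (char c : analysis.toCharArray()) {
            Map<Character, Integer> out = transitions.get(state);
            if (out == null || !out.containsKey(c)) return Optional.empty();
            state = out.get(c);
        }
        return accepting.contains(state) ? Optional.of(analysis) : Optional.empty();
    }

    public static void main(String[] args) {
        // Toy constraint over {a, b}: accept only strings ending in 'b',
        // standing in for "the infix pattern must be closed properly".
        ConstraintFilter f = new ConstraintFilter();
        f.addTransition(0, 'a', 0);
        f.addTransition(0, 'b', 1);
        f.addTransition(1, 'a', 0);
        f.addTransition(1, 'b', 1);
        f.setAccepting(1);
        System.out.println(f.filter("aab")); // accepted, passes through
        System.out.println(f.filter("aba")); // rejected, would be pruned
    }
}
```

In the real module, a full transducer (with output) rather than a bare acceptor would sit at this level, so an accepted input could also be rewritten rather than merely passed through.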


Then, more exclusively for my master thesis and with a higher research and scientific scope, I propose to study transducer topology learning to construct such n-level transducers, working with learning corpora and mostly using the OSTIA state-merging algorithm.
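As a rough illustration of where such topology learning starts, the following Java sketch builds the prefix-tree acceptor that state-merging learners such as OSTIA take as input; the actual OSTIA algorithm additionally attaches outputs to edges and then merges compatible states. All names here are my own illustrative assumptions, not an existing API.

```java
import java.util.*;

/**
 * Sketch of a prefix-tree acceptor (PTA): every positive sample is laid out
 * as a path from the root, sharing common prefixes. State-merging learners
 * start from this structure and generalize by merging compatible states.
 */
class PrefixTreeAcceptor {
    // transitions.get(state).get(symbol) -> next state; state 0 is the root
    private final List<Map<Character, Integer>> transitions = new ArrayList<>();
    private final Set<Integer> accepting = new HashSet<>();

    PrefixTreeAcceptor() { transitions.add(new HashMap<>()); }

    /** Adds one positive sample, creating states along its path as needed. */
    void addSample(String word) {
        int state = 0;
        for (char c : word.toCharArray()) {
            Integer next = transitions.get(state).get(c);
            if (next == null) {
                next = transitions.size();
                transitions.add(new HashMap<>());
                transitions.get(state).put(c, next);
            }
            state = next;
        }
        accepting.add(state);
    }

    /** True iff the word's path exists and ends in an accepting state. */
    boolean accepts(String word) {
        int state = 0;
        for (char c : word.toCharArray()) {
            Integer next = transitions.get(state).get(c);
            if (next == null) return false;
            state = next;
        }
        return accepting.contains(state);
    }

    int stateCount() { return transitions.size(); }

    public static void main(String[] args) {
        PrefixTreeAcceptor pta = new PrefixTreeAcceptor();
        pta.addSample("ab");
        pta.addSample("ac");
        // "ab" and "ac" share the prefix "a": root + 'a' + 'b' + 'c' = 4 states
        System.out.println(pta.stateCount());
    }
}
```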


The programming language used will be mostly Java, with C++ where required.


Contribution[edit]

  • The implementation of a module for the flexible handling of languages with infix inflection (and possibly other types), making it possible to avoid the potentially endless enumeration of non-regular rules in a dictionary when working with more complex forms of inflection.
  • The novelty, as far as I know, of the use of transducers to tackle such a problem.
  • Altogether, a better support for languages with more complex forms of inflection.


Work Plan: Timeline[edit]

I mostly consider here the exact and specific work for the GSoC project, not the thesis's. The following is an estimate:


  • Community Bonding Period: April 25 - May 23:
  • get to know Apertium and its internals well; get to know the community; learn the work & code standards; review C++; study Apertium's architecture and code
  • study the lttoolbox-java
  • study problem to solve; study flag diacritics approach; design plan solution for the problem


  • Start: transducers as flag diacritics
  • Week 1-3: code implementation; parallel testing & documentation
  • Week 4: code implementation; start of more formal testing
  • Deliverable #1: module ready for the lttoolbox-java & documentation


  • Verification on target language
  • Week 5: apply module to some specific language and validate correctness
  • Deliverable #2: positive formal verification results


  • Topology Learning
  • Week 6: research state-of-the-art methods and algorithms; design the solution; finalize the corpus to which the learning method will be applied
  • Week 7-10: implement, review, test, and document the topology learning algorithm
  • Week 11: verify algorithm
  • Deliverable #3: topology learning algorithm module ready & documentation


  • Set & Run
  • Week 12: (1) integrate the transducer/flag-diacritics module into the official program; (2) integrate the topology learning module into the official program


  • Project Completed: Deliverable #4: verified & delivered code & documentation



Background & Skills[edit]

As I've listed in the first section I have a rich experience with multiple staple tools used for machine translation: transducers, automata, HMMs, grammatical parsers, programming languages parsers, text mining, stemmers, string edit distance algorithms, fuzzy logic...

Specifically regarding transducers, I recently worked on a seminar on EM Training for Weighted Transducers and am about to publish a paper describing my novel conversion of that algorithm to log space, so that it can run on real machines in practice. This is not trivial, since the algorithm involves both sums and vector operations.
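The core difficulty such conversions address is that sums of probabilities are not directly available in log space; the standard remedy is the numerically stable log-sum-exp operation. The following small Java sketch (names are mine, purely illustrative of the technique, not of the paper's algorithm) shows that trick:

```java
/**
 * The log-sum-exp trick: compute log(exp(a) + exp(b)) without underflow,
 * the basic building block for running EM-style sums in log space.
 */
class LogSpace {
    static double logAdd(double a, double b) {
        // NEGATIVE_INFINITY represents log(0), the additive identity.
        if (a == Double.NEGATIVE_INFINITY) return b;
        if (b == Double.NEGATIVE_INFINITY) return a;
        double max = Math.max(a, b);
        // Factor out the larger term so the exponent stays <= 0.
        return max + Math.log1p(Math.exp(Math.min(a, b) - max));
    }

    public static void main(String[] args) {
        // log(0.3) + log-space-plus log(0.2) should give log(0.5)
        double s = logAdd(Math.log(0.3), Math.log(0.2));
        System.out.println(Math.exp(s)); // approximately 0.5
    }
}
```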

As for open source projects, my biggest contributions are so far:

  • CL-HMM: an HMM library in Common Lisp, written from scratch by me as my bachelor thesis at Aarhus Universitet. The library was going to be used in BioLisp at Berkeley, but unfortunately that group has been inactive since 2009.
  • Small contributions to the Anki project, with some plugins, and some scripts for its sister project AnkiDroid.


For more information, please see my CV/Résumé.


Other Commitments[edit]

I have no other important commitments for the following 6-7 months and will focus entirely on my thesis/project.