User:Aikoniv/GSoC20010Application

This is a WIP. Comments/suggestions/critique would be greatly appreciated.

GSoC Application: Morphology with HFST

General Info

Name: Brian Croom

E-mail address: brian.s.croom@gmail.com

Jabber ID: brian.s.croom@gmail.com

IRC nick: aikoniv@irc.freenode.net

Apertium Wiki: Aikoniv

Why is it you are interested in machine translation?

Two of my greatest passions are human languages and computers. Languages fascinate me because of the paradoxical blend of structured order and ambiguity that is inherent in the system. Ambiguity is unavoidable due to the way in which people's perceptions of the world are incomplete and are constantly filtered through the lens of past experience. At the same time, the urge of humans to discern and create order in their environment is manifest in the many ways in which order has been found and described in human languages on a variety of levels, as studied in fields such as syntax, morphology, and phonology. And yet a complete description of such a system with its many exceptions is continuously thwarted by the unpredictability and ingenuity of the humans using the system.

On the other hand, computers, implementing a well-defined, describable system, excite another part of my mind, a part that thrives on unambiguity and predictability. This is a self-contained system, able, on a certain level, to be studied in its entirety. The great challenge of machine translation is therefore to reduce the complexity of human languages to the point that they can be dealt with and manipulated by a computer system, while not giving up the sophistication and beauty of the original languages. In my opinion, the intractability of this problem is no reason to ignore it, as the intellectual rewards are great, and the practical benefits are also attractive.

Even a machine translation system that is far from perfection has much to offer society. Language barriers are as big an issue today as they have ever been in hindering fruitful communication between people, and any application of technology that has the potential of lowering some of these barriers is deserving of study.

Why is it that you are interested in the Apertium project?

The biggest draw of Apertium for me is that it is an open-source project, welcoming anybody to join in improving the platform and broadening the set of languages which it can work with, and making the results of this work freely available, allowing individuals to exercise their creativity in dreaming up ways in which the technology might be applied to real-world problems. My experiences thus far with the Apertium development community on the mailing list and on IRC have been very positive. I have received quick and helpful answers to my questions and have felt encouraged to pursue further engagement in the project. I understand how important a strong community is to the health of any open-source project, and my interaction with this community has increased my interest in contributing to it.

The work with Esperanto within Apertium is another specific aspect of the project that was key in initially getting me to learn about it. I first heard about Apertium and its application to become a GSoC mentor organization through a posting by Jacob Nordfalk to the mailing list of Ubuntu's Esperanto localization team.

Which of the published tasks are you interested in?

I would like to implement the published task "Text tokenization in HFST".

What do you plan to do?

The idea is to develop a new tool for doing morphological analysis and generation, tentatively named hfst-proc, which integrates well into the Apertium pipeline. This new tool, which will be based on the Helsinki Finite State Toolkit (HFST), will function as much as possible as a drop-in replacement for lt-proc from Apertium's lttoolbox. Key features are thus as follows:

  • Of the modes provided by lt-proc, it will implement at least "analysis" and "generation", and perhaps also "lexical transfer", "post-generation", and "transliteration"
  • It will implement an algorithm (as described here) for tokenizing the input stream while simultaneously performing the morphological analysis. This is in contrast to the functionality of the current hfst-lookup tool, which expects pre-tokenized input on a line-by-line basis.
  • It will work seamlessly with the Apertium Stream Format. This is essential for pipeline integration.
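
To make the tokenize-while-analyzing idea more concrete, here is a minimal sketch in Python. It is not the algorithm referenced above and it does not use the HFST library: a toy lexicon (LEXICON) and a character trie stand in for a compiled transducer, and the names build_trie and analyse are hypothetical. It assumes a greedy left-to-right longest-match strategy, wraps each token in the Apertium Stream Format (^surface/analysis$), and marks unknown words with an asterisk. Superblanks, escaping, and capitalization handling are deliberately left out.

```python
from typing import Dict, List

# Hypothetical toy lexicon standing in for a compiled transducer:
# surface form -> analyses written in Apertium tag notation.
LEXICON: Dict[str, List[str]] = {
    "cat": ["cat<n><sg>"],
    "cats": ["cat<n><pl>"],
    "in": ["in<pr>"],
    "in front of": ["in front of<pr>"],
    "front": ["front<n><sg>"],
    "of": ["of<pr>"],
    "the": ["the<det><def><sp>"],
}


def build_trie(lexicon: Dict[str, List[str]]) -> dict:
    """Build a character trie; the key None at a node stores its analyses."""
    root: dict = {}
    for surface, analyses in lexicon.items():
        node = root
        for ch in surface:
            node = node.setdefault(ch, {})
        node[None] = analyses
    return root


def is_word_char(ch: str) -> bool:
    return ch.isalnum()


def analyse(text: str, trie: dict) -> str:
    """Tokenize and analyse text in a single left-to-right pass."""
    out: List[str] = []
    i, n = 0, len(text)
    while i < n:
        if not is_word_char(text[i]):
            out.append(text[i])  # blanks and punctuation pass through unchanged
            i += 1
            continue
        # Advance one symbol at a time, remembering the longest match
        # that ends at a token boundary (multiword entries included).
        node, j = trie, i
        best_end, best = -1, None
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            at_boundary = j == n or not is_word_char(text[j])
            if None in node and at_boundary:
                best_end, best = j, node[None]
        if best is not None:
            surface = text[i:best_end]
            out.append("^" + surface + "/" + "/".join(best) + "$")
            i = best_end
        else:
            # Unknown word: consume up to the next non-word character.
            j = i
            while j < n and is_word_char(text[j]):
                j += 1
            word = text[i:j]
            out.append("^" + word + "/*" + word + "$")
            i = j
    return "".join(out)


TRIE = build_trie(LEXICON)
print(analyse("the cats in front of the dog", TRIE))
```

Because the tool itself decides where each token ends, the multiword entry "in front of" is recognized as a single unit without any pre-tokenization step, which is the behaviour that a line-by-line lookup tool cannot provide on its own.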

Project Motivation, or, why should Google and Apertium sponsor this project?

This project will provide Apertium with a new module which will allow it to handle additional languages whose morphology is too complicated for lttoolbox to deal with. There is data freely available in HFST-compatible form which will be accessible for creating new Apertium language pairs. And more immediately, the sme-nob language pair in the incubator will no longer require pipeline hacks to coerce the current HFST tools to play nice with Apertium.

The ability to use HFST-style data in Apertium will greatly reduce the barrier to the creation of new language pairs for languages where such data is already available. This, in turn, broadens the usefulness of Apertium as an open-source platform for machine translation, a tool which can be of benefit to many people in need of translation software, especially for the more exotic languages which Apertium specializes in.

Work Plan

  • Community Bonding Period
Spend time in #apertium getting to know my mentors and getting more familiar with the community. Read whatever relevant documentation exists, and explore the Apertium and HFST source code more deeply to acquire an understanding of how to interface with the HFST library. Keep notes in the Wiki.
  • Week 1
Perform any refactoring needed in the HFST library to enable working with the transducer code on a symbol-by-symbol basis as required by the tokenize-as-you-analyze algorithm.
  • Week 2-3
Starting with skeleton code for a command-line utility with lt-proc as a base, develop an implementation of a morphological analyzer that works character-by-character on an input stream, utilizing the tokenize-as-you-analyze algorithm and the HFST library.
  • Week 4
Finish implementation of analyzer, ensuring correct handling of the Apertium Stream Format
  • Deliverable: hfst-proc tool with functional mode for morphological analysis
  • Week 5-6
Begin implementing the tool's generation mode
  • Week 7
Complete generation mode, examine the feasibility of reimplementing other features available in lt-proc such as lexical transfer and transliteration modes, dictionary case, and null flush
  • Deliverable: hfst-proc tool with functional modes for morphological analysis and generation
  • Week 8-9
Implement other (smaller-scale) features that were deemed reasonable for this time period
  • Week 10-12
Focused testing phase, identifying and squashing bugs, cleaning up code and ensuring that the code documentation is sufficient and accurate. A key element of this testing is making sure that the Apertium stream format, superblanks, and capitalization are handled properly (i.e. the way lt-proc does) in all modes
  • Project completed

Development Notes

  • As a mechanism for accountability and to facilitate feedback from my mentors and any others interested, I plan to do development in a publicly-accessible repository, preferably within either the Apertium or HFST source trees so that the code remains in close proximity to the projects it is dealing with.
  • Documentation will mostly take the form of code commented as it is written, with particular emphasis on explaining the algorithms used.
  • If no unexpected difficulties arise in the development process, it is very possible that work will proceed considerably faster than estimated in the work plan. If this occurs and the tool is completed with significant time remaining, the tentative plan is to begin work on features desired for the sme-nob language pair. This makes sense considering that this is the language pair for which the main project has the most immediate benefit, and that Francis Tyers, who will be one of my mentors, has great interest in this language pair.

Other Commitments

My classes run through the middle of the community bonding period, but after the end of the semester I have very few other commitments for this summer. Provided I am selected to work on this project, I will not seek any additional employment, nor am I applying for any internships or taking classes. I expect to travel to visit some friends and family, but always with the understanding that I will continue to work full-time or near full-time on this project, and I will continue to have consistent Internet access during those times.

Biography

I am a 21-year-old undergraduate student in my fourth year at Wheaton College in Illinois, USA, working towards my B.A. with majors in German and Ancient Languages (read: Greek, Latin, and Biblical Hebrew), and a minor in Computer Science. I have been programming since I was twelve years old, when I taught myself the rudiments of C++ with the help of a book from my public library. Since then I have spent countless hours improving my software development skills and learning how to make computers do what I want them to. I picked up Java when I began taking classes at my college, taught myself Python in 2008 as a way to broaden my language experience, and over the past few years have become comfortable with Linux as a development environment and with the many tools readily available there, such as Bash scripting and sed, as well as multiple revision-control systems. In addition to my CS classwork, where I have learned much about software development and working on projects as part of a team, I have worked on a freelance software project each of the past two summers, one of which was a tool for testing students' knowledge of French, Spanish, and German verb conjugation. I love programming because it gives me the opportunity to examine and work on interesting problem-solving challenges.

Since I first was introduced to Linux by a friend in 2006, Free Software has also become an important part of my life. I had been using it for years in the form of GCC etc. without being aware of it, but after installing Ubuntu on my laptop I quickly began to learn about FLOSS, its history, its philosophy, its incredible accomplishments, and its goals for future development. I have become convinced of the value of open-source software and its importance in the software industry. To this point my open-source contributions have not been in the form of actual code, but I am eager to change that and gain practical experience interacting with the community around an open-source project.

I am a language nerd who discovered this interest during high school after stumbling upon a website promising to teach J.R.R. Tolkien's Sindarin language. After realizing that I thoroughly enjoyed learning about this language's grammar, I began to avidly research foreign languages and linguistics. I am now fluent in German, and have studied numerous other languages to various extents, including Spanish, Esperanto, Swedish, Old English, Classical Greek, Latin, and Hebrew. I like to consider myself an amateur linguist, and expect to pursue a Master's degree eventually in a linguistics-related field.

Conclusion

Thank you for considering my proposal. I look forward to working with you to improve Free Software. Thanks to everyone at Google and with Apertium involved in this year's Summer of Code!