User:Aikoniv/GSoC20010Application

From Apertium

This is a WIP. Comments/suggestions/critique would be greatly appreciated.

GSoC Application: Morphology with HFST

General Info

Name: Brian Croom

E-mail address: brian.s.croom@gmail.com

Jabber ID: brian.s.croom@gmail.com

IRC nick: aikoniv@irc.freenode.net

Apertium Wiki: Aikoniv

Why is it you are interested in machine translation?

Two of my greatest passions are human languages and computers. Languages fascinate me because of the paradoxical blend of structured order and ambiguity that is inherent in the system. Ambiguity is unavoidable because people's perceptions of the world are incomplete and constantly filtered through the lens of past experience. At the same time, the human urge to discern and create order is manifest in the many ways in which order has been found and described in human languages on a variety of levels, as studied in fields such as syntax, morphology, and phonology. And yet a complete description of such a system, with its many exceptions, is continuously thwarted by the unpredictability and ingenuity of the humans using it.

On the other hand, computers, implementing a well-defined, describable system, excite another part of my mind, a part that thrives on unambiguity and predictability. This is a self-contained system, able, on a certain level, to be studied in its entirety. The great challenge of machine translation is therefore to reduce the complexity of human languages to the point that they can be dealt with and manipulated by a computer system, while not giving up the sophistication and beauty of the original languages. In my opinion, the intractability of this problem is no reason to ignore it, as the intellectual rewards are great, and the practical benefits are also attractive.

Even a machine translation system that is far from perfection has much to offer society. Language barriers are as big an issue today as they have ever been in hindering fruitful communication between people, and any application of technology that has the potential of lowering some of these barriers is deserving of study.

Why is it that you are interested in the Apertium project?

The biggest draw of Apertium for me is that it is an open-source project: it welcomes anybody to join in improving the platform and broadening the set of languages it can work with, and it makes the results of this work freely available, allowing individuals to exercise their creativity in dreaming up ways in which the technology might be applied to real-world problems. My experiences thus far with the Apertium development community on the mailing list and on IRC have been very positive. I have received quick and helpful answers to my questions and have felt encouraged to pursue further engagement in the project. I understand how important a strong community is to the health of any open-source project, and my interaction with this community has increased my interest in contributing to it.

The work with Esperanto within Apertium is another specific aspect of the project that was key in initially getting me to learn about it. I first heard about Apertium and its application to be a GSoC mentor organization through a posting by Jacob Nordfalk to the mailing list of Ubuntu's Esperanto localization team.

Which of the published tasks are you interested in?

I would like to implement the published task "Text tokenization in HFST".

What do you plan to do?

The idea is to develop a new tool for doing morphological analysis and generation, tentatively named hfst-proc, which integrates well into the Apertium pipeline. This new tool, which will be based on the Helsinki Finite State Toolkit (HFST), will function as much as possible as a drop-in replacement for lt-proc from Apertium's lttoolbox. Key features are thus as follows:

  • Of the modes provided by lt-proc, it will implement at least "analysis" and "generation", and perhaps also "lexical transfer", "post-generation", and "transliteration"
  • It will implement an algorithm (as described below) for tokenizing the input stream while simultaneously performing the morphological analysis. This is in contrast to the functionality of the current hfst-lookup tool, which expects pre-tokenized input on a line-by-line basis.
  • It will work seamlessly with the Apertium Stream Format. This is essential for pipeline integration.
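
To make the stream-format requirement concrete, the following is a minimal Python sketch of how an analyzer's output for one token might be rendered. It assumes the usual shape of the Apertium Stream Format (a lexical unit delimited by ^ and $, readings separated by /, tags in angle brackets, unknown words marked with an asterisk). The function name format_lexical_unit and its arguments are hypothetical and are not part of lt-proc or HFST, and escaping of reserved characters is deliberately left out.

```python
def format_lexical_unit(surface, analyses):
    """Render one token as an Apertium-style lexical unit.

    analyses is a list of (lemma, tags) pairs; an empty list is treated
    as an unknown word. Escaping of reserved characters is not handled.
    """
    if not analyses:
        # Unknown words are echoed back with a leading asterisk.
        return "^{0}/*{0}$".format(surface)
    readings = [
        lemma + "".join("<{}>".format(tag) for tag in tags)
        for lemma, tags in analyses
    ]
    return "^{}/{}$".format(surface, "/".join(readings))


print(format_lexical_unit("cats", [("cat", ["n", "pl"])]))
print(format_lexical_unit("xyzzy", []))
```

Under these assumptions the two calls print ^cats/cat<n><pl>$ and ^xyzzy/*xyzzy$ respectively.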

Project Motivation, or, why should Google and Apertium sponsor this project?

This project will provide Apertium with a new module which will allow it to handle additional languages whose morphology is too complicated for lttoolbox to deal with. There is data freely available in HFST-compatible form which will be accessible for creating new Apertium language pairs. And more immediately, the sme-nob language pair in the incubator will no longer require pipeline hacks to coerce the current HFST tools to play nice with Apertium.

The ability to use HFST-style data in Apertium will greatly reduce the barrier to the creation of new language pairs for languages where such data is already available. This, in turn, broadens the usefulness of Apertium as an open-source platform for machine translation, a tool which can be of benefit to many people in need of translation software, especially for the more exotic languages which Apertium specializes in.

Work Plan

  • Community Bonding Period
Spend time in #apertium getting to know my mentors and getting more familiar with the community. Read whatever relevant documentation exists, and explore the Apertium and HFST source code more deeply to understand how to interface with the HFST library. Keep notes in the Wiki.
  • Week 1
Perform any refactoring needed in the HFST library to enable working with the transducer code on a symbol-by-symbol basis as required by the tokenize-as-you-analyze algorithm.
  • Week 2-3
Starting with skeleton code for a command-line utility with lt-proc as a base, develop an implementation of a morphological analyzer that works character-by-character on an input stream, utilizing the tokenize-as-you-analyze algorithm and the HFST library.
  • Week 4
Finish implementation of the analyzer, ensuring correct handling of the Apertium Stream Format.
  • Deliverable: hfst-proc tool with functional mode for morphological analysis
  • Week 5-6
Begin implementing the tool's generation mode.
  • Week 7
Complete generation mode. Examine the feasibility of reimplementing other features available in lt-proc, such as the lexical transfer and transliteration modes, dictionary case, and null flush.
  • Deliverable: hfst-proc tool with functional modes for morphological analysis and generation
  • Week 8-9
Implement the other (smaller-scale) features that were deemed reasonable for this time period.
  • Week 10-12
Focused testing phase, identifying and squashing bugs, cleaning up code and ensuring that the code documentation is sufficient and accurate. A key element of this testing is making sure that the Apertium stream format, superblanks, and capitalization are handled properly (i.e. the way lt-proc does) in all modes
  • Project completed

Work Items

  • Determine if any necessary features are lacking from the HFST API
    • Add any missing features to the API and hook them up with the OpenFST backend
  • Create skeleton code for the new tool by stripping down lt-proc
  • Implement morphological analysis by the tokenize-as-you-analyze algorithm with HFST
  • Add support for the Apertium Stream Format etc
  • Implement morphological generation with HFST
  • Add support for dictionary case
  • Add support for null flush

Development Notes

  • As a mechanism for accountability and to facilitate feedback from my mentors and any others interested, I plan to do development in a publicly-accessible repository, preferably within either the Apertium or HFST source trees so that the code remains in close proximity to the projects it is dealing with.
  • Documentation will mostly take the form of code commented as it is written, with particular emphasis on explaining the algorithms used.
  • At the recommendation of HFST developer Flammie, the development version of HFST, located in the hfst3_proposition svn branch, will be used as the base for the tool. Using it instead of the current hfst2 has two disadvantages: usage examples are scarce, since none of the HFST tools have yet been ported to the new API, and feature regressions relative to hfst2 are possible. However, the new API is stable enough at this point to serve as the foundation for this project. The advantages of the development branch include the availability of Foma as an additional backend option and a much simpler system for taking advantage of the different backends.
It is possible that a toolkit feature will be required for this project that is not yet available in the new API, so it may be necessary to extend the API, and then implement access to that feature for the backends. In that case, the OpenFST backend will be considered the target for the purposes of this project. This was suggested by Flammie.
  • It remains to be investigated how the new tool will deal with the lttoolbox concept of "inconditionals" which are used for dealing with e.g. punctuation in the input stream. This will be looked into during the community bonding period.
  • If no unexpected difficulties arise in the development process, it is very possible that work will proceed considerably faster than estimated in the work plan. If this occurs and the tool is completed with significant time remaining, the tentative plan is to begin work on features desired for the sme-nob language pair. This makes sense considering that this is the language pair for which the main project has the most immediate benefit, and that Francis Tyers, who will be one of my mentors, has great interest in this language pair. The details of this backup project can only be worked out once the main project has been successfully completed and the time remaining in the summer is known.

Tokenize-as-you-analyze

The following is pseudocode for the algorithm described here and also implemented by lttoolbox in fst_processor.cc, FSTProcessor::analysis (lines 807-1045). This algorithm will also be the basis of the lexical analysis functionality in hfst-proc.

surface_form = ""
lexical_forms = ""
last_stream_location = 0

while in_stream.has_more_chars()
   
   next_char = in_stream.get_next_char()
   
   // if any of the partial outputs in the transducer correspond to a valid word
   if transducer_state.are_finals()
       // save all current possible transductions
       lexical_forms = transducer_state.get_finals()
       last_stream_location = in_stream.get_pos()

   // update all partial transductions (SPO) based on the next char of the input
   transducer_state.step(next_char)
   
   if transducer_state.num_partial_transductions() > 0
       surface_form.append(next_char)
   else
       if surface_form == "" and not is_in_alphabet(next_char)
           out_stream.put_char(next_char)
       // elseif we don't have a valid transduction
       elseif lexical_forms == "" or is_in_alphabet(in_stream.char_at(last_stream_location))   
           if is_in_alphabet(next_char)
               // advance the input stream to the end of the word currently being read
               while is_in_alphabet(in_stream.peek_next_char())
                   in_stream.get_next_char()
               
               if not is_in_alphabet(surface_form.char_at(0))
                    out_stream.put_char(surface_form.char_at(0))
                    // back up the input stream so that we have consumed
                    // only a single non-alphabetic character
                   in_stream.move_back(surface_form.length())
               else
                   word_length = first_nonalphabetic(surface_form)
                   write_unknown_word(out_stream, surface_form.substr(0, word_length))
                   in_stream.move_back(surface_form.length()-word_length)
       else    // there are one or more valid transductions
           write_word(out_stream, 
                surface_form.substr(0, in_stream.get_pos()-last_stream_location),
               lexical_forms)
       
       // reset the transducer
       transducer_state.reset_to_initial()
       surface_form = lexical_forms = ""
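
The core idea of the pseudocode above can be tried out with a much simplified, runnable Python sketch. It is not the lttoolbox or HFST implementation: a plain dictionary of surface forms stands in for the transducer, str.isalpha() stands in for the transducer's alphabet, only the longest match is checked against a word boundary, and all names (analyze, lexicon) are hypothetical.

```python
def analyze(text, lexicon):
    """Tokenize and look up words in a single left-to-right pass.

    lexicon maps surface forms (which may contain spaces) to lists of
    lexical forms; it is a toy stand-in for a finite-state transducer.
    """
    # Every prefix of a known surface form counts as a live partial match.
    prefixes = {form[:i] for form in lexicon for i in range(1, len(form) + 1)}
    out = []
    pos = 0
    while pos < len(text):
        if not text[pos].isalpha():
            # Characters outside the alphabet are copied through unchanged.
            out.append(text[pos])
            pos += 1
            continue
        end = pos
        last_final = None
        # Step one character at a time while some partial match survives,
        # remembering the last position at which a complete form was seen.
        while end < len(text) and text[pos:end + 1] in prefixes:
            end += 1
            if text[pos:end] in lexicon:
                last_final = end
        at_boundary = last_final is not None and (
            last_final == len(text) or not text[last_final].isalpha()
        )
        if at_boundary:
            surface = text[pos:last_final]
            out.append("^{}/{}$".format(surface, "/".join(lexicon[surface])))
            pos = last_final
        else:
            # Unknown word: consume the whole alphabetic run and flag it.
            end = pos
            while end < len(text) and text[end].isalpha():
                end += 1
            out.append("^{0}/*{0}$".format(text[pos:end]))
            pos = end
    return "".join(out)


toy_lexicon = {
    "in": ["in<pr>"],
    "in front of": ["in front of<pr>"],
    "cats": ["cat<n><pl>"],
}
print(analyze("in front of cats!", toy_lexicon))
# prints: ^in front of/in front of<pr>$ ^cats/cat<n><pl>$!
```

Because matching continues across the spaces inside "in front of", the multiword entry wins over the shorter "in", which is the behavior that pre-tokenized, line-by-line lookup cannot provide.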

Other Commitments

My classes run through the middle of the community bonding period, but after the end of the semester I have very few other commitments for this summer. Provided I am selected to work on this project, I will not seek any additional employment, nor am I applying for any internships or taking classes. I expect to travel to visit some friends and family, but always with the understanding that I will continue to work full-time or near full-time on this project, and I will continue to have consistent Internet access during those times.

Biography

I am a 21-year-old undergraduate student in my fourth year at Wheaton College in Illinois, USA, working towards my B.A. with majors in German and Ancient Languages (read: Greek, Latin, and Biblical Hebrew), and a minor in Computer Science. I have been programming since I was twelve years old, when I taught myself the rudiments of C++ with the help of a book from my public library. In the meantime I have spent countless hours improving my software development skills and learning how to make computers do what I want them to. I picked up Java when I began taking classes at my college, taught myself Python in 2008 as a way to broaden my language experience, and over the past few years have become comfortable with Linux as a development environment and with the many tools readily available there, such as Bash scripting and sed, as well as multiple revision-control systems. In addition to my CS classwork, where I have learned much about software development and working on projects as part of a team, I have worked on a freelance software project each of the past two summers, one of which was a tool for testing students' knowledge of French, Spanish, and German verb conjugation. I love programming because it gives me the opportunity to examine and work on interesting problem-solving challenges.

Since I first was introduced to Linux by a friend in 2006, Free Software has also become an important part of my life. I had been using it for years in the form of GCC etc. without being aware of it, but after installing Ubuntu on my laptop I quickly began to learn about FLOSS, its history, its philosophy, its incredible accomplishments, and its goals for future development. I have become convinced of the value of open-source software and its importance in the software industry. To this point my open-source contributions have not been in the form of actual code, but I am eager to change that and gain practical experience interacting with the community around an open-source project.

I am a language nerd who discovered this interest during high school after stumbling upon a website promising to teach J.R.R. Tolkien's Sindarin language. After realizing that I thoroughly enjoyed learning about this language's grammar, I began to avidly research foreign languages and linguistics. I am now fluent in German, and have studied numerous other languages to various extents, including Spanish, Esperanto, Swedish, Old English, Classical Greek, Latin, and Hebrew. I like to consider myself an amateur linguist, and expect to pursue a Master's degree eventually in a linguistics-related field.

Conclusion

Thank you for considering my proposal. I look forward to working with you to improve Free Software. Thanks to everyone at Google and with Apertium involved in this year's Summer of Code!