Difference between revisions of "User:Aikoniv/GSoC20010Application"

Revision as of 21:58, 30 March 2010

This is a WIP

Name: Brian Croom

E-mail address: brian.s.croom@gmail.com
Jabber ID: brian.s.croom@gmail.com
IRC nick: aikoniv


Why is it you are interested in machine translation?

Two of my greatest passions are human languages and computers. Languages fascinate me because of the paradoxical blend of structured order and ambiguity that is inherent in the system. Ambiguity is unavoidable because people's perceptions of the world are incomplete and are constantly filtered through the lens of past experience. At the same time, the human urge to discern and create order is manifest in the many ways in which order has been found and described in human languages on a variety of levels, as studied in fields such as syntax, morphology, and phonology. And yet a complete description of such a system, with its many exceptions, is continually thwarted by the unpredictability and ingenuity of the humans using it.

On the other hand, computers, implementing a well-defined, describable system, excite another part of my mind, a part that thrives on unambiguity and predictability. This is a self-contained system, able, on a certain level, to be studied in its entirety. The great challenge of machine translation is therefore to reduce the complexity of human languages to the point that they can be dealt with and manipulated by a computer system, while not giving up the sophistication and beauty of the original languages. In my opinion, the intractability of this problem is no reason to ignore it, as the intellectual rewards are great, and the practical benefits are also attractive.

Even a machine translation system that is far from perfection has much to offer society. Language barriers are as big an issue today as they have ever been in hindering fruitful communication between people, and any application of technology that has the potential of lowering some of these barriers is deserving of study.


Why is it that you are interested in the Apertium project?

The biggest draw of Apertium for me is that it is an open-source project. It welcomes anybody to join in improving the platform and broadening the set of languages it can work with, and it makes the results of this work freely available, so that individuals can exercise their creativity in dreaming up ways to apply the technology to real-world problems. My experiences thus far with the Apertium development community on the mailing list and on IRC have been very positive. I have received quick and helpful answers to my questions and have felt encouraged to pursue further engagement in the project. I understand how important a strong community is to the health of any open-source project, and my interaction with this community has increased my interest in contributing to it.

The work with Esperanto within Apertium is another specific aspect of the project that was key in initially getting me to learn about it. I first heard about Apertium, and about its application to become a GSoC mentor organization, through a posting by Jacob Nordfalk to the mailing list of Ubuntu's Esperanto localization team.


Which of the published tasks are you interested in?

I would like to implement the task "Text tokenization in HFST".


What do you plan to do?

The idea is to develop a new tool for doing morphological analysis and generation, tentatively named hfst-proc, which integrates well into the Apertium pipeline. This new tool, which will be based on the Helsinki Finite State Toolkit (HFST) [***], will function as much as possible as a drop-in replacement for lt-proc from Apertium's lttoolbox. Key features are thus as follows:

  • Of the modes provided by lt-proc, it will implement at least "analysis" and "generation", and perhaps also "lexical transfer", "post-generation", and "transliteration".
  • It will implement an algorithm (as described here [***]) for tokenizing the input stream while simultaneously performing the morphological analysis. This is in contrast to the functionality of the current hfst-lookup tool, which expects pre-tokenized input on a line-by-line basis.
  • It will work seamlessly with the Apertium Stream Format. This is essential for pipeline integration.
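
To make the second and third points concrete, here is a rough sketch of what tokenizing and analysing in a single pass could look like. It is only an illustration under stated assumptions: a plain Python dictionary (the hypothetical LEXICON) stands in for a compiled HFST transducer, the function name analyse_stream is made up for this example, and the greedy left-to-right longest-match strategy is a simplification rather than the algorithm referenced above. The output uses the Apertium Stream Format conventions ^surface/analysis$ for known forms and ^surface/*surface$ for unknown ones; punctuation and blanks are simply passed through here.

```python
# Hypothetical toy lexicon standing in for a compiled transducer.
# Keys are surface forms (which may contain spaces); values are analyses.
LEXICON = {
    "the": ["the<det><def>"],
    "cat": ["cat<n><sg>"],
    "cats": ["cat<n><pl>"],
    "in": ["in<pr>"],
    "in front of": ["in front of<pr>"],
}


def is_word_char(ch):
    """Characters that may not be split by a token boundary."""
    return ch.isalnum()


def analyse_stream(text, lexicon=LEXICON):
    """Tokenize and analyse text in one left-to-right pass.

    At each position the longest lexicon entry that ends on a word
    boundary wins, so multiword entries are preferred over their parts.
    """
    max_len = max(len(key) for key in lexicon)
    out = []
    i, n = 0, len(text)
    while i < n:
        if not is_word_char(text[i]):
            out.append(text[i])  # blanks and punctuation pass through
            i += 1
            continue
        match_end = None
        # Try candidate end positions from longest to shortest.
        for j in range(min(n, i + max_len), i, -1):
            if j < n and is_word_char(text[j]):
                continue  # ending here would cut a word in half
            if text[i:j].lower() in lexicon:
                match_end = j
                break
        if match_end is None:
            # Unknown word: consume up to the next boundary, mark with '*'.
            j = i
            while j < n and is_word_char(text[j]):
                j += 1
            form = text[i:j]
            out.append("^%s/*%s$" % (form, form))
            i = j
        else:
            form = text[i:match_end]
            analyses = "/".join(lexicon[form.lower()])
            out.append("^%s/%s$" % (form, analyses))
            i = match_end
    return "".join(out)


if __name__ == "__main__":
    print(analyse_stream("the cats sat in front of the cat."))
```

Note that the input is never split on whitespace beforehand: "in front of" comes out as a single token because the lookup itself decides where tokens end, which is the behaviour that line-by-line, pre-tokenized lookup cannot provide.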

Project Result:

This project will provide Apertium with a new module that allows it to handle additional languages whose morphology is too complicated for lttoolbox to deal with. Freely available data in HFST-compatible form will become usable for creating new Apertium language pairs. More immediately, the sme-nob language pair in the incubator will no longer require pipeline hacks to coerce the current HFST tools to play nice with Apertium.


Applicants should also include a two- to eight-page proposal, including a title, reasons why Google and Apertium should sponsor it, a description of how and who it will benefit, and a detailed work plan including, if possible, a schedule with milestones and deliverables. Include time needed to think, to program, to document and to disseminate.



In the proposal, list your skills and give evidence of your qualifications. Tell us what your current field of study, major, etc. is.



Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.




Other Commitments

I have very few other commitments for this summer. Provided I am selected to work on this project, I will not seek any additional employment, nor am I applying for any internships or taking classes. I expect to travel to visit some friends and family, but always with the understanding that I will continue to work full-time or near full-time on this project, and I will continue to have consistent Internet access during those times.