User:Davidc

From Apertium
Revision as of 14:57, 9 April 2010 by Davidc

GSoC 2010 Application

David Cheah – Google Summer of Code Proposal for Apertium


Name: David Cheah

Email Address: chaoticefx@gmail.com

IRC: DavidC

Personal Information


Why Machine Translation?

To be honest, I had never paid much attention to machine translation before. I simply assumed that programs were working behind the scenes and never looked further into the matter. Over the years I took it for granted as something that simply works, or at least works well enough for me to understand the gist of a webpage or a block of text.


However, after coming across Apertium while looking for a Google Summer of Code project, I did some research and realized just how expansive the realm of translation is. Machine translation is new to me, something I had never considered exploring before, and I feel that the best way to explore it is to jump straight in. Now seems like the perfect opportunity to get a guided introduction to machine translation. This desire to learn something new, engaging and extremely useful is why I chose this topic for my Google Summer of Code proposal.


Why Apertium (and Voikko)?

As a first-time applicant for Google Summer of Code, I found joining the IRC channel of any project a daunting task. Most of the projects were too big and too complex for me to be noticed. Others were difficult to access or beyond the scope of my abilities.


However, Apertium, with its relatively small team and welcoming developers, made me feel instantly at home. After finding out more about Apertium and machine translation (as mentioned above), I think Apertium is an excellent starting point for me to learn about something new.


The developers at Voikko have been similarly welcoming. This gives me confidence that I will have a good chance of completing my project despite how new I am to both projects.


Personal Qualifications

I am a first-year student at the National University of Singapore, studying Computer Engineering. I am also part of the University Scholars Programme, which allows participating students to extend their degree structure to include modules from several other faculties. In my case, I am allowed to explore modules in Business Management, the Arts, and the Natural and Social Sciences.


My command of C++ is basic: over the course of the last academic year I have been trained in ANSI C, C++ and Microsoft Visual C++ 2008. So far I have been able to fix compilation errors in Apertium under VC++ 2008, compile libraries for Voikko with MinGW and Python, and fix basic errors so that they build.


I have never worked on an open source project before, so Apertium would potentially be my first.


Other Summer Commitments

I am involved in a Freshman Orientation Camp project for the upcoming academic year. The time commitment is estimated at about one and a half weeks, spread over most of the summer. The main commitment will be several consecutive days late in the project.



The Project: Why lttoolbox-libvoikko integration?


Completing this project will allow the dictionaries of language pairs created primarily for Apertium translation to be used in Voikko and in the extensions built on it. This would extend the use of existing Apertium dictionaries to widely used programs such as OpenOffice and Mozilla Firefox. Through this, Apertium is likely to get more exposure and use, and creating a language pair in Apertium would benefit a wider community than before.


Proposal


Synopsis:

I intend to take up the idea listed on the Ideas for Google Summer of Code wiki page on improving the integration of lttoolbox in libvoikko. This includes finding and fixing class and variable name conflicts between the two code bases, as well as writing the methods necessary to allow these two libraries to work together.
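
To illustrate the kind of conflict meant here, the following minimal sketch shows two libraries that each define a class with the same name, and how placing each in its own namespace lets one program use both. The names `lt_sketch`, `voikko_sketch` and `Node` are invented for this illustration and are not taken from either code base.

```cpp
#include <string>

// Hypothetical stand-in for a class on the lttoolbox side.
namespace lt_sketch {
class Node {
public:
  std::string describe() const { return "transducer node"; }
};
}  // namespace lt_sketch

// Hypothetical stand-in for an unrelated class with the same name
// on the libvoikko side.
namespace voikko_sketch {
class Node {
public:
  std::string describe() const { return "suggestion tree node"; }
};
}  // namespace voikko_sketch

// Because the two classes live in separate namespaces, a single
// translation unit can use both without a redefinition error.
std::string describe_both() {
  lt_sketch::Node a;
  voikko_sketch::Node b;
  return a.describe() + " / " + b.describe();
}
```

Without the namespaces, the second `class Node` definition would be rejected by the compiler, which is the sort of problem the clean-up is meant to find and remove.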


In detail:


- Main Task 1: Cleaning up namespace and writing string analyzer

o Run through the code in both libraries, listing any classes or variables with the same or similar names.

o Communicate this list to the other developers to check whether altering them would cause unwanted errors.

o Change the conflicting classes/variables as the situation allows. Changes will be made mainly on the lttoolbox side.

o Code a method which allows liblttoolbox to analyse a string instead of a filestream – possibly an overloaded method?

o Make the necessary changes inside libvoikko to make sure these new methods work as planned.

- Main Task 2: Improvements

o Analyse and devise a way to set which libvoikko SuggestionGenerators to use for particular languages.

o Compile a list of these rules, and find a way to activate the respective SuggestionGenerators based on the different language passed to lttoolbox.

o Code a method to provide spelling suggestions for misspelled words. This means adding more SuggestionGenerator files and methods to encompass the languages that lttoolbox can analyse.

o (My goal for this task is to create the best spelling suggestion generator I can for one particular language, and then to document how SuggestionGenerators can be provided and controlled for other languages. I do not believe it is realistic to complete SuggestionGenerators for all supported languages.)

- Secondary Tasks:

o Write/modify the method for capital letters

o Deal with additional symbols which have not been added into the libvoikko code.
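
The "overloaded method" idea from Main Task 1 could look roughly like the sketch below: a stream-based analysis function plus a string overload that wraps the text in string streams and reuses the stream version. `ToyAnalyser` and its two-word lexicon are hypothetical and are not the lttoolbox API, whose real interface differs. The `^word/analysis$` output only loosely imitates the Apertium stream format, with `*` marking an unknown word.

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical toy analyser; NOT the lttoolbox API.
class ToyAnalyser {
public:
  ToyAnalyser() {
    // Invented two-entry lexicon, for illustration only.
    lexicon_["cats"] = "cat<n><pl>";
    lexicon_["run"] = "run<vblex><inf>";
  }

  // Stream version: reads whitespace-separated words and writes one
  // analysis per word, separated by single spaces.
  void analyse(std::istream& in, std::ostream& out) const {
    std::string word;
    bool first = true;
    while (in >> word) {
      if (!first) out << ' ';
      first = false;
      auto it = lexicon_.find(word);
      if (it != lexicon_.end()) {
        out << '^' << word << '/' << it->second << '$';
      } else {
        out << '^' << word << "/*" << word << '$';  // unknown word
      }
    }
  }

  // String overload: wraps the text in string streams and delegates
  // to the stream version, so the analysis logic exists only once.
  std::string analyse(const std::string& text) const {
    std::istringstream in(text);
    std::ostringstream out;
    analyse(in, out);
    return out.str();
  }

private:
  std::map<std::string, std::string> lexicon_;
};
```

With this shape, a caller such as a spell checker can pass a single word as a string, while existing stream-based callers keep working unchanged.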


Project Timeline:


- Bonding Period:

o Look through the whole of lttoolbox, identifying crucial changes to the API structure for cleaning up of the namespace. Communicate the necessary changes to the other developers as early as possible.

o Ensure all files are ready for coding (for both Voikko and Apertium)

o Run through and set up necessary test beds (looking at the automated tests already used in Voikko and porting them to other languages supported by Apertium)

o Run through documentation and code for a better understanding

o Attempt to compile Apertium in its entirety in Visual C++ 2008. (So far I have compiled everything up to libvoikko, but not Apertium itself.)

o Update wiki on compiling Apertium in VC++ where possible

o Compile the required list of SuggestionGenerator improvements, what new SuggestionGenerators might be needed, and set a concrete target as to how many new methods/improvements can be completed within the GSoC timeframe.

o Get a list of test cases for different languages.

o Study the language pair HOWTO to see if SuggestionGenerators can be generated from a language pair. This would greatly reduce workload in customising SuggestionGenerators for each different language.

- Month 1:

o Week 1&2: Perform as many namespace changes as possible. Also, take an in-depth look at lttoolbox’s file stream method(s) and code an alternative string version.

o Week 3&4: Start coding SuggestionGenerator files.

o DELIVERABLE 1: string analysis method more or less complete, tested with basic test cases.

- Month 2:

o Week 1,2&3: Complete new SuggestionGenerator files, finish improvements to old ones. Add new SuggestionGenerator files to SuggestionGeneratorFactory. (I will also be busy with other commitments during Week 1, which is why three weeks are allotted.)

o Week 4: Write capital letter checking algorithm, add additional symbols into libvoikko code.

o DELIVERABLE 2: Basic algorithms completed, tested with basic test cases.

- Month 3:

o Complete remaining namespace clean up and SuggestionGenerators that can be finished.

o Apply the basic test case from ApertiumIcelandicTest.py to other languages using the relevant list built during the community bonding period.

o Perform systematic testing – in particular test based on different language keyboard layouts and word forms.

o Miscellaneous work such as testing and building Apertium on Windows.
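
As a rough illustration of the per-language SuggestionGenerator selection planned in Main Task 2, the sketch below keeps a table from language code to a list of candidate generators and filters their output against a word list. Everything in it is an assumption made for illustration: the function names, the language table and the two simple strategies are invented here and are not libvoikko identifiers or its actual suggestion algorithms.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Candidates = std::vector<std::string>;
using Generator = std::function<Candidates(const std::string&)>;

// Candidates made by swapping each pair of adjacent letters.
Candidates swap_adjacent(const std::string& word) {
  Candidates out;
  for (std::size_t i = 0; i + 1 < word.size(); ++i) {
    std::string c = word;
    std::swap(c[i], c[i + 1]);
    out.push_back(c);
  }
  return out;
}

// Candidates made by deleting one letter at a time.
Candidates delete_letter(const std::string& word) {
  Candidates out;
  for (std::size_t i = 0; i < word.size(); ++i) {
    std::string c = word;
    c.erase(i, 1);
    out.push_back(c);
  }
  return out;
}

// Hypothetical rule table: which generators run for which language.
std::vector<Generator> generators_for(const std::string& lang) {
  static const std::map<std::string, std::vector<Generator>> table = [] {
    std::map<std::string, std::vector<Generator>> t;
    t["is"].push_back(swap_adjacent);
    t["es"].push_back(swap_adjacent);
    t["es"].push_back(delete_letter);
    return t;
  }();
  auto it = table.find(lang);
  if (it != table.end()) return it->second;
  std::vector<Generator> fallback;  // languages without an entry
  fallback.push_back(swap_adjacent);
  return fallback;
}

// Runs the language's generators in order and keeps each candidate
// that is in the word list, without duplicates.
Candidates suggest(const std::string& lang, const std::string& word,
                   const std::set<std::string>& known) {
  Candidates out;
  for (const Generator& gen : generators_for(lang)) {
    for (const std::string& c : gen(word)) {
      bool seen = std::find(out.begin(), out.end(), c) != out.end();
      if (known.count(c) && !seen) out.push_back(c);
    }
  }
  return out;
}
```

With a word list containing `hestur` and `casa`, `suggest("is", "hetsur", known)` yields `hestur` through the swap strategy, while `suggest("es", "cassa", known)` yields `casa` through the deletion strategy that only the `es` entry enables. In the real project the word-list lookup would presumably be replaced by a call into the lttoolbox analyser.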