Talk:Ideas for Google Summer of Code

From Apertium
Revision as of 17:11, 20 August 2008 by Francis Tyers (talk | contribs)

So, was your organization part of the Google Summer of Code last year too?

Nope, but we're hoping to be included this year -- Francis Tyers 02:45, 16 March 2008 (UTC)

From old Projects page

Writing extensions to Apertium could be the ideal undergraduate (major) project. Here are some suggestions, along with brief outlines of how you might go about starting them.

A word compounder for Germanic languages

Most Germanic languages have compound words. We can analyse compounds using LRLM (left-to-right longest match; see Agglutination and compounds), but we cannot generate them unless they are already in the dictionary, and adding them by hand is laborious. The idea of this project is to create a post-generation module that takes a series of words, e.g. in Afrikaans:

 vlote bestorming fase
 naval assault    phase

and turn them into compounds:


We don't want to compound all words, but it might be a good idea to compound those sequences that have been seen before as compounds. There are many large wordlists of compound words that could be used for this. If a candidate isn't found in a wordlist, some kind of heuristic could be used instead. We'd probably only want to compound words that are at least 5 characters long.
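The wordlist lookup described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an Apertium module: the function name `compound`, the `max_parts` limit, and the toy wordlist are all made up for the example, and the sketch joins words by plain concatenation, ignoring any linking morphemes a real language may need.

```python
MIN_PART_LENGTH = 5  # taken from the ">= 5 characters" suggestion; an assumption


def compound(words, known_compounds, max_parts=3):
    """Join runs of adjacent words whose concatenation is a known compound.

    `known_compounds` is a set of lowercase compounds, e.g. loaded from a
    wordlist. `max_parts` is a hypothetical limit on how many words to join.
    """
    result = []
    i = 0
    while i < len(words):
        joined = None
        # Try the longest run first, so longer compounds win over shorter ones.
        for n in range(min(max_parts, len(words) - i), 1, -1):
            parts = words[i:i + n]
            # Skip runs that contain a short word (articles, prepositions...).
            if any(len(part) < MIN_PART_LENGTH for part in parts):
                continue
            candidate = "".join(parts)
            if candidate.lower() in known_compounds:
                joined = candidate
                break
        if joined is not None:
            result.append(joined)
            i += n
        else:
            result.append(words[i])
            i += 1
    return result


# Toy wordlist for illustration only.
wordlist = {"somervakansie"}
print(compound(["die", "somer", "vakansie", "begin"], wordlist))
# prints ['die', 'somervakansie', 'begin']
```

A heuristic fallback for unseen candidates would slot in where the wordlist lookup fails; it is left out here because the text does not say what that heuristic should be.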

Automatic accent and diacritic insertion

One of the problems in machine translating text in real-time chat environments (and in general) is missing accents or diacritic marks. This makes machine translation hard: without the acute accent (´), traducción becomes traduccion, which is an unknown word.

There is a need for an Apertium module that would automatically restore the accents/diacritics on words that were typed without them.
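One simple way to approach this is a word-level lookup: strip the diacritics from every word in a correctly accented corpus, and remember the most frequent accented form for each stripped form. The sketch below illustrates that idea only. It is not taken from the papers listed here, the function names are hypothetical, and the corpus is toy data. It also ignores context, so genuinely ambiguous pairs (e.g. Spanish esta/está) are always resolved to the more frequent form.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Remove combining marks, e.g. 'traducción' -> 'traduccion'."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def build_restoration_table(corpus_words):
    """Map each stripped form to its most frequent accented form."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        lowered = word.lower()
        counts[strip_diacritics(lowered)][lowered] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}


def restore(words, table):
    """Replace words found in the table; leave unknown words untouched."""
    restored = []
    for word in words:
        best = table.get(word.lower())
        if best is None:
            restored.append(word)
        elif word[:1].isupper():
            # Keep an initial capital from the input.
            restored.append(best[0].upper() + best[1:])
        else:
            restored.append(best)
    return restored


# Toy corpus for illustration only.
corpus = ["traducción", "la", "traducción", "esta", "está", "está"]
table = build_restoration_table(corpus)
print(restore(["la", "traduccion", "esta", "lista"], table))
# prints ['la', 'traducción', 'está', 'lista']
```

Words that already carry their diacritics are not keys in the table, so they pass through unchanged.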

  • Simard, Michel (1998). "Automatic Insertion of Accents in French Texts". Proceedings of EMNLP-3. Granada, Spain.
  • Mihalcea, Rada F. (2002). "Diacritics Restoration: Learning from Letters versus Learning from Words". Lecture Notes in Computer Science 2276/2002, pp. 96–113.