Difference between revisions of "Ideas for Google Summer of Code/Accent and diacritic restoration"

From Apertium

Revision as of 12:33, 20 February 2012

Many languages use diacritics and accents in normal writing, and Apertium is designed to handle them. In some settings, however, such as instant messaging, IRC, and web searches, they are often left out. This causes problems because, to the engine, traduccion is not the same word as traducción. The project is to create an optional module that restores diacritics and accents in input text, and to integrate it into the Apertium pipeline.
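
To make the problem concrete, here is a minimal sketch of restoration by plain word lookup. It is not how charlifter works and is not part of Apertium; the names Lexicon and restore_diacritics are hypothetical, and the lexicon is assumed to map an unaccented word to a single preferred accented form.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical lexicon: unaccented form -> preferred accented form (UTF-8).
using Lexicon = std::unordered_map<std::string, std::string>;

// Replace every whitespace-separated word that has a lexicon entry.
// Simplifications: runs of whitespace collapse to single spaces, and
// punctuation attached to a word prevents a match.
std::string restore_diacritics(const std::string& text, const Lexicon& lexicon) {
    std::istringstream words(text);
    std::string word;
    std::string out;
    while (words >> word) {
        if (!out.empty()) {
            out += ' ';
        }
        auto hit = lexicon.find(word);
        out += (hit != lexicon.end()) ? hit->second : word;
    }
    return out;
}
```

With a lexicon containing the single entry traduccion → traducción, the input "la traduccion" comes back with the accent restored, while words without an entry pass through unchanged. A lookup like this cannot choose between several valid accented forms of the same word, which is one reason a real module needs more than a table.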

Tasks

  • Finish the port of Kevin Scannell's charlifter to C++
  • Allow rule-based replacements of character sequences.
  • ...
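
As a rough illustration of the rule-based replacement task, the sketch below applies an ordered list of sequence rewrites to a string. The Rule type and apply_rules function are hypothetical names, not existing charlifter or Apertium code, and the rule format is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical rule: (character sequence to find, replacement), both UTF-8.
using Rule = std::pair<std::string, std::string>;

// Apply each rule in order, replacing every non-overlapping occurrence.
// Later rules see the output of earlier ones, so rule order matters.
std::string apply_rules(std::string text, const std::vector<Rule>& rules) {
    for (const Rule& rule : rules) {
        if (rule.first.empty()) {
            continue;  // an empty pattern would match everywhere
        }
        std::size_t pos = 0;
        while ((pos = text.find(rule.first, pos)) != std::string::npos) {
            text.replace(pos, rule.first.size(), rule.second);
            pos += rule.second.size();  // skip past the inserted text
        }
    }
    return text;
}
```

One possible use is the Esperanto x-convention, where rules such as cx → ĉ and ux → ŭ turn "hodiaux" into "hodiaŭ". Blind substitution like this is only safe when the sequence is unambiguous in the language, so in practice such rules would probably complement the statistical model rather than replace it.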

Coding challenge

  • Install Apertium
  • Check out and compile last year's charlifter
  • Make it output properly encoded Unicode instead of hexadecimal characters.
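
The last step of the challenge could be approached along the following lines. This is only a sketch: the exact hexadecimal format that the charlifter port prints is not described here, so the code assumes escapes of the form \uXXXX with exactly four hex digits. The names encode_utf8 and unescape_hex are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// No validation: surrogates and values above U+10FFFF are not rejected.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Replace every assumed \uXXXX escape with the UTF-8 bytes of that
// code point; all other input is copied through unchanged.
std::string unescape_hex(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool is_escape = i + 6 <= in.size() && in[i] == '\\' && in[i + 1] == 'u';
        for (std::size_t k = 2; is_escape && k < 6; ++k) {
            is_escape = std::isxdigit(static_cast<unsigned char>(in[i + k])) != 0;
        }
        if (is_escape) {
            unsigned long value = std::stoul(in.substr(i + 2, 4), nullptr, 16);
            out += encode_utf8(static_cast<std::uint32_t>(value));
            i += 6;
        } else {
            out += in[i];
            ++i;
        }
    }
    return out;
}
```

Under that assumption, the text traducci\u00f3n would become traducción encoded as UTF-8. If the port emits a different notation, only the pattern check in unescape_hex should need to change; the encoder follows the standard UTF-8 byte layout.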

Frequently asked questions

Previous GSOC projects