Ideas for Google Summer of Code/Robust tokenisation

From Apertium


Task

  • Update lttoolbox to be fully Unicode compliant with regard to alphabetic characters.


Coding challenge

Write a program that uses Unicode character data to classify each character in an input stream as alphabetic (C) or non-alphabetic (X).

For example:

echo "This! Is a tešt тест ** % test." | ./classify-symbols
C T
C h
C i 
C s 
X ! 
X  
C I 
C s

...
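
One possible starting point for the challenge is sketched below in Python, using the standard unicodedata module. It is only an approximation, under some assumptions: a character is treated as alphabetic when its Unicode general category starts with "L" (letters), whereas the full Unicode Alphabetic property also covers letter numbers (Nl) and Other_Alphabetic characters. The function names is_alphabetic and classify are made up for illustration, and skipping the line break at the end of each input line is likewise an assumption, since the example output above does not show how it should be handled.

```python
#!/usr/bin/env python3
"""Sketch of a classify-symbols tool: label each input character C or X."""
import sys
import unicodedata


def is_alphabetic(ch: str) -> bool:
    # Assumption: "alphabetic" is approximated by general categories
    # Lu, Ll, Lt, Lm and Lo. The Unicode Alphabetic property is broader
    # (it also includes Nl and Other_Alphabetic code points).
    return unicodedata.category(ch).startswith("L")


def classify(text: str):
    # Return one (label, character) pair per character:
    # "C" for alphabetic, "X" for everything else.
    return [("C" if is_alphabetic(ch) else "X", ch) for ch in text]


def main() -> None:
    for line in sys.stdin:
        # Assumption: the trailing line break is not classified.
        for label, ch in classify(line.rstrip("\n")):
            print(f"{label} {ch}")


if __name__ == "__main__":
    main()
```

Saved as an executable file named classify-symbols, this can be run with the echo command shown above. With this approximation, the non-ASCII letters š and т are labelled C, while !, *, % and the space are labelled X.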


Further reading