
Ideas for Google Summer of Code/Robust tokenisation



Task

  • Update lttoolbox to be fully Unicode compliant with regard to alphabetic symbols.


Coding challenge

Write a program that uses data from Unicode to classify each character in an input stream as alphabetic or non-alphabetic.

e.g.

echo "This! Is a tešt тест ** % test." | ./classify-symbols
C T
C h
C i 
C s 
X ! 
X  
C I 
C s

...
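
One possible starting point for the challenge, as a minimal sketch rather than a reference solution. It assumes that "alphabetic" can be approximated by the Unicode general categories beginning with L (Lu, Ll, Lt, Lm, Lo), which is narrower than the full Unicode Alphabetic property. It also assumes that line terminators are skipped and that the input is decoded as UTF-8 by the environment. The function names classify and classify_stream are hypothetical and do not come from lttoolbox.

```python
#!/usr/bin/env python3
"""Sketch of a classify-symbols filter: tag each input character C or X."""
import sys
import unicodedata


def classify(ch):
    """Return 'C' if ch is a letter (general category L*), else 'X'.

    Assumption: general category L* stands in for "alphabetic"; the
    Unicode Alphabetic property also covers some marks and letter-numbers.
    """
    return "C" if unicodedata.category(ch).startswith("L") else "X"


def classify_stream(text):
    """Yield one 'TAG char' string per character of text."""
    for ch in text:
        if ch in "\r\n":
            continue  # assumption: line terminators are not reported
        yield f"{classify(ch)} {ch}"


if __name__ == "__main__":
    # Read the whole input stream and print one classification per line.
    for line in classify_stream(sys.stdin.read()):
        print(line)
```

Saved as an executable file named classify-symbols, this can be run with the echo pipeline shown above. A fuller solution would probably read the Alphabetic property from the Unicode data files instead of relying on general categories.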


Further reading
