Difference between revisions of "Ideas for Google Summer of Code/Robust tokenisation"

Revision as of 01:06, 31 January 2019

Update lttoolbox to be fully Unicode compliant with regard to alphabetic symbols.

Write a program that uses data from Unicode to classify characters in an input stream into alphabetic and non-alphabetic.

e.g.

echo "This! Is a tešt тест ** % test." | ./classify-symbols
C T
C h
C i 
C s 
X ! 
X  
C I 
C s

...
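One possible shape for such a program is sketched below in Python. This is only an illustration under stated assumptions, not the expected solution. It treats a character as alphabetic when its Unicode general category starts with `L` (letter), which approximates but does not equal the Unicode `Alphabetic` property, since that property also covers some marks and letter-like numbers. The function names `classify` and `classify_stream` are hypothetical, and the `C`/`X` labels are taken from the example output above. The sketch also assumes that line-ending newlines are skipped rather than classified.

```python
import unicodedata


def classify(ch):
    """Return 'C' for letters, 'X' for anything else.

    Assumption: "alphabetic" is approximated by the general
    categories Lu, Ll, Lt, Lm and Lo (all start with 'L').
    """
    return "C" if unicodedata.category(ch).startswith("L") else "X"


def classify_stream(lines):
    """Yield one 'LABEL CHAR' string per character of each line.

    Assumption: the trailing newline of a line is not classified.
    """
    for line in lines:
        for ch in line.rstrip("\n"):
            yield f"{classify(ch)} {ch}"


if __name__ == "__main__":
    # Small fixed sample so the sketch can be run on its own.
    for row in classify_stream(["tešt тест!\n"]):
        print(row)
```

To read from a pipe as in the example, `classify_stream(sys.stdin)` could be used in place of the fixed sample (after `import sys`). The results depend on the Unicode version bundled with the Python interpreter's `unicodedata` module.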

@@ Line 8: / Line 8: @@
 ==Coding challenge==
+Write a program that uses data from Unicode to classify characters in an input stream into alphabetic and non-alphabetic.
-* Remove all multiwords from an Apertium language pair and put them in an [[apertium-separable]] dictionary.
-* Make sure that the output before/after is identical.
+e.g.
+<pre>
+echo "This! Is a tešt тест ** % test." | ./classify-symbols
+C T
+C h
+C i
+C s
+X !
+X
+C I
+C s
+...
+</pre>
 == Further readings ==