Difference between revisions of "Ideas for Google Summer of Code/Robust tokenisation"

From Apertium
Revision as of 01:06, 31 January 2019


Task

  • Update lttoolbox to be fully Unicode-compliant in its handling of alphabetic characters.


Coding challenge

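Write a program that uses Unicode character data to classify each character in an input stream as alphabetic or non-alphabetic. The program should print one character per line, prefixed with C if the character is alphabetic and X otherwise.

For example: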

echo "This! Is a tešt тест ** % test." | ./classify-symbols
C T
C h
C i 
C s 
X ! 
X  
C I 
C s

...
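
One possible starting point for the challenge is sketched below in Python. It is only a sketch under stated assumptions, not a reference solution: it treats a character as alphabetic when its Unicode general category starts with "L" (a letter), which approximates but does not exactly match the Unicode Alphabetic property, and it does not report the line terminator. The function names is_alphabetic and classify are made up for illustration.

```python
import sys
import unicodedata


def is_alphabetic(ch: str) -> bool:
    """Return True if ch is a letter according to the Unicode database.

    Assumption: general category L* (Lu, Ll, Lt, Lm, Lo) is used as an
    approximation of the Unicode 'Alphabetic' property.
    """
    return unicodedata.category(ch).startswith("L")


def classify(text: str):
    """Yield (tag, char) pairs: 'C' for alphabetic, 'X' for everything else."""
    for ch in text:
        yield ("C" if is_alphabetic(ch) else "X", ch)


def main() -> None:
    for line in sys.stdin:
        # Assumption: the trailing newline of each line is not classified.
        for tag, ch in classify(line.rstrip("\n")):
            print(tag, ch)


if __name__ == "__main__":
    main()
```

Saved as an executable script named classify-symbols, this could be run with the echo command shown above. Non-ASCII letters such as š and т are classified as C because the lookup goes through the Unicode database rather than an ASCII range check.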


Further reading