Difference between revisions of "Ideas for Google Summer of Code/Robust tokenisation"
TommiPirinen (talk | contribs)
Line 8:
  ==Coding challenge==
+ Write a program that uses data from Unicode to classify characters in an input stream into alphabetic and non-alphabetic.
− * Remove all multiwords from an Apertium language pair and put them in an [[apertium-separable]] dictionary.
+
− * Make sure that the output before/after is identical.
+ e.g.
+
+ <pre>
+ echo "This! Is a tešt тест ** % test." | ./classify-symbols
+ C T
+ C h
+ C i
+ C s
+ X !
+ X  
+ C I
+ C s
+
+ ...
+ </pre>
+
  == Further readings ==
Revision as of 01:06, 31 January 2019
Task
- Update lttoolbox to be fully Unicode compliant with regard to alphabetic symbols.
Coding challenge
Write a program that uses data from Unicode to classify characters in an input stream into alphabetic and non-alphabetic.
e.g.
echo "This! Is a tešt тест ** % test." | ./classify-symbols
C T
C h
C i
C s
X !
X  
C I
C s
...
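One possible starting point for the challenge, as a minimal Python sketch rather than a reference solution. It assumes that "alphabetic" can be approximated by the Unicode general categories beginning with L (Lu, Ll, Lt, Lm, Lo), which is what the standard `unicodedata` module exposes. The actual Unicode Alphabetic property is broader (it also covers Nl and Other_Alphabetic code points), so a complete solution would read it from the Unicode data files. The names `is_letter` and `classify_stream` are hypothetical, and dropping the trailing newline of each line is an assumption, since the example output above is truncated.

```python
import io
import sys
import unicodedata


def is_letter(ch):
    """Approximate 'alphabetic' as general category L* (an assumption)."""
    return unicodedata.category(ch).startswith("L")


def classify_stream(stream, out=sys.stdout):
    """Write one 'C <char>' or 'X <char>' line per input character."""
    for line in stream:
        # Assumption: the newline that ends each input line is not classified.
        for ch in line.rstrip("\n"):
            label = "C" if is_letter(ch) else "X"
            out.write(f"{label} {ch}\n")


if __name__ == "__main__":
    # Demo on the sample sentence from the challenge text.
    classify_stream(io.StringIO("This! Is a tešt тест ** % test.\n"))
```

To use the sketch as a filter in the style of `./classify-symbols`, pass `sys.stdin` to `classify_stream` instead of the sample string.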