Difference between revisions of "Ideas for Google Summer of Code/Robust tokenisation"
TommiPirinen (talk | contribs)
Line 8:
  ==Coding challenge==

+ Write a program that uses data from Unicode to classify characters in an input stream into alphabetic and non-alphabetic.
− * Remove all multiwords from an Apertium language pair and put them in an [[apertium-separable]] dictionary.
− * Make sure that the output before/after is identical.
+ e.g.
+ <pre>
+ echo "This! Is a tešt тест ** % test." | ./classify-symbols
+ C T
+ C h
+ C i
+ C s
+ X !
+ X  
+ C I
+ C s
+ ...
+ </pre>

  == Further readings ==
Revision as of 01:06, 31 January 2019
Task
- Update lttoolbox to be fully Unicode compliant with regard to alphabetic symbols.
Coding challenge
Write a program that uses data from Unicode to classify characters in an input stream into alphabetic and non-alphabetic.
e.g.
echo "This! Is a tešt тест ** % test." | ./classify-symbols
C T
C h
C i
C s
X !
X  
C I
C s
...
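One possible shape for such a classifier is sketched below in Python; it is an illustration under stated assumptions, not a reference solution. It assumes that "alphabetic" means a Unicode general category starting with L (Lu, Ll, Lt, Lm, Lo), that input arrives on standard input as UTF-8, and that line terminators are skipped rather than classified. The function names `classify` and `classify_stream` are made up for this example.

```python
import sys
import unicodedata


def classify(ch):
    """Return 'C' for alphabetic characters and 'X' for everything else."""
    # Assumption: "alphabetic" is taken to mean general category L*
    # (Lu, Ll, Lt, Lm, Lo), which is what the standard library exposes.
    return "C" if unicodedata.category(ch).startswith("L") else "X"


def classify_stream(text):
    """Yield one 'LABEL CHAR' line per character of the input text."""
    for ch in text:
        # Assumption: line terminators are skipped instead of being labelled.
        if ch in "\r\n":
            continue
        yield f"{classify(ch)} {ch}"


if __name__ == "__main__":
    # Read the whole input stream and print one classification per line.
    for line in classify_stream(sys.stdin.read()):
        print(line)
```

Saved under a hypothetical name such as `classify_symbols.py`, it could be run as `echo "This! Is a tešt тест ** % test." | python3 classify_symbols.py`. The general-category test is narrower than the derived Unicode `Alphabetic` property, which also covers letter numbers and some combining marks; Python's `unicodedata` module does not expose that property, so a solution aiming at full compliance would likely need to read it from the Unicode data files or use a library such as ICU.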