Difference between revisions of "Ideas for Google Summer of Code/Robust tokenisation"

Line 8:
 ==Coding challenge==
-* Remove all multiwords from an Apertium language pair and put them in an [[apertium-separable]] dictionary.
-* Make sure that the output before/after is identical.
+Write a program that uses data from Unicode to classify characters in an input stream into alphabetic and non-alphabetic.
+
+e.g.
+
+<pre>
+echo "This! Is a tešt тест ** % test." | ./classify-symbols
+C T
+C h
+C i
+C s
+X !
+X  
+C I
+C s
+
+...
+</pre>
+
 == Further readings ==

Revision as of 01:06, 31 January 2019


Task

  • Update lttoolbox to be fully Unicode compliant with regard to alphabetic symbols.


Coding challenge

Write a program that uses data from Unicode to classify characters in an input stream into alphabetic and non-alphabetic.

For example, with C marking an alphabetic character and X a non-alphabetic one:

echo "This! Is a tešt тест ** % test." | ./classify-symbols
C T
C h
C i 
C s 
X ! 
X  
C I 
C s

...
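
A minimal sketch of such a classifier, written in Python rather than the C++ used by lttoolbox. It rests on two assumptions that the task text does not confirm. First, it approximates "alphabetic" with the Unicode general categories for letters (L*) plus letter numbers (Nl), whereas the full Unicode Alphabetic property also covers Other_Alphabetic code points such as some combining marks. Second, it drops the trailing newline of each input line instead of classifying it. The names `is_alphabetic` and `classify` are illustrative only.

```python
import io
import unicodedata

# General categories treated as alphabetic in this sketch (an approximation
# of the Unicode "Alphabetic" property, which is somewhat broader).
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def is_alphabetic(ch: str) -> bool:
    """Return True if the character's general category is a letter category."""
    return unicodedata.category(ch) in ALPHABETIC_CATEGORIES


def classify(stream):
    """Yield 'C <char>' for alphabetic and 'X <char>' for all other characters.

    `stream` is any iterable of text lines; the line-ending newline is
    skipped (an assumption, since the sample output is elided).
    """
    for line in stream:
        for ch in line.rstrip("\n"):
            yield ("C " if is_alphabetic(ch) else "X ") + ch


# Demonstration on the sample sentence from the task description.
sample = io.StringIO("This! Is a tešt тест ** % test.\n")
for row in classify(sample):
    print(row)
```

Passing `sys.stdin` to `classify` instead of the `io.StringIO` sample would turn the sketch into a filter that can sit at the end of a pipe, as in the example above.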


Further readings