Difference between revisions of "Ideas for Google Summer of Code/Robust tokenisation"

Revision as of 13:06, 29 January 2018

Update lttoolbox so that its handling of alphabetic symbols is fully Unicode compliant.

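To illustrate what Unicode compliance for alphabetic symbols means in practice, here is a minimal sketch in Python. It is not lttoolbox code: the helper names `is_ascii_letter`, `is_unicode_letter` and `split_words` are hypothetical, and the letter test uses only the Unicode general category, which is a simplification of the full Alphabetic property.

```python
import unicodedata

# "žena čeká" written with escapes so the precomposed code points are unambiguous.
SAMPLE = "\u017eena \u010dek\u00e1"

def is_ascii_letter(ch: str) -> bool:
    """ASCII-only letter test: the kind of check that breaks on non-English text."""
    return "a" <= ch <= "z" or "A" <= ch <= "Z"

def is_unicode_letter(ch: str) -> bool:
    """Letter test based on the Unicode general category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

def split_words(text, is_letter):
    """Group maximal runs of characters accepted by is_letter into words."""
    words, current = [], []
    for ch in text:
        if is_letter(ch):
            current.append(ch)
        elif current:
            # A non-letter ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

print(split_words(SAMPLE, is_ascii_letter))    # words are cut at every non-ASCII letter
print(split_words(SAMPLE, is_unicode_letter))  # words stay intact
```

The ASCII-only test splits words at every accented letter, while the category-based test keeps them whole. This sketch still ignores combining marks (general category M) and other cases covered by the word boundary rules in Unicode TR29, so it should be read as a demonstration of the problem rather than a complete solution.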

@@ Line 10: / Line 10: @@
 * Remove all multiwords from an Apertium language pair and put them in an [[apertium-separable]] dictionary.
 * Make sure that the output before/after is identical.
+== Further reading ==
+* https://github.com/hfst/hfst/blob/master/tools/src/hfst-tokenize.cc
+* https://unicode.org/reports/tr29/
 [[Category:Google Summer of Code|Robust tokenisation]]