Ideas for Google Summer of Code/Robust tokenisation
Apertium has a custom tokenisation algorithm based on the alphabet that the dictionary writer defines in the dictionary file, plus, in part, the characters found in the actual dictionary entries. This leads to some hard-to-understand problems in the pipeline, especially when HFST-based analysers are used. Furthermore, the tokenisation is suboptimal for languages that have no non-word characters (e.g. whitespace) to separate words. The various whitespace, hyphen and zero-width characters are also handled inconsistently.
- Names etc. with accented characters not in the alphabet or dictionary: Müller should be one token even if ü does not appear in the dictionary or alphabet
- Compounding strategies: banana-door may be 1 or 2 tokens depending on the dictionary writer's preferences, and the result should not be affected by whether - is the Unicode character HYPHEN-MINUS, HYPHEN or EN DASH; a strategy must also consider the case where - is replaced with ZERO WIDTH JOINER or even NO-BREAK SPACE
- No-space scripts (is this solved by https://github.com/chanlon1/tokenisation ?)
- Update lttoolbox to be fully Unicode compliant with regards to alphabetical symbols.
- Allow dictionary developers some control over tokenisation
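The first two points above can be illustrated with a minimal sketch. It is not how lttoolbox works today; it only shows the idea of deciding word characters from Unicode general categories instead of a dictionary alphabet, and of treating several hyphen-like characters as equivalent. The names `JOINERS`, `is_word_char`, `tokenise` and the `join_compounds` switch are hypothetical, and the sketch simply drops non-word characters rather than emitting them.

```python
import unicodedata

# Hypothetical set of characters treated as equivalent word-internal joiners.
JOINERS = {
    "\u002D",  # HYPHEN-MINUS
    "\u2010",  # HYPHEN
    "\u2013",  # EN DASH
    "\u200D",  # ZERO WIDTH JOINER
    "\u00A0",  # NO-BREAK SPACE
}


def is_word_char(ch):
    # Letters (L*) and combining marks (M*) count as word characters,
    # so both precomposed and decomposed forms of "ü" stay inside a token.
    return unicodedata.category(ch)[0] in ("L", "M")


def tokenise(text, join_compounds=True):
    # join_compounds stands in for the dictionary writer's preference:
    # True keeps "banana-door" as one token, False splits it in two.
    tokens = []
    current = ""
    for i, ch in enumerate(text):
        if is_word_char(ch):
            current += ch
        elif (join_compounds and ch in JOINERS and current
              and i + 1 < len(text) and is_word_char(text[i + 1])):
            # A joiner between two word characters stays inside the token.
            current += ch
        else:
            # Any other character ends the current token and is dropped.
            if current:
                tokens.append(current)
                current = ""
    if current:
        tokens.append(current)
    return tokens
```

With this sketch, `tokenise("Müller")` gives one token without ü being listed anywhere, and `tokenise("banana\u2013door")` behaves the same as `tokenise("banana-door")` for either setting of `join_compounds`.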
The final algorithm should be an improvement upon the current tokenisation, so care needs to be taken that the original ideas, such as inconditional and other dictionary blocks, keep working. I suggest test-driven development for your plan.
Write a program that uses data from Unicode to classify characters in an input stream into alphabetic and non-alphabetic.
echo "This! Is a tešt тест ** % test." | ./classify-symbols
C T
C h
C i
C s
X !
X  
C I
C s
...
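One possible starting point for such a program, as a minimal Python sketch: it uses the standard `unicodedata` module and approximates "alphabetic" as any general category starting with `L`, which is an assumption and not the same as the full Unicode Alphabetic property. The function names `classify` and `classify_stream` are made up for illustration, and newlines are skipped so that each output line holds one character.

```python
import sys
import unicodedata


def classify(ch):
    # "C" for letters (general categories Lu, Ll, Lt, Lm, Lo), "X" otherwise.
    # Assumption: category L* is used as a stand-in for "alphabetic".
    return "C" if unicodedata.category(ch).startswith("L") else "X"


def classify_stream(text):
    # One "<class> <char>" entry per input character; newlines are skipped.
    return [f"{classify(ch)} {ch}" for ch in text if ch != "\n"]


def main(stream=sys.stdin):
    for line in classify_stream(stream.read()):
        print(line)


if __name__ == "__main__":
    main()
```

Saved as an executable script with a Python shebang line, this can be run as in the example above. Combining marks (categories `M*`) come out as `X` here, which a real solution would probably want to handle differently.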