Ideas for Google Summer of Code/Robust tokenisation
Contributor: TommiPirinen
Further reading
- https://github.com/hfst/hfst/blob/master/tools/src/hfst-tokenize.cc
- https://unicode.org/reports/tr29/
Category: Google Summer of Code
Revision as of 13:06, 29 January 2018
Task
- Update lttoolbox to be fully Unicode compliant with regard to alphabetic symbols.
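To make the goal concrete, here is a minimal Python sketch of splitting text into words using Unicode character data instead of an ASCII-only letter test. It is not how lttoolbox is implemented; the function names `is_word_char` and `split_words` are hypothetical, and treating the general categories L* (letters) and M* (combining marks) as "alphabetic" is a simplifying assumption that only approximates the Unicode Alphabetic property and the word-boundary rules of UAX #29.

```python
import unicodedata
from typing import List


def is_word_char(ch: str) -> bool:
    """Approximate 'alphabetic': any letter (L*) or combining mark (M*)."""
    return unicodedata.category(ch)[0] in ("L", "M")


def split_words(text: str) -> List[str]:
    """Split text into maximal runs of word characters."""
    words: List[str] = []
    current: List[str] = []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            # A non-word character closes the run collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

Keeping combining marks inside the run matters for scripts such as Devanagari and for decomposed accents, where a test limited to letters would cut a word in the middle.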
Coding challenge
- Remove all multiwords from an Apertium language pair and put them in an apertium-separable dictionary.
- Make sure that the translation output is identical before and after the change.
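One way to check the second point is to translate the same test corpus before and after moving the multiwords, then compare the two outputs line by line. The sketch below uses Python's standard `difflib`; the function name `output_diff` is hypothetical, and how the two outputs are produced is left to whatever procedure you normally use to run the language pair.

```python
import difflib
from typing import List


def output_diff(before: str, after: str) -> List[str]:
    """Return unified-diff lines between two outputs; empty if identical."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )
```

An empty result means the outputs match; any returned lines show which sentences changed and therefore which multiword entries need another look.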