Difference between revisions of "Ideas for Google Summer of Code/Robust tokenisation"

From Apertium

Revision as of 13:06, 29 January 2018


Task

  • Update lttoolbox to be fully Unicode compliant with regard to alphabetic symbols.
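
As a rough illustration of what the task is about, the Python sketch below contrasts an ASCII-only letter test with one based on Unicode general categories. It is not lttoolbox code, and the function names are hypothetical. Treating categories L* (letters) and M* (combining marks) as word characters is only an approximation of the Unicode Alphabetic property, used here as an assumption.

```python
import unicodedata


def is_ascii_letter(ch: str) -> bool:
    """Naive test: only the 52 unaccented Latin letters count."""
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z")


def is_unicode_letter(ch: str) -> bool:
    """True if the general category is a letter (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def is_word_char(ch: str) -> bool:
    """Letters plus combining marks (Mn, Mc, Me), so that a decomposed
    sequence such as 'e' + U+0301 stays inside a single word.
    Assumption: this approximates, but does not equal, the Alphabetic
    property."""
    return unicodedata.category(ch)[0] in ("L", "M")


if __name__ == "__main__":
    for ch in ["a", "ñ", "ә", "\u0301", "1", "-"]:
        print(repr(ch), is_ascii_letter(ch), is_unicode_letter(ch), is_word_char(ch))
```

The ASCII test rejects letters such as "ñ" or Cyrillic "ә", which is the kind of gap the task is meant to close.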


Coding challenge

  • Remove all multiwords from an Apertium language pair and put them in an apertium-separable dictionary.
  • Make sure that the language pair's output is identical before and after the change.
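
One way to check the second point is to translate the same test corpus before and after the change, save both outputs, and compare them line by line. The Python sketch below assumes the two outputs have already been written to text files; the file paths and function names are hypothetical.

```python
from itertools import zip_longest


def first_difference(before, after):
    """Return (line_number, before_line, after_line) for the first
    mismatch, or None if both sequences are identical.
    A missing line (one output is shorter) is reported as None."""
    pairs = zip_longest(before, after)
    for lineno, (b, a) in enumerate(pairs, start=1):
        if b != a:
            return lineno, b, a
    return None


def outputs_identical(path_before, path_after):
    """Compare two saved output files (hypothetical paths)."""
    with open(path_before, encoding="utf-8") as fb, \
         open(path_after, encoding="utf-8") as fa:
        diff = first_difference(fb.read().splitlines(),
                                fa.read().splitlines())
    if diff is not None:
        lineno, b, a = diff
        print(f"line {lineno}: before={b!r} after={a!r}")
    return diff is None
```

Reporting the first mismatching line, rather than only a yes/no answer, makes it easier to find which multiword entry changed behaviour.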

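Further readings

  • https://github.com/hfst/hfst/blob/master/tools/src/hfst-tokenize.cc
  • https://unicode.org/reports/tr29/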