Ideas for Google Summer of Code/Robust tokenisation


Revision as of 14:06, 29 January 2018


Task

  • Update lttoolbox to be fully Unicode compliant with regard to alphabetic symbols.
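
To illustrate what "Unicode compliant" could mean for alphabetic symbols, here is a minimal Python sketch that groups characters into alphabetic runs by Unicode general category rather than by a hand-written list of letters. It is not how lttoolbox works, and it is much simpler than full Unicode text segmentation (UAX #29). The function names `is_alphabetic` and `alphabetic_runs` are made up for this example.

```python
import unicodedata
from typing import List


def is_alphabetic(ch: str) -> bool:
    """Return True if ch is a letter or a combining mark.

    Assumption for this sketch: general categories starting with "L"
    (letters) or "M" (marks, e.g. combining accents and vowel signs)
    count as alphabetic. A mark with no preceding letter is also accepted.
    """
    return unicodedata.category(ch)[0] in ("L", "M")


def alphabetic_runs(text: str) -> List[str]:
    """Split text into maximal runs of alphabetic characters."""
    runs: List[str] = []
    current: List[str] = []
    for ch in text:
        if is_alphabetic(ch):
            current.append(ch)
        elif current:
            # A non-alphabetic character closes the current run.
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


if __name__ == "__main__":
    for sample in ("naïve café, Ελληνικά!", "cafe\u0301 x", "हिन्दी"):
        print(alphabetic_runs(sample))
```

Treating combining marks as part of a run keeps decomposed forms such as "e" followed by U+0301 in one token, which an ASCII-only or precomposed-only alphabet would split.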


Coding challenge

  • Remove all multiwords from an Apertium language pair and put them in an apertium-separable dictionary.
  • Make sure that the translation output is identical before and after the change.
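
The two steps of the coding challenge can be sketched in Python as follows. The sketch assumes that a multiword is a dictionary entry containing a `<b/>` (blank) element, which is how spaces inside entries are written in lttoolbox dictionaries; a given language pair may encode some multiwords differently. The function names, the sample dictionary and its paradigm names are hypothetical, and writing the entries out to an apertium-separable dictionary is not shown.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest

# Hypothetical sample dictionary, for illustration only.
SAMPLE_DIX = """
<dictionary>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>
    <e lm="a lot"><i>a<b/>lot</i><par n="a_lot__adv"/></e>
  </section>
</dictionary>
"""


def multiword_entries(dix_xml: str):
    """Return the <e> elements that contain a <b/> anywhere inside them."""
    root = ET.fromstring(dix_xml)
    return [entry for entry in root.iter("e")
            if entry.find(".//b") is not None]


def first_difference(before: str, after: str):
    """Return (line number, old line, new line) for the first mismatch.

    Returns None when both outputs are identical. A missing line on
    either side is reported as None.
    """
    pairs = zip_longest(before.splitlines(), after.splitlines())
    for number, (old, new) in enumerate(pairs, start=1):
        if old != new:
            return number, old, new
    return None


if __name__ == "__main__":
    for entry in multiword_entries(SAMPLE_DIX):
        print(entry.get("lm"))
    print(first_difference("a\nb", "a\nc"))
```

In practice the `before` and `after` strings would be the output of the language pair on the same test corpus, captured before and after the multiwords were moved.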

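Further readings

  • https://github.com/hfst/hfst/blob/master/tools/src/hfst-tokenize.cc
  • https://unicode.org/reports/tr29/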
