Ideas for Google Summer of Code/Improving support for non-standard text input
Revision as of 12:50, 10 March 2014 by Francis Tyers (talk | contribs)
Create a module that will standardise non-standard input, for example slang and abbreviations.
Some examples from English
- Extra space: "he he" (hehe)
- Spacing and hyphen variation: no-one, noone, no one
- Optional hyphen: re-integrate, reintegrate
- Missing apostrophe: "shes thinking about it" (she's)
- Non-standard capitalisation: "im thinking about it" (I'm)
- Abbreviated words: fav (favourite)
- Emoticons: :)
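As a rough illustration of what such a module might do, here is a minimal rule-based sketch in Python. The lookup tables, the function name `normalise`, and the choice of "no one" and "reintegrate" as the standard forms are all assumptions made for this example, seeded only with the cases listed above. A real module would need far larger resources and some way of handling ambiguity.

```python
import re

# Hypothetical lookup tables, seeded only with the examples from the list above.
# Which variant counts as "standard" is an assumption made for this sketch.
SPACING_VARIANTS = {
    "he he": "hehe",
    "noone": "no one",
    "no-one": "no one",
    "re-integrate": "reintegrate",
}
WORD_FORMS = {
    "shes": "she's",      # missing apostrophe
    "im": "I'm",          # missing apostrophe and non-standard capitalisation
    "fav": "favourite",   # abbreviated word
}

WORD = re.compile(r"[A-Za-z]+")


def _replace_word(match: re.Match) -> str:
    """Swap a single word for its standard form, keeping an initial capital."""
    word = match.group(0)
    standard = WORD_FORMS.get(word.lower())
    if standard is None:
        return word
    if word[0].isupper():
        standard = standard[0].upper() + standard[1:]
    return standard


def normalise(text: str) -> str:
    """Apply the spacing/hyphen rules first, then the single-word rules."""
    for variant, standard in SPACING_VARIANTS.items():
        pattern = rf"\b{re.escape(variant)}\b"
        text = re.sub(pattern, standard, text, flags=re.IGNORECASE)
    # Emoticons such as ":)" contain no table entries, so they pass through unchanged.
    return WORD.sub(_replace_word, text)


if __name__ == "__main__":
    print(normalise("im thinking about it :)"))
```

The rules are deliberately naive: "he he" can also be two legitimate pronouns, and "im" or "shes" could be tokens in another language, so a practical module would probably need context or per-language word lists rather than fixed tables.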
Coding challenge
- Make a test corpus of non-standard texts for a particular domain (could be IRC, tweets, forums etc.)
- Translate them with Apertium
- Come up with examples of non-standard features that affect translation quality
- Propose ways in which these problems might be solved.
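One possible way to find features that hurt translation is to count the words Apertium could not analyse. By default Apertium prefixes unknown words with an asterisk in its output, so a short script can tally them across the translated corpus. The sketch below assumes the translations have already been produced and saved as lines of text. The function name `unknown_words` and the sample lines in the usage note are invented for illustration and are not real Apertium output.

```python
import re
from collections import Counter
from typing import Iterable

# Apertium marks words it could not analyse with a leading asterisk, e.g. "*shes".
UNKNOWN = re.compile(r"\*(\w+)")


def unknown_words(translated_lines: Iterable[str]) -> Counter:
    """Count asterisk-marked (unknown) words across translated lines."""
    counts: Counter = Counter()
    for line in translated_lines:
        counts.update(word.lower() for word in UNKNOWN.findall(line))
    return counts
```

For example, given the hand-written lines `["*shes pensando en ello", "mi *fav"]`, the function returns a counter with one occurrence each of `shes` and `fav`. Sorting the result with `most_common()` gives a first list of candidates to inspect by hand. It will not catch non-standard forms that happen to coincide with a known word, so it complements manual review rather than replacing it.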
Tasks
- Do a literature review of papers on normalisation of input.