Ideas for Google Summer of Code/Improving support for non-standard text input
Latest revision as of 12:50, 10 March 2014
Create a module that will standardise non-standard input, for example slang and abbreviations.
Some examples from English
- Extra space: "he he" (hehe)
- Spacing and hyphen variation: no-one, noone, no one
- Optional hyphen: re-integrate, reintegrate
- Missing apostrophe: "shes thinking about it" (she's)
- Non-standard capitalisation: "im thinking about it" (I'm)
- Abbreviated words: fav (favourite)
- Emoticons: :)
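Several of the patterns above could be handled with a lookup table plus a couple of regular expressions. The following is a minimal sketch under stated assumptions, not an existing Apertium module: the names `REPLACEMENTS`, `EMOTICON_RE` and `normalise` are hypothetical, the word list only covers the examples on this page, and stripping emoticons (rather than protecting them from translation) is just one possible choice.

```python
import re

# Hypothetical lookup table: non-standard form -> standard form.
# Entries are illustrative only, taken from the examples on this page.
REPLACEMENTS = {
    "he he": "hehe",
    "noone": "no one",
    "no-one": "no one",
    "re-integrate": "reintegrate",
    "shes": "she's",
    "im": "I'm",
    "fav": "favourite",
}

# One alternation over all keys, longest first so multi-word entries win.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

# Very small emoticon pattern (assumption): eyes, optional nose, mouth.
EMOTICON_RE = re.compile(r"(?<!\w)[:;=]-?[)(DPp](?!\w)")


def normalise(text: str) -> str:
    """Apply the toy rules: drop emoticons, replace known forms, tidy spaces."""
    text = EMOTICON_RE.sub("", text)
    text = _PATTERN.sub(lambda m: REPLACEMENTS[m.group(0).lower()], text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalise("im thinking about it :)"))
```

A table like this is context-blind: "im" or "fav" may be legitimate words in other languages or domains, so a real module would probably need per-language lists and some way of checking candidates against the dictionary.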
Coding challenge
- Make a test corpus of non-standard texts for a particular domain (could be IRC, tweets, forums etc.)
- Translate them with Apertium
- Come up with examples of non-standard features that affect translation quality
- Propose ways in which these problems might be solved
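One way to start on the challenge is to run each corpus line through the `apertium` command-line tool and collect the words it could not analyse, which Apertium marks with a leading `*` by default. The sketch below assumes `apertium` is installed with the `en-es` pair (any installed pair would do); the function names `translate`, `unknown_words` and `report` are hypothetical helpers, not part of Apertium.

```python
import subprocess


def translate(text, pair="en-es"):
    """Pipe text through the apertium CLI (assumes the pair is installed)."""
    result = subprocess.run(
        ["apertium", pair],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def unknown_words(translation):
    """Return the tokens that carry Apertium's '*' unknown-word mark."""
    return [
        tok.lstrip("*").strip(".,!?;:")
        for tok in translation.split()
        if tok.startswith("*")
    ]


def report(corpus, translate_fn=translate):
    """For each source line, return (source, translation, unknown words)."""
    rows = []
    for line in corpus:
        out = translate_fn(line)
        rows.append((line, out, unknown_words(out)))
    return rows
```

Sorting the unknown words by frequency gives a rough first list of non-standard forms that hurt translation. Forms that are analysed but mistranslated will not show up this way and still need manual inspection.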
Tasks
- Do a literature review of papers on normalisation of input.