Ideas for Google Summer of Code/Improving support for non-standard text input

From Apertium

Latest revision as of 12:50, 10 March 2014

Create a module that will standardise non-standard input, for example slang and abbreviations, before it is passed to the translator.

Some examples from English

  • Extra space: "he he" (hehe)
  • Spacing and hyphen variation: no-one, noone, no one
  • Optional hyphen: re-integrate, reintegrate
  • Missing apostrophe: shes thinking about it
  • Non-standard capitalisation: im thinking about it
  • Abbreviated words: fav (favourite)
  • Emoticons: :)
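
A standardisation module for the examples above could start as a simple lookup-and-replace pass. The following is a minimal sketch, not a proposed design: the `REPLACEMENTS` table and the `normalise` function are hypothetical names, the entries are taken from the list above (the expansion of "fav" is an assumption), and emoticons are simply left untouched.

```python
import re

# Hypothetical lookup table: non-standard form -> standard form.
# Entries mirror the English examples above; a real module would need a
# much larger, language-specific resource.
REPLACEMENTS = {
    "he he": "hehe",
    "noone": "no one",
    "no-one": "no one",
    "re-integrate": "reintegrate",
    "shes": "she's",
    "im": "I'm",
    "fav": "favourite",  # assumed expansion
}

# Longest keys first, so multi-word and hyphenated entries are tried
# before shorter ones. The lookarounds stop matches inside longer words
# (e.g. "im" inside "time").
_KEYS = sorted(REPLACEMENTS, key=len, reverse=True)
_PATTERN = re.compile(
    r"(?<![\w'-])(" + "|".join(re.escape(k) for k in _KEYS) + r")(?![\w'-])",
    re.IGNORECASE,
)


def normalise(text):
    """Replace every known non-standard form in `text` with its standard form."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1).lower()], text)


if __name__ == "__main__":
    print(normalise("im thinking about it"))
    print(normalise("noone said he he :)"))
```

A plain lookup like this cannot resolve ambiguity: "IM" may mean "instant message", and "he he" can occur legitimately. That is one reason the literature on normalisation is worth reviewing before settling on an approach.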

Coding challenge

  • Make a test corpus of non-standard texts for a particular domain (could be IRC, tweets, forums etc.)
  • Translate them with Apertium
  • Come up with examples of non-standard features that affect translation quality
  • Propose ways in which these problems might be solved.
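
One way to approach the challenge is to pipe each corpus line through the `apertium` command-line tool and collect the words it could not analyse, which Apertium marks with a leading `*` in its default output. The sketch below assumes that `apertium` is on the `PATH` and that the `en-es` pair is installed; the function names and the choice of pair are illustrative only.

```python
import re
import subprocess


def translate(text, pair="en-es"):
    """Translate `text` by piping it through the `apertium` command.

    Assumes the `apertium` executable and the given language pair are
    installed; `en-es` is only an example.
    """
    result = subprocess.run(
        ["apertium", pair],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Apertium prefixes words it could not analyse with an asterisk.
_UNKNOWN = re.compile(r"\*(\w[\w'-]*)")


def unknown_words(translated):
    """Return the words marked as unknown in Apertium output."""
    return _UNKNOWN.findall(translated)


def unknown_rate(source, translated):
    """Unknown words in the output per whitespace-separated source token."""
    tokens = source.split()
    return len(unknown_words(translated)) / len(tokens) if tokens else 0.0
```

Comparing the unknown-word rate on the non-standard corpus with the rate on a hand-normalised copy of it gives a rough first indication of which features hurt most. It does not capture words that are analysed but mistranslated, so some manual inspection is still needed.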

Tasks

  • Do a literature review of papers on normalisation of input.

Frequently asked questions

See also