Automatic text normalisation

From Apertium
Revision as of 12:55, 23 March 2014 by Ksnmi (talk | contribs)

General ideas

  • Diacritic restoration
  • Reduplicated character reduction
    • How can language-specific settings be learned? For example, in English certain consonants can double but others cannot, and the same goes for vowels. Can these be learned by looking at a corpus?
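
One way the corpus question above could be explored is sketched below. It is only an illustration under stated assumptions: the function names (learn_doubles, reduce_repeats), the min_count threshold and the toy word list are all hypothetical and not part of any Apertium module. The idea is to record which characters appear doubled in clean corpus words, then collapse repeated runs to two characters when doubling is attested and to one otherwise.

```python
import re
from collections import Counter


def learn_doubles(corpus_words, min_count=2):
    """Return the characters seen doubled in at least min_count corpus words."""
    counts = Counter()
    for word in corpus_words:
        # Use a set so each word counts at most once per character.
        doubled = {m.group(1) for m in re.finditer(r"(.)\1", word.lower())}
        counts.update(doubled)
    return {ch for ch, n in counts.items() if n >= min_count}


def reduce_repeats(word, doubles):
    """Collapse each run of a repeated character.

    Runs of characters that may double are cut to two, all others to one.
    """
    def shrink(match):
        ch = match.group(1)
        return ch * 2 if ch.lower() in doubles else ch

    return re.sub(r"(.)\1+", shrink, word)


# Toy corpus (hypothetical); a real run would use a large clean word list.
corpus = ["hello", "all", "book", "cool", "happy", "apple"]
doubles = learn_doubles(corpus)
print(reduce_repeats("haaappy", doubles))
```

This only produces a candidate: a run of a character that can double is always cut to two, so "sooo" becomes "soo" rather than "so", and a legitimate double of a rarely doubled character would be wrongly collapsed. A dictionary lookup over both the single and double variants would still be needed to pick the right form.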

Code switching

  • For the language subpart, we could train and keep lists of the most frequently corrected words for each language, and then refer to those lists.
    • This may be too heavy for on-the-fly use (needs discussion)
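
A rough sketch of what such per-language lists might look like follows. The CorrectionCache class, its method names and the max_per_language cap are assumptions made for illustration only; the cap is one possible answer to the concern about weight, since it bounds how many corrections are kept per language.

```python
from collections import Counter, defaultdict


class CorrectionCache:
    """Hypothetical store of the most frequent corrections per language."""

    def __init__(self, max_per_language=1000):
        self.max_per_language = max_per_language
        # language -> Counter of (wrong, right) pairs
        self._counts = defaultdict(Counter)

    def record(self, language, wrong, right):
        """Note that `wrong` was corrected to `right` in `language`."""
        self._counts[language][(wrong, right)] += 1

    def table(self, language):
        """Return the top corrections for a language as a wrong -> right dict."""
        table = {}
        pairs = self._counts.get(language, Counter()).most_common()
        for (wrong, right), _count in pairs:
            if wrong not in table:  # keep the most frequent correction only
                table[wrong] = right
            if len(table) >= self.max_per_language:
                break
        return table

    def lookup(self, word, languages):
        """Return (language, correction) from the first language that knows word."""
        for language in languages:
            correction = self.table(language).get(word)
            if correction is not None:
                return language, correction
        return None


cache = CorrectionCache(max_per_language=2)
cache.record("en", "thx", "thanks")
cache.record("es", "q", "que")
print(cache.lookup("q", ["en", "es"]))
```

In practice the tables would be built once offline and loaded as plain dictionaries, so the run-time cost is a single lookup per word and per candidate language.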

To do list

  • Feed charlifter with n-grams (works best with a trigram model). This would improve diacritic restoration as it currently stands
  • Make a list of the most frequently occurring non-dictionary words; these might be abbreviations
  • Add the most frequently occurring English abbreviations to the list
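
The list of frequent non-dictionary words could be gathered along the following lines. This is a minimal sketch: the function name frequent_unknown_words, the simple \w+ tokenisation and the tiny in-memory dictionary are assumptions, and a real run would use the language's actual dictionary (lowercased) and a large corpus.

```python
import re
from collections import Counter


def frequent_unknown_words(text, dictionary, top_n=10):
    """Return the top_n most frequent tokens that are not in the dictionary."""
    # \w+ keeps digits, so forms such as "gr8" survive as one token.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in dictionary)
    return counts.most_common(top_n)


# Toy input (hypothetical); the dictionary entries are assumed to be lowercase.
text = "lol that was gr8 lol btw thx lol btw"
dictionary = {"that", "was"}
print(frequent_unknown_words(text, dictionary, top_n=2))
```

The resulting candidates would still need manual review, since frequent unknown tokens can be proper names or typos as well as abbreviations.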