Difference between revisions of "Automatic text normalisation"

From Apertium

Revision as of 13:09, 23 March 2014

General ideas

  • Diacritic restoration
  • Reduplicated character reduction
    • How can language-specific settings be learned? In English, for example, certain consonants can double but others cannot, and the same goes for vowels. Can these be learned from a corpus?
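One way the corpus idea above could be tried is sketched below. This is only an illustration under stated assumptions, not existing Apertium code: the names learn_doubles and reduce_runs and the min_count threshold are hypothetical. It records which letters appear doubled in a word list, then collapses longer runs to two letters if the letter is known to double and to one letter otherwise. The result is only a candidate form, so it would still need checking against a dictionary.

```python
import re
from collections import Counter

# A single letter (no digits, no underscore) followed by itself.
DOUBLE = re.compile(r"([^\W\d_])\1")
# A run of two or more copies of the same letter.
RUN = re.compile(r"([^\W\d_])\1+")


def learn_doubles(words, min_count=2):
    """Return the letters seen doubled in at least min_count corpus words."""
    counts = Counter()
    for word in words:
        # Count each letter once per word, however often it doubles there.
        for letter in {m.group(1) for m in DOUBLE.finditer(word.lower())}:
            counts[letter] += 1
    return {letter for letter, n in counts.items() if n >= min_count}


def reduce_runs(token, doubles):
    """Collapse letter runs: keep two if the letter may double, else one."""
    def shrink(match):
        letter = match.group(1)
        return letter * 2 if letter.lower() in doubles else letter
    return RUN.sub(shrink, token)


corpus = ["hello", "ball", "cool", "book", "tree", "see"]
doubles = learn_doubles(corpus)
print(reduce_runs("coooool", doubles))
print(reduce_runs("yessss", doubles))
```

With a word list this small, the learned set is of course unreliable; a real corpus and a tuned threshold would be needed before drawing conclusions about a language.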

Code switching

  • For the language-specific part, we could train and keep lists of the most frequently corrected words across languages, then refer to those lists.
    • This may be too heavy for on-the-fly use (needs discussion).
  • Is it possible to identify sub-spans of text by language? For example:
    • LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!
    • [en LOL rte showin dáil in irish 4] [ga seachtan na gaeilge, an ceann comhairle] [en hasnt a scooby wots bein sed! his face is classic ha!]
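The bracketing above could be approximated with a simple word-list tagger, sketched here as a starting point for discussion. Everything in it is an assumption: the function names, the default language, and above all the two toy word lists, which were hand-picked from the example (including noisy forms such as "hasnt") so that the sketch reproduces the bracketing shown. A token found in exactly one list takes that language; any other token inherits the language of the token before it.

```python
from itertools import groupby

# Toy word lists, invented for this illustration only.
LEXICONS = {
    "en": {"in", "irish", "hasnt", "a", "his", "face", "is", "classic"},
    "ga": {"seachtan", "na", "gaeilge", "an", "ceann", "comhairle"},
}


def tag_tokens(tokens, lexicons, default="en"):
    """Tag each token with a language code, carrying the last tag forward."""
    tags, current = [], default
    for token in tokens:
        word = token.lower().strip(".,!?")
        hits = [lang for lang, words in lexicons.items() if word in words]
        if len(hits) == 1:
            # Unambiguous hit: switch to that language.
            current = hits[0]
        tags.append(current)
    return tags


def bracket_spans(text, lexicons, default="en"):
    """Return the text with each same-language run wrapped as [lang ...]."""
    tokens = text.split()
    tags = tag_tokens(tokens, lexicons, default)
    spans = []
    for lang, group in groupby(zip(tags, tokens), key=lambda pair: pair[0]):
        spans.append("[%s %s]" % (lang, " ".join(tok for _, tok in group)))
    return " ".join(spans)


tweet = ("LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann "
         "comhairle hasnt a scooby wots bein sed! his face is classic ha!")
print(bracket_spans(tweet, LEXICONS))
```

Carrying the previous tag forward is a crude choice: with realistic dictionaries, words shared between languages (such as "an" or "is") and unknown words at span boundaries would need a better model, for example character n-gram scores per language.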

To do list

  • Feed charlifter with n-grams (it works best with a trigram model). This should improve the current diacritic restoration.
  • Make a list of the most frequently occurring non-dictionary words; these might be abbreviations.
  • Add the most frequently occurring English abbreviations to the list.
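The list of frequent non-dictionary words could be collected along the following lines. This is a minimal sketch: the function name frequent_unknown_words, the tokenisation rule and the in-memory dictionary set are all assumptions, and a real run would load the dictionary from whatever word list or analyser is available.

```python
import re
from collections import Counter


def frequent_unknown_words(lines, dictionary, top_n=10):
    """Return the top_n most common tokens that are not in the dictionary."""
    counts = Counter()
    for line in lines:
        for word in re.findall(r"\w+", line.lower()):
            # Skip pure numbers; keep mixed forms such as "gr8".
            if word.isdigit() or word in dictionary:
                continue
            counts[word] += 1
    return counts.most_common(top_n)


sample = ["lol that was gr8 lol", "omg lol 2 much"]
known = {"that", "was", "much"}
for word, count in frequent_unknown_words(sample, known):
    print(word, count)
```

The resulting candidates would still need manual review, since misspellings and names will appear alongside genuine abbreviations.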