Difference between revisions of "Automatic text normalisation"

Revision as of 13:09, 23 March 2014

Diacritic restoration
Reduplicated character reduction
- How can language-specific settings be learned? For example, in English certain consonants can double but others cannot, and the same goes for vowels. Can we learn these by looking at a corpus?
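
One possible way to approach the corpus question above, as a minimal sketch only: count which letters appear doubled in a list of known-good words, then collapse longer runs in noisy tokens to two letters if the letter can double and to one otherwise. The function names `learn_doubles` and `reduce_runs`, the `min_count` threshold, and the tiny word list are all assumptions made for illustration, not part of any existing tool.

```python
import re
from collections import Counter

# Matches one letter (any script, no digits or underscore) followed by itself.
LETTER_PAIR = re.compile(r"([^\W\d_])\1")
LETTER_RUN = re.compile(r"([^\W\d_])\1+")

def learn_doubles(words, min_count=2):
    """Return the letters seen doubled at least min_count times in the corpus."""
    counts = Counter()
    for word in words:
        for match in LETTER_PAIR.finditer(word.lower()):
            counts[match.group(1)] += 1
    return {ch for ch, n in counts.items() if n >= min_count}

def reduce_runs(token, doubles):
    """Collapse each repeated-letter run to two letters if the letter can double, else one."""
    def shrink(match):
        ch = match.group(1)
        return ch * 2 if ch.lower() in doubles else ch
    return LETTER_RUN.sub(shrink, token)

# Hypothetical toy corpus; a real run would use a large word list per language.
corpus = ["letter", "better", "happy", "apple", "book", "look", "see"]
doubles = learn_doubles(corpus)
print(reduce_runs("haaappyyy", doubles))
print(reduce_runs("loooook", doubles))
```

A run such as "sooo" would come out as "soo" under this scheme, so the output is better treated as a candidate that still needs a dictionary check than as a final correction.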

For the language subpart, we could train and keep lists of the most frequently corrected words for each language, and then refer to those lists.
- Maybe this will be too heavy for an on-the-fly application (needs discussion)
Is it possible to identify sub-spans of text by language? For example:
- LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!
- [en LOL rte showin dáil in irish 4] [ga seachtan na gaeilge, an ceann comhairle] [en hasnt a scooby wots bein sed! his face is classic ha!]
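
The bracketed output above could be produced by something like the following sketch, which is only one naive option: look each token up in small per-language word lists, let unknown tokens inherit the label of the preceding token, and merge neighbouring tokens with the same label. The function `tag_spans`, the `default` label, and both word lists are hypothetical and chosen just to cover this example; a real system would more likely use a trained language identifier.

```python
def tag_spans(text, lexicons, default="en"):
    """Group whitespace tokens into language-labelled spans using word-list lookup."""
    spans = []
    current = default
    for token in text.split():
        key = token.strip(".,!?").lower()
        hits = [lang for lang, words in lexicons.items() if key in words]
        # Only switch language when exactly one word list claims the token;
        # unknown or ambiguous tokens keep the label of the previous token.
        if len(hits) == 1:
            current = hits[0]
        if spans and spans[-1][0] == current:
            spans[-1][1].append(token)
        else:
            spans.append((current, [token]))
    return " ".join(f"[{lang} {' '.join(tokens)}]" for lang, tokens in spans)

# Hypothetical toy word lists, far smaller than anything usable in practice.
lexicons = {
    "en": {"in", "irish", "hasnt", "a", "his", "face", "is", "classic"},
    "ga": {"seachtan", "na", "gaeilge", "an", "ceann", "comhairle"},
}
tweet = ("LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle "
         "hasnt a scooby wots bein sed! his face is classic ha!")
print(tag_spans(tweet, lexicons))
```

Because unknown tokens attach to whatever came before them, a noisy word at the start of a new span would be mislabelled; that is a known weakness of this sketch.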

To do list
Feed charlifter with n-grams (works best with a trigram model). This should improve diacritic restoration.
Make a list of the most frequently occurring non-dictionary words; these might be abbreviations.
Add the most frequently occurring English abbreviations to the list.
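
The item about non-dictionary words could be prototyped roughly as follows. This is a sketch under stated assumptions: `frequent_oov` is a hypothetical helper, the dictionary is assumed to be a plain set of lower-cased word forms, and tokenisation is a simple regular expression rather than anything tuned for tweets.

```python
import re
from collections import Counter

def frequent_oov(texts, dictionary, top_n=10):
    """Return the top_n most frequent tokens that are not in the dictionary."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"\w+", text.lower()):
            # Skip pure numbers; they are out of vocabulary but not abbreviations.
            if token not in dictionary and not token.isdigit():
                counts[token] += 1
    return counts.most_common(top_n)

# Hypothetical toy data; real input would be a large collection of messages.
messages = ["lol that was great", "lol brb", "brb lol 4 real"]
dictionary = {"that", "was", "great", "real"}
print(frequent_oov(messages, dictionary, top_n=2))
```

The resulting candidates would still need manual review, since frequent out-of-vocabulary tokens also include names, hashtags, and plain misspellings.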

@@ Line 10: / Line 10: @@
 * For the language subpart... we can actually train and keep copies of most frequently corrected words across languages and then refer to that list...
 ** Maybe this will be too heavy for the on the run application ( needs discussion )
+* Is it possible to identify sub-spans of text ? e.g.
+** LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!
+** '''[en''' LOL rte showin dáil in irish 4'''] [ga''' seachtan na gaeilge, an ceann comhairle'''] [en''' hasnt a scooby wots bein sed! his face is classic ha!''']'''
 ==To do list==
 *Feed charlifter with n-grams ( works best with a trigram model ). This would improve the diacritics at the moment
 *Make list of most frequently occurring non dictionary words, these might be abbreviations..
-*add most frequently occuring english abbreviations to the list
+*add most frequently occurring english abbreviations to the list
+[[Category:Development]]