Automatic text normalisation
Revision as of 12:55, 23 March 2014
General ideas
- Diacritic restoration
- Reduplicated character reduction
- How can language-specific settings be learned? For example, in English certain consonants can double but others cannot, and the same goes for vowels. Can we learn these rules by looking at a corpus?
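The corpus question above can be sketched as follows. This is only a minimal illustration, not an agreed design: the function names `learn_doubles` and `reduce_reduplication`, the `min_count` threshold, and the toy word list are all assumptions made for the example. Letters seen doubled in enough corpus words are treated as "may double"; runs of any other letter are collapsed to a single character.

```python
import re
from collections import Counter

# A letter followed by itself; [^\W\d_] matches letters only (no digits, no "_").
DOUBLE = re.compile(r"([^\W\d_])\1")
RUN = re.compile(r"([^\W\d_])\1+")


def learn_doubles(corpus_words, min_count=2):
    """Return the letters that appear doubled in at least min_count corpus words.

    min_count is a hypothetical noise threshold, not a tuned value.
    """
    counts = Counter()
    for word in corpus_words:
        # set() so that each letter is counted at most once per word
        for ch in {m.group(1) for m in DOUBLE.finditer(word.lower())}:
            counts[ch] += 1
    return {ch for ch, n in counts.items() if n >= min_count}


def reduce_reduplication(word, allowed_doubles):
    """Collapse each run of a repeated letter to 2 if it may double, else to 1."""
    def collapse(match):
        ch = match.group(1)
        keep = 2 if ch.lower() in allowed_doubles else 1
        return ch * keep
    return RUN.sub(collapse, word)


# Toy corpus (made up for illustration)
doubles = learn_doubles(["letter", "better", "happy", "apple", "good", "food"])
print(sorted(doubles))                              # ['o', 'p', 't']
print(reduce_reduplication("haaappyyyy", doubles))  # happy
print(reduce_reduplication("cooool", doubles))      # cool
```

A known limitation of this sketch: a letter that may double is always reduced to two, so "sooo" becomes "soo" rather than "so". Choosing between the single and double form would need a dictionary or frequency lookup on top of this.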
Code switching
- For the language subpart, we could train on and store lists of the most frequently corrected words across languages, and then refer to those lists.
- This may be too heavy for the on-the-fly application (needs discussion).
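One way to keep such lists small enough for on-the-fly use is to store only the top N corrections per language. The sketch below assumes this approach; the class `CorrectionCache`, its methods, and the `top_n` limit are hypothetical names chosen for illustration and are not part of any existing tool.

```python
from collections import Counter, defaultdict


class CorrectionCache:
    """Per-language table of the most frequently observed corrections."""

    def __init__(self, top_n=1000):
        self.top_n = top_n                   # hypothetical size limit per language
        self._counts = defaultdict(Counter)  # lang -> Counter of (raw, corrected)
        self._tables = {}                    # lang -> {raw: corrected}

    def record(self, lang, raw, corrected):
        """Count one observed correction raw -> corrected for a language."""
        self._counts[lang][(raw, corrected)] += 1

    def build(self):
        """Keep only the top_n most frequent corrections for each language."""
        self._tables = {}
        for lang, counts in self._counts.items():
            table = {}
            for (raw, corrected), _ in counts.most_common(self.top_n):
                # most_common is sorted by frequency, so the first correction
                # seen for a raw form is its most frequent one
                table.setdefault(raw, corrected)
            self._tables[lang] = table

    def lookup(self, raw, langs):
        """Return (lang, corrected) from the first language that knows raw, else None."""
        for lang in langs:
            hit = self._tables.get(lang, {}).get(raw)
            if hit is not None:
                return lang, hit
        return None


cache = CorrectionCache(top_n=2)
cache.record("en", "u", "you")
cache.record("en", "u", "you")
cache.record("en", "u", "your")
cache.record("es", "q", "que")
cache.build()
print(cache.lookup("u", ["es", "en"]))  # ('en', 'you')
```

Because lookups are plain dictionary accesses, the runtime cost is small; the memory cost is bounded by `top_n` times the number of languages, which is the quantity to discuss.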
To do list
- Feed charlifter with n-grams (works best with a trigram model). This should improve the current diacritic restoration.
- Make a list of the most frequently occurring non-dictionary words; these might be abbreviations.
- Add the most frequently occurring English abbreviations to the list.
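The list of frequent non-dictionary words could be produced along the following lines. This is a sketch under assumptions: the function name `frequent_unknown_words`, the simple letters-only tokenisation, and the toy dictionary are made up for illustration, and a real run would use an actual word list and corpus.

```python
import re
from collections import Counter


def frequent_unknown_words(text, dictionary, top_n=10):
    """Return the top_n most frequent lowercased tokens not found in dictionary."""
    # Letters-only tokens; digits, punctuation and "_" act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(t for t in tokens if t not in dictionary)
    return counts.most_common(top_n)


# Toy data (made up for illustration)
dictionary = {"the", "cat", "sat", "dog", "ran", "won"}
text = "btw the cat sat, btw the dog ran. imo the cat won"
print(frequent_unknown_words(text, dictionary))  # [('btw', 2), ('imo', 1)]
```

The output is only a candidate list: frequent unknown tokens may be abbreviations, but they may also be names or misspellings, so the list would still need manual review.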