Automatic text normalisation
Revision as of 13:15, 23 March 2014
General ideas
- Diacritic restoration
- Reduplicated character reduction
- How to learn language-specific settings? E.g. in English certain consonants can double but others cannot, and the same goes for vowels. Can we learn these by looking at a corpus?
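One way the last two ideas could fit together, as a minimal sketch only: learn from a word list which letters are ever seen doubled, then shrink runs of three or more identical letters to two (if that letter can double) or one (if it cannot). The function names `learn_doubles` and `reduce_runs`, the `min_count` threshold, and the tiny corpus are all assumptions made for illustration, not part of any existing tool.

```python
import re
from collections import Counter

# Matches a single letter (no digits, no underscore), Unicode-aware.
LETTER = r"[^\W\d_]"

def learn_doubles(corpus_words, min_count=2):
    """Return letters that appear doubled in at least min_count corpus words."""
    counts = Counter()
    for word in corpus_words:
        # A set, so each word votes at most once per letter.
        doubled = {m.group(1) for m in re.finditer(rf"({LETTER})\1", word.lower())}
        counts.update(doubled)
    return {ch for ch, n in counts.items() if n >= min_count}

def reduce_runs(token, allowed_doubles):
    """Shrink runs of 3+ identical letters to 2 if the letter may double, else to 1."""
    def shrink(match):
        ch = match.group(1)
        return ch * 2 if ch.lower() in allowed_doubles else ch
    return re.sub(rf"({LETTER})\1{{2,}}", shrink, token)

# Hypothetical mini corpus; a real one would be a large word list per language.
corpus = ["letter", "better", "good", "food", "happy", "apple", "skiing"]
doubles = learn_doubles(corpus)           # 't', 'o', 'p' (the 'ii' in "skiing" occurs only once)
print(reduce_runs("gooood", doubles))     # good
print(reduce_runs("heyyy", doubles))      # hey
print(reduce_runs("sooooo", doubles))     # soo
```

The last case shows the limit of this approach: "soo" is only a candidate, and a dictionary lookup would still be needed to choose between "soo" and "so".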
Code switching
- For the language subpart, we could train and keep lists of the most frequently corrected words in each language, and then refer to those lists.
- Maybe this will be too heavy for on-the-fly use (needs discussion).
- Is it possible to identify sub-spans of text? E.g.
- LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!
- [en LOL rte showin dáil in irish 4] [ga seachtan na gaeilge, an ceann comhairle] [en hasnt a scooby wots bein sed! his face is classic ha!]
- Ideas: You will rarely have single-word spans alternating X-Y-X-Y-X-Y, e.g. "la family está in la house."; "la família está in the house." is probably a more frequent structure.
- So we can probably do this, to a certain extent, left to right in a single pass.
- We probably shouldn't treat a single word as code switching, but perhaps a span of 2-3+ words.
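The span idea above could be sketched roughly as follows. This is an illustration under strong assumptions: per-token language guesses come from tiny hypothetical word lists (`lexicons`), unknown tokens inherit the language to their left, and a run shorter than `min_span` is folded into the span before it rather than counted as a switch. For readability it walks left to right over the tokens twice instead of strictly once.

```python
from itertools import groupby

def guess_language(token, lexicons):
    """Return the only language whose word list contains the token, else None."""
    word = token.lower().strip(".,!?")
    hits = [lang for lang, words in lexicons.items() if word in words]
    return hits[0] if len(hits) == 1 else None

def segment(tokens, lexicons, min_span=2):
    """Group tokens into (language, tokens) spans, ignoring switches shorter than min_span."""
    labels = [guess_language(t, lexicons) for t in tokens]
    # Unknown tokens inherit the language to their left;
    # leading unknowns take the first known language.
    prev = next((lab for lab in labels if lab is not None), None)
    for i, lab in enumerate(labels):
        if lab is None:
            labels[i] = prev
        else:
            prev = lab
    spans, pos = [], 0
    for lang, group in groupby(labels):
        n = len(list(group))
        chunk = tokens[pos:pos + n]
        pos += n
        # Fold short runs, and runs that continue the current language, into the last span.
        if spans and (n < min_span or spans[-1][0] == lang):
            spans[-1][1].extend(chunk)
        else:
            spans.append((lang, list(chunk)))
    return spans

# Hypothetical word lists, far too small for real use.
lexicons = {
    "es": {"la", "está", "familia", "casa", "en"},
    "en": {"family", "in", "the", "house"},
}
print(segment("la family está in the house.".split(), lexicons))
# [('es', ['la', 'family', 'está']), ('en', ['in', 'the', 'house.'])]
```

Here the lone "family" does not open an English span, while "in the house." does. One known weakness of this sketch: a short run at the very start of the text has no preceding span to fold into, so it stays as its own span.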
To do list
- Feed charlifter with n-grams (works best with a trigram model). This would improve the current diacritic restoration.
- Make a list of the most frequently occurring non-dictionary words; these might be abbreviations.
- Add the most frequently occurring English abbreviations to the list.
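The second item could start from something as simple as the sketch below, which counts tokens that are missing from a dictionary and returns the most frequent ones as abbreviation candidates. The function name `frequent_unknown_words`, the sample texts, and the two-word dictionary are hypothetical; a real run would use a full word list and a large corpus.

```python
import re
from collections import Counter

def frequent_unknown_words(texts, dictionary, top_n=10):
    """Return the top_n most frequent tokens that are not in the dictionary."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"\w+", text.lower()):
            # Skip plain numbers; keep mixed forms such as "gr8".
            if word.isdigit() or word in dictionary:
                continue
            counts[word] += 1
    return counts.most_common(top_n)

# Hypothetical input, for illustration only.
texts = ["lol that was gr8 lol", "omg lol brb 4 real"]
dictionary = {"that", "was", "real"}
print(frequent_unknown_words(texts, dictionary, top_n=1))
# [('lol', 3)]
```

The resulting candidates would still need manual review before being added to the abbreviation list, since misspellings and proper nouns will also show up.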