Automatic text normalisation

From Apertium
Revision as of 13:32, 23 March 2014 by Ksnmi (talk | contribs)

General ideas

  • Diacritic restoration
  • Reduplicated character reduction
    • How can we learn language-specific settings? E.g. in English certain consonants can double but others cannot, and the same goes for vowels. Can we learn these by looking at a corpus?
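
One way the corpus question above might be approached, as a minimal sketch: count which characters appear doubled in a clean word list, then collapse longer runs to two copies if the character is known to double and to one copy otherwise. The function names `learn_doublable` and `reduce_runs` and the `min_count` threshold are hypothetical, not part of any existing Apertium code, and the word list is assumed to be correctly spelled.

```python
import re
from collections import Counter


def learn_doublable(corpus_words, min_count=2):
    """Return characters seen as a run of exactly two at least min_count times."""
    counts = Counter()
    for word in corpus_words:
        # (.)\1+ matches each maximal run of a repeated character.
        for match in re.finditer(r"(.)\1+", word.lower()):
            if len(match.group(0)) == 2:
                counts[match.group(1)] += 1
    return {ch for ch, n in counts.items() if n >= min_count}


def reduce_runs(word, doublable):
    """Collapse runs of 3+ identical characters: to 2 if the character may double, else to 1."""
    def collapse(match):
        ch = match.group(1)
        return ch * 2 if ch.lower() in doublable else ch
    return re.sub(r"(.)\1{2,}", collapse, word)


# Toy corpus, for illustration only.
corpus = ["hello", "small", "coffee", "off", "book", "cool", "see", "tree"]
doublable = learn_doublable(corpus)
print(sorted(doublable))
print(reduce_runs("waaaay", doublable))
print(reduce_runs("helllloooo", doublable))
```

With this toy corpus "waaaay" reduces to "way", but "helllloooo" reduces to "helloo", because "o" can also double. The learned settings narrow the options; picking between the remaining forms still needs a dictionary or n-gram check.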

Code switching

  • For the language subpart, we can train and keep lists of the most frequently corrected words across languages and then refer to those lists.
    • This may be too heavy for an on-the-fly application (needs discussion).
  • Is it possible to identify sub-spans of text? E.g.
    • LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!
    • [en LOL rte showin dáil in irish 4] [ga seachtan na gaeilge, an ceann comhairle] [en hasnt a scooby wots bein sed! his face is classic ha!]
  • Ideas: Single-word spans alternating X-Y-X-Y-X-Y will be rare, e.g. "la family está in la house." Something like "la família está in the house." is probably a more frequent structure.
    • So we can probably do this to a certain extent left to right (LR) in a single pass.
    • We probably shouldn't count a single word as code switching, but perhaps a span of 2-3+ words.
    • It's like a state machine, you are in state "en", and you see something that makes you flip to state "ga", then you see another thing that makes you flip to state "en".
    • It could also be that at some point you are not sure, so what you should do is keep both options open, e.g. you would keep adding to en/ga.
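
The state-machine idea above could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name `segment`, the per-language word sets, the `min_span` threshold, and the requirement that the starting language is given. Words that are unknown or fit the current state stay in the current span. Words that contradict it are held as pending, and the machine only flips once `min_span` such words arrive in a row. While several other languages remain possible for the pending words, all of them are kept open.

```python
def segment(tokens, lexicons, start_lang, min_span=2):
    """Single left-to-right pass; returns a list of (lang, tokens) spans."""
    spans = [[start_lang, []]]
    pending = []           # tokens that contradict the current state, not yet committed
    pending_langs = set()  # languages still consistent with every pending token
    for tok in tokens:
        word = tok.lower().strip(".,!?")
        langs = {lang for lang, lex in lexicons.items() if word in lex}
        state = spans[-1][0]
        if not langs or state in langs:
            # Unknown word, or one that fits the current state:
            # any pending tokens were too few to justify a flip, so absorb them.
            spans[-1][1].extend(pending)
            pending, pending_langs = [], set()
            spans[-1][1].append(tok)
            continue
        # The token contradicts the current state; keep all options open.
        new_langs = pending_langs & langs if pending else langs
        if not new_langs:
            # Pending tokens disagree with this one: absorb them and restart.
            spans[-1][1].extend(pending)
            pending, new_langs = [], langs
        pending.append(tok)
        pending_langs = new_langs
        if len(pending) >= min_span:
            # Enough evidence: flip state (ties broken alphabetically).
            spans.append([sorted(pending_langs)[0], pending])
            pending, pending_langs = [], set()
    spans[-1][1].extend(pending)
    return [(lang, toks) for lang, toks in spans if toks]


# Toy lexicons, for illustration only.
lexicons = {
    "es": {"la", "família", "está", "casa"},
    "en": {"in", "the", "house", "family"},
}
print(segment("la família está in the house.".split(), lexicons, "es"))
print(segment("la family está in la house.".split(), lexicons, "es"))
```

With these toy lexicons the first sentence splits into an "es" span and an "en" span, while the second stays a single "es" span because no other-language run reaches two words.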


To do list

  • Feed charlifter with n-grams (works best with a trigram model). This should improve on the diacritic restoration we have at the moment.
  • Make a list of the most frequently occurring non-dictionary words (ga); these might be abbreviations. Check for these words.
  • Add the most frequently occurring English abbreviations to the list.
  • From comments: tu -> tú, not tuilleadh.
  • Change some_known capitals for different languages.
  • suggestions for including spelling correction
    • Example, Taisbeánta should be Taispeána
    • Repetitions such as haha, hehe can be handled here as well
    • Needs thought: what to do with a single repetition like this
    • should all replacements go through n-gram verification?
    • Words like CAP, REM, mean should stay the same. I'm over-reaching because of the trie implementation; need to weigh it down.
    • Scope for adding rules, e.g. vowels are not repeated.
    • mhoiiiiiilllllll -> mhoill is the correct form, but I got mhoil. Will have to look in the n-gram model for this.
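
For the mhoill case, one possible approach is to generate every shortened variant of the word rather than committing to a single reduction, and then let a frequency list decide. The sketch below uses a plain word-count dictionary as a stand-in for the n-gram model. The names `candidates` and `best_candidate`, the `max_keep` limit, and the counts in `freq` are all made up for illustration.

```python
import itertools
import re


def candidates(word, max_keep=2):
    """All variants with each run of a repeated character cut to 1..max_keep copies."""
    runs = [(m.group(1), len(m.group(0))) for m in re.finditer(r"(.)\1*", word)]
    options = [
        [ch * n for n in range(1, min(length, max_keep) + 1)]
        for ch, length in runs
    ]
    return {"".join(parts) for parts in itertools.product(*options)}


def best_candidate(word, freq):
    """Pick the most frequent known candidate; leave the word alone if none is known."""
    known = [c for c in candidates(word) if c in freq]
    if not known:
        return word
    return max(known, key=lambda c: freq[c])


# Hypothetical counts, standing in for a real (ga) frequency list.
freq = {"mhoill": 12, "moill": 30}
print(sorted(candidates("mhoiiiiiilllllll")))
print(best_candidate("mhoiiiiiilllllll", freq))
print(best_candidate("CAP", freq))
```

Here the candidates are mhoil, mhoill, mhoiil and mhoiill, and only mhoill is in the list, so it wins. A word with no repeated characters such as CAP has itself as its only candidate and is returned unchanged, which avoids the over-correction noted above.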