Difference between revisions of "Automatic text normalisation"

Latest revision as of 13:37, 23 March 2014

Diacritic restoration
Reduplicated character reduction
- How do we learn language-specific settings? E.g. in English certain consonants can double but others cannot, and the same goes for vowels. Can we learn these by looking at a corpus?
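A minimal sketch of the corpus idea, assuming a reasonably clean word list for the language is available (e.g. a dictionary or edited text). The function name `learn_doubled_chars` and the `min_count` threshold are hypothetical, chosen only for illustration. It counts letters that occur exactly doubled and keeps those seen often enough:

```python
import re
from collections import Counter


def learn_doubled_chars(words, min_count=2):
    """Return the set of letters seen exactly doubled in at least min_count words."""
    counts = Counter()
    for word in words:
        # Find every run of two or more identical characters.
        for match in re.finditer(r"(.)\1+", word.lower()):
            # Runs of three or more are treated as noise, not as evidence.
            if len(match.group(0)) == 2 and match.group(1).isalpha():
                counts[match.group(1)] += 1
    return {ch for ch, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    sample = ["hello", "balloon", "letter", "book", "aaargh", "running", "inn", "tall"]
    print(sorted(learn_doubled_chars(sample)))
```

On noisy text the threshold would need to be much higher, or relative to how often the letter occurs at all; that choice is left open here.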

For the language-specific part, we could train on and keep lists of the most frequently corrected words for each language, and then refer to those lists.
- Maybe this will be too heavy for an on-the-fly application (needs discussion)
Is it possible to identify sub-spans of text in different languages? E.g.
- LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!
- [en LOL rte showin dáil in irish 4] [ga seachtan na gaeilge, an ceann comhairle] [en hasnt a scooby wots bein sed! his face is classic ha!]
Ideas: You will rarely have single-word spans alternating X-Y-X-Y-X-Y, e.g. "la family está in la house." Something like "la família está in the house." is probably a more frequent structure.
- So we can probably do this, to a certain extent, left to right in a single pass.
- We probably shouldn't treat a single word as code-switching, but perhaps a span of 2-3+ words.
- It's like a state machine: you are in state "en", and you see something that makes you flip to state "ga", then you see another thing that makes you flip back to state "en".
- It could also be that at some point you are not sure, in which case you should keep both options open, e.g. you would keep adding to en/ga.
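The state-machine idea above could be sketched roughly as follows. This is only an illustration under strong assumptions: the lexicons are tiny hypothetical word sets, `guess` and `segment` are made-up names, and "keeping both options open" is approximated by holding undecided tokens in a pending buffer until enough evidence arrives. A switch is committed only after `min_span` unambiguous words of the other language.

```python
def guess(token, lexicons):
    """Return the single language whose lexicon has the token, else None."""
    word = token.lower().strip(".,!?")
    langs = {lang for lang, vocab in lexicons.items() if word in vocab}
    return next(iter(langs)) if len(langs) == 1 else None


def segment(tokens, lexicons, min_span=2):
    """Single left-to-right pass; returns a list of (language, text) spans."""
    spans = []                                   # each entry: [lang, [tokens]]
    state = None                                 # language of the current span
    pending, pending_lang, hits = [], None, 0    # undecided tokens and evidence
    for token in tokens:
        lang = guess(token, lexicons)
        if lang is None:
            # Unknown or ambiguous: wait if a switch is being considered.
            if pending or state is None:
                pending.append(token)
            else:
                spans[-1][1].append(token)
        elif state is None:
            spans.append([lang, pending + [token]])
            state, pending = lang, []
        elif lang == state:
            # Evidence for the current state cancels any pending switch.
            spans[-1][1].extend(pending + [token])
            pending, pending_lang, hits = [], None, 0
        else:
            if lang != pending_lang:
                spans[-1][1].extend(pending)
                pending, pending_lang, hits = [], lang, 0
            pending.append(token)
            hits += 1
            if hits >= min_span:                 # enough evidence: flip state
                spans.append([lang, pending])
                state = lang
                pending, pending_lang, hits = [], None, 0
    if pending:                                  # leftovers at end of input
        if spans:
            spans[-1][1].extend(pending)
        else:
            spans.append(["und", pending])
    return [(lang, " ".join(toks)) for lang, toks in spans]


# Toy lexicons, invented for this example only.
LEXICONS = {
    "en": {"lol", "in", "irish", "hasnt", "a", "an", "his", "face", "is", "classic", "ha"},
    "ga": {"dáil", "seachtan", "na", "gaeilge", "a", "an", "is", "ceann", "comhairle"},
}

if __name__ == "__main__":
    text = ("LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle "
            "hasnt a scooby wots bein sed! his face is classic ha!")
    for lang, span in segment(text.split(), LEXICONS):
        print(f"[{lang} {span}]")
```

With these toy lexicons the example tweet splits into the same three spans as the bracketed version above; the single word "dáil" stays inside the English span because one word is not enough to flip the state. A real system would replace the lexicon lookup with per-language n-gram scores.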

Feed charlifter with n-grams (works best with a trigram model). This should improve the current diacritic restoration.
Make a list of the most frequently occurring non-dictionary words (ga); these might be abbreviations. Check for these words.
Add the most frequently occurring English abbreviations to the list.
From comments: tu -> tú, not tuilleadh
Change some_known capitals for different languages
Suggestions for including spelling correction
- Example: Taisbeánta should be Taispeána
- Repetitions such as haha, hehe can be included for this as well
- Needs thought for a single such repetition
- Should all replacements go through n-gram verification?
- Words like CAP REM mean should stay the same. I'm over-reaching because of the trie implementation; need to weigh this down.
- Scope for addition of rules, e.g. vowels are not repeated
- mhoiiiiiilllllll -> mhoill is the correct form, but I got mhoil. Will have to look in the n-gram model for this.
- The only characters which get doubled are ll, nn, rr
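The last few points can be combined into a small sketch: collapse every repeated run to one character, but for l, n and r also generate the doubled variant, then let a frequency lookup choose. The names `candidates` and `reduce_repeats` are hypothetical, and the `freq` dictionary is a stand-in for whatever the n-gram model would provide; this is not the existing trie implementation.

```python
from itertools import groupby, product

# Assumption taken from the notes: only ll, nn, rr occur doubled; vowels never do.
DOUBLABLE = {"l", "n", "r"}


def candidates(word, doublable=DOUBLABLE):
    """List every reduction of the word allowed by the doubling rule."""
    options = []
    for ch, run in groupby(word):
        length = len(list(run))
        if length >= 2 and ch in doublable:
            options.append([ch, ch * 2])   # could be single or double
        else:
            options.append([ch])           # always collapse to one
    # The first candidate is always the fully collapsed form.
    return ["".join(parts) for parts in product(*options)]


def reduce_repeats(word, freq):
    """Pick the most frequent known candidate, else the fully collapsed form."""
    cands = candidates(word)
    known = [c for c in cands if c in freq]
    if known:
        return max(known, key=freq.get)
    return cands[0]


if __name__ == "__main__":
    toy_freq = {"mhoill": 5}   # hypothetical unigram counts
    print(candidates("mhoiiiiiilllllll"))
    print(reduce_repeats("mhoiiiiiilllllll", toy_freq))
```

Without the frequency lookup the fallback is the fully collapsed form, which is exactly the mhoil result noted above. Words that contain no repeated run yield a single candidate identical to the input, so they are left unchanged.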

@@ Line 1: / Line 1: @@
+{{TOCD}}
 ==General ideas==
@@ Line 19: / Line 20: @@
 ** It could also be that at some point you are not sure, so what you should do is keep both options open, e.g. you would keep adding to en/ga.
+===Literature===
+* http://aclweb.org/anthology/C/C12/C12-2029.pdf
+* http://www.academia.edu/3042310/Detection_of_language_boundary_in_code-switching_utterances_by_biphone_probabilities
+* http://aclweb.org/anthology//D/D08/D08-1102.pdf
+* http://aclweb.org/anthology//C/C82/C82-1023.pdf
+* Mike Rosner... "A tagging algorithm for mixed language identification in a noisy domain"
 ==To do list==
@@ Line 35: / Line 42: @@
 ** Scope for addition of rules... vowels are not repeated
 ** mhoiiiiiilllllll -> mhoill is the correct form.. I got mhoil.. Will have to look in the ngram model for this
+** only characters which get repeated are ll nn rr
 [[Category:Development]]