Automatic text normalisation
Latest revision as of 13:37, 23 March 2014
General ideas
- Diacritic restoration
- Reduplicated character reduction
- How to learn language-specific settings? E.g. in English certain consonants can double but others cannot, and the same goes for vowels. Can we learn these by looking at a corpus?
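
One way the corpus question above might be approached, as a rough sketch only: count, for each character, how many distinct words contain it doubled, and treat a character as "doublable" once it passes a threshold. The function name `learn_doubled_chars`, the `min_count` threshold and the toy corpus are all assumptions made for illustration; a real corpus would need a much higher threshold to filter out noise.

```python
import re
from collections import Counter


def learn_doubled_chars(words, min_count=2):
    """Return the characters seen doubled in at least `min_count` distinct words.

    Hypothetical helper: both the name and the threshold are illustrative.
    """
    counts = Counter()
    for word in {w.lower() for w in words}:
        # (\w)\1 matches any word character immediately followed by itself.
        doubled_here = {m.group(1) for m in re.finditer(r"(\w)\1", word)}
        for ch in doubled_here:
            counts[ch] += 1  # counted once per distinct word
    return {ch for ch, n in counts.items() if n >= min_count}


# Toy English corpus, purely illustrative.
corpus = "the ball fell off the wall and the bee will see a tall tree".split()
doublable = learn_doubled_chars(corpus)
```

Here "l" and "e" pass the threshold, while "f" (seen doubled only in "off") does not. Counting per distinct word rather than per token keeps one very frequent word from dominating the result.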
Code switching
- For the language subpart, we could train and keep copies of the most frequently corrected words across languages and then refer to that list.
  - This may be too heavy for the application at run time (needs discussion).
- Is it possible to identify sub-spans of text? E.g.
  - LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!
  - [en LOL rte showin dáil in irish 4] [ga seachtan na gaeilge, an ceann comhairle] [en hasnt a scooby wots bein sed! his face is classic ha!]
- Ideas: You will rarely have single-word spans alternating X-Y-X-Y-X-Y, e.g. "la family está in la house." Something like "la família está in the house." is probably a more frequent structure.
  - So we can probably do this, to a certain extent, left to right in a single pass.
  - We probably shouldn't treat a single word as code switching, but perhaps a span of 2-3+ words.
  - It's like a state machine: you are in state "en", you see something that makes you flip to state "ga", then you see another thing that makes you flip back to state "en".
  - It could also be that at some point you are not sure, in which case you should keep both options open, e.g. you would keep adding to en/ga.
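
The state-machine idea could be sketched roughly as follows. This is a minimal illustration, not an implementation from this project: the function `tag_spans`, the `min_span` parameter and the tiny hand-made lexicons are all assumptions. A token counts as evidence for a language only if it appears in exactly one lexicon; the state flips once `min_span` such tokens for the other language have been seen, and uncertain tokens in between are held back so that both options stay open.

```python
def tag_spans(tokens, lexicons, start="en", min_span=2):
    """Greedy left-to-right segmentation of tokens into (language, text) spans."""

    def guess(tok):
        word = tok.lower().strip(",.!?")
        langs = [lang for lang, words in lexicons.items() if word in words]
        # None means unknown or ambiguous (found in several lexicons).
        return langs[0] if len(langs) == 1 else None

    spans = [(start, [])]
    pending, pending_lang, votes = [], None, 0  # tokens that may open a new span
    for tok in tokens:
        current = spans[-1][0]
        lang = guess(tok)
        if lang is None:
            # Not sure: keep it with the pending span if one is open.
            (pending if pending else spans[-1][1]).append(tok)
        elif lang == current:
            # False alarm: the pending tokens belong to the current span.
            spans[-1][1].extend(pending + [tok])
            pending, pending_lang, votes = [], None, 0
        else:
            if pending and pending_lang != lang:
                spans[-1][1].extend(pending)
                pending, votes = [], 0
            pending_lang = lang
            pending.append(tok)
            votes += 1
            if votes >= min_span:
                # Enough evidence: flip state and open a new span.
                spans.append((lang, pending))
                pending, pending_lang, votes = [], None, 0
    spans[-1][1].extend(pending)
    return [(lang, " ".join(toks)) for lang, toks in spans if toks]


# Toy lexicons, purely illustrative; "a" and "an" are deliberately ambiguous.
lexicons = {
    "en": {"lol", "in", "irish", "hasnt", "a", "an", "his", "face", "is", "classic", "ha"},
    "ga": {"dáil", "seachtan", "na", "gaeilge", "an", "a", "ceann", "comhairle"},
}
text = ("LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle "
        "hasnt a scooby wots bein sed! his face is classic ha!")
spans = tag_spans(text.split(), lexicons)
```

With these toy lexicons the result matches the bracketed segmentation in the example above: the lone "dáil" is absorbed into the surrounding English span, and the unknown tokens after "hasnt" are held until "his" confirms the switch back to English.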
Literature
- http://aclweb.org/anthology/C/C12/C12-2029.pdf
- http://www.academia.edu/3042310/Detection_of_language_boundary_in_code-switching_utterances_by_biphone_probabilities
- http://aclweb.org/anthology//D/D08/D08-1102.pdf
- http://aclweb.org/anthology//C/C82/C82-1023.pdf
- Mike Rosner... "A tagging algorithm for mixed language identification in a noisy domain"
To do list
- Feed charlifter with n-grams (works best with a trigram model). This would improve the current diacritic restoration.
- Make a list of the most frequently occurring non-dictionary words (ga); these might be abbreviations. Check for these words.
- Add the most frequently occurring English abbreviations to the list.
- From comments: tu -> tú, not tuilleadh.
- Change some_known capitals for different languages.
- Suggestions for including spelling correction:
  - Example: Taisbeánta should be Taispeána.
  - Repetitions such as haha and hehe can be included for this as well.
  - thought for such a single repetition
  - Should all replacements go through n-gram verification?
  - Words like CAP REM mean should stay the same. I'm over-reaching because of the trie implementation; need to weigh down.
  - Scope for addition of rules, e.g. vowels are not repeated.
  - mhoiiiiiilllllll -> mhoill is the correct form, but I got mhoil. Will have to look in the n-gram model for this.
  - The only characters which get repeated are ll, nn and rr.
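
The mhoill case could be handled along these lines. This is a sketch under stated assumptions: runs of a repeated character collapse to a single character, except that l, n and r may also stay doubled, and a plain word list stands in for the n-gram verification mentioned above. The names `candidates`, `reduce_repeats` and `known_words` are hypothetical.

```python
import itertools
import re

# Assumption taken from the notes above: only ll, nn and rr occur doubled.
DOUBLABLE = {"l", "n", "r"}


def candidates(word, doublable=DOUBLABLE):
    """Generate all reductions of a word with reduplicated characters."""
    options = []
    for m in re.finditer(r"(.)\1*", word):  # one match per run of a character
        ch, run = m.group(1), m.group(0)
        if len(run) >= 2 and ch in doublable:
            options.append([ch, ch * 2])  # keep both the single and double form
        else:
            options.append([ch])  # everything else collapses to one character
    return ["".join(parts) for parts in itertools.product(*options)]


def reduce_repeats(word, known_words):
    """Pick the first candidate found in known_words, else the shortest form."""
    cands = candidates(word)
    for cand in cands:
        if cand in known_words:
            return cand
    return cands[0]


known_words = {"mhoill"}  # toy stand-in for a dictionary or n-gram model
fixed = reduce_repeats("mhoiiiiiilllllll", known_words)
```

For mhoiiiiiilllllll the candidates are mhoil and mhoill; the lookup selects mhoill, whereas without any verification step the shortest form mhoil would be returned, which is the error described above.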