Difference between revisions of "Automatic text normalisation"

From Apertium
{{TOCD}}


==General ideas==
** Maybe this will be too heavy for the on-the-fly application (needs discussion)
* Is it possible to identify sub-spans of text? E.g.
** ''LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!''
** '''[en''' LOL rte showin dáil in irish 4'''] [ga''' seachtan na gaeilge, an ceann comhairle'''] [en''' hasnt a scooby wots bein sed! his face is classic ha!''']'''
* Ideas: Single-word alternation of the form X-Y-X-Y-X-Y, e.g. "la family está in la house.", will be rare; "la família está in the house." is probably a more frequent structure.
** So we can probably do this to a certain extent left to right (LR) in a single pass.
** We probably shouldn't treat a single word as code switching, but perhaps a span of 2-3+ words.
** It's like a state machine: you are in state "en", and you see something that makes you flip to state "ga"; then you see another thing that makes you flip back to state "en".
** It could also be that at some point you are not sure; in that case you should keep both options open, e.g. keep adding tokens to both en and ga.

===Literature===

* http://aclweb.org/anthology/C/C12/C12-2029.pdf
* http://www.academia.edu/3042310/Detection_of_language_boundary_in_code-switching_utterances_by_biphone_probabilities
* http://aclweb.org/anthology//D/D08/D08-1102.pdf
* http://aclweb.org/anthology//C/C82/C82-1023.pdf
* Mike Rosner... "A tagging algorithm for mixed language identification in a noisy domain"


==To do list==
*Feed charlifter with n-grams (works best with a trigram model). This would improve the current diacritic restoration.
*Make a list of the most frequently occurring non-dictionary words (ga); these might be abbreviations. Check for these words.
*Add the most frequently occurring English abbreviations to the list.
*'''From Comments''' = tu -> tú, not tuilleadh

*change some_known capitals for diff. languages
*suggestions for including spelling correction
**Example, Taisbeánta should be Taispeána
**repetitions haha hehe can be included for this as well
** thought for such a single repetition
** should all replacements go through n-gram verification?
** Words like CAP, REM, mean should stay the same. I'm over-reaching because of the trie implementation; need to weigh this down.
** Scope for adding rules, e.g. vowels are not repeated.
** mhoiiiiiilllllll -> mhoill is the correct form, but I got mhoil. Will have to look in the n-gram model for this.
** The only characters which get repeated are ll, nn, rr.
[[Category:Development]]

Latest revision as of 13:37, 23 March 2014

General ideas

  • Diacritic restoration
  • Reduplicated character reduction
    • How to learn language-specific settings? E.g. in English certain consonants can double but others cannot, and the same goes for vowels. Can we learn these by looking at a corpus?
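
One way to approach the corpus question above, as a rough sketch only: count which letters occur exactly doubled in a word list, and use that set when collapsing longer runs. The function names `learn_doublable` and `squeeze`, the `min_count` threshold and the toy word list are all assumptions made for illustration; nothing here is taken from the existing code.

```python
import re
from collections import Counter

def learn_doublable(words, min_count=2):
    """Return the letters seen exactly doubled at least min_count times."""
    counts = Counter()
    for word in words:
        # Each match is a run of one repeated character, e.g. "ll" in "hello".
        for m in re.finditer(r"(\w)\1+", word.lower()):
            if len(m.group(0)) == 2:
                counts[m.group(1)] += 1
    return {ch for ch, n in counts.items() if n >= min_count}

def squeeze(word, doublable):
    """Collapse runs to two letters if the letter may double, else to one."""
    def repl(m):
        ch = m.group(1)
        return ch * 2 if ch in doublable else ch
    return re.sub(r"(\w)\1+", repl, word)

# Toy corpus (hypothetical); a real run would use a large word list.
corpus = ["hello", "ball", "tell", "letter", "butter", "see", "tree", "cat"]
doublable = learn_doublable(corpus)
print(sorted(doublable))               # ['e', 'l', 't']
print(squeeze("helllllo", doublable))  # hello
print(squeeze("sooooo", doublable))    # so
```

This always keeps a double where one is allowed, so it cannot decide between a single and a double letter on its own; that choice would still need a dictionary or n-gram check.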

Code switching

  • For the language subpart, we can train and keep a list of the most frequently corrected words across languages, and then refer to that list.
    • Maybe this will be too heavy for the on-the-fly application (needs discussion)
  • Is it possible to identify sub-spans of text? E.g.
    • LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!
    • [en LOL rte showin dáil in irish 4] [ga seachtan na gaeilge, an ceann comhairle] [en hasnt a scooby wots bein sed! his face is classic ha!]
  • Ideas: Single-word alternation of the form X-Y-X-Y-X-Y, e.g. "la family está in la house.", will be rare; "la família está in the house." is probably a more frequent structure.
    • So we can probably do this to a certain extent left to right (LR) in a single pass.
    • We probably shouldn't treat a single word as code switching, but perhaps a span of 2-3+ words.
    • It's like a state machine: you are in state "en", and you see something that makes you flip to state "ga"; then you see another thing that makes you flip back to state "en".
    • It could also be that at some point you are not sure; in that case you should keep both options open, e.g. keep adding tokens to both en and ga.
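
The state-machine idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the word lists in `lexicons` are tiny hand-made stand-ins for real dictionaries, and `tag_spans`, `group_spans` and `min_span` are hypothetical names. A token found in exactly one lexicon counts as evidence for that language; the state flips only once `min_span` such tokens for the other language have been seen without contrary evidence, and the unsure tokens in between are relabelled along with them.

```python
from itertools import groupby

def tag_spans(tokens, lexicons, start, min_span=2):
    """Label each token with a language in a single left-to-right pass."""
    state = start
    labels = []
    pending = []         # indices of tokens that point at another language
    pending_lang = None
    for i, tok in enumerate(tokens):
        langs = {lang for lang, lex in lexicons.items() if tok.lower() in lex}
        if len(langs) == 1 and state not in langs:
            lang = next(iter(langs))
            if lang != pending_lang:
                pending, pending_lang = [], lang
            pending.append(i)
            labels.append(state)  # provisional label
            if len(pending) >= min_span:
                # Enough evidence: flip, and relabel from the first hint on,
                # including any unsure tokens in between.
                for j in range(pending[0], i + 1):
                    labels[j] = lang
                state = lang
                pending, pending_lang = [], None
        else:
            labels.append(state)
            if langs == {state}:
                # Clear evidence for the current language: drop the hints.
                pending, pending_lang = [], None
    return labels

def group_spans(tokens, labels):
    """Merge consecutive tokens with the same label into (lang, tokens) spans."""
    spans, idx = [], 0
    for lang, grp in groupby(labels):
        n = len(list(grp))
        spans.append((lang, tokens[idx:idx + n]))
        idx += n
    return spans

# Hypothetical mini-lexicons; "an" and "a" are deliberately in both.
lexicons = {
    "en": {"lol", "showin", "in", "irish", "an", "a", "hasnt", "scooby"},
    "ga": {"dáil", "seachtan", "na", "gaeilge", "an", "a", "ceann", "comhairle"},
}
tokens = ("lol rte showin dáil in irish 4 seachtan na gaeilge "
          "an ceann comhairle hasnt a scooby wots bein sed").split()
for lang, words in group_spans(tokens, tag_spans(tokens, lexicons, "en")):
    print(f"[{lang} {' '.join(words)}]")
```

With these toy lexicons the single Irish word "dáil" does not trigger a switch, and the three spans come out as en, ga, en, matching the hand-bracketed example above. Keeping several full hypotheses open in parallel, as the last bullet suggests, would need something closer to a beam or Viterbi search than this greedy pass.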

Literature

To do list

  • Feed charlifter with n-grams (works best with a trigram model). This would improve the current diacritic restoration.
  • Make a list of the most frequently occurring non-dictionary words (ga); these might be abbreviations. Check for these words.
  • Add the most frequently occurring English abbreviations to the list.
  • From Comments = tu -> tú, not tuilleadh
  • change some_known capitals for diff. languages
  • suggestions for including spelling correction
    • Example, Taisbeánta should be Taispeána
    • repetitions haha hehe can be included for this as well
    • thought for such a single repetition
    • should all replacements go through n-gram verification?
    • Words like CAP, REM, mean should stay the same. I'm over-reaching because of the trie implementation; need to weigh this down.
    • Scope for adding rules, e.g. vowels are not repeated.
    • mhoiiiiiilllllll -> mhoill is the correct form, but I got mhoil. Will have to look in the n-gram model for this.
    • The only characters which get repeated are ll, nn, rr.
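
The mhoil/mhoill problem in the last bullets might be handled by generating both forms and letting a frequency table choose. The sketch below assumes the rule stated above (only l, n, r may double; everything else collapses to one letter) and uses a plain word-frequency dictionary as a stand-in for the n-gram model. `candidates`, `normalise` and the counts in `freq` are hypothetical.

```python
import re
from itertools import product

DOUBLABLE = {"l", "n", "r"}  # per the note above: only ll, nn, rr occur

def candidates(word):
    """Generate every form allowed by the doubling rule for a stretched word."""
    options = []
    for m in re.finditer(r"(.)\1*", word):
        run, ch = m.group(0), m.group(1)
        if len(run) > 1 and ch in DOUBLABLE:
            options.append([ch, ch * 2])  # single or double are both possible
        else:
            options.append([ch])          # everything else collapses to one
    return ["".join(parts) for parts in product(*options)]

def normalise(word, freq):
    """Pick the most frequent known candidate; leave unknown words alone."""
    if word in freq:
        return word
    best = max(candidates(word), key=lambda c: freq.get(c, 0))
    return best if freq.get(best, 0) > 0 else word

freq = {"mhoill": 12}  # made-up count, standing in for the n-gram model
print(candidates("mhoiiiiiilllllll"))       # ['mhoil', 'mhoill']
print(normalise("mhoiiiiiilllllll", freq))  # mhoill
print(normalise("CAP", freq))               # CAP
```

Returning the input unchanged when no candidate is known is one way to keep words like CAP and REM from being rewritten.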