User:Ksingla025/Application

From Apertium
Revision as of 23:26, 13 March 2014 by Ksingla025 (talk | contribs) (Normalization of Non-Standard Text Input)

1) Single-Character Words Problem:

Many English words that sound like a letter of the alphabet are often replaced by that letter to shorten a text. Example: Non-standard text: I c u. Standard form: I see you.

Proposed solution: Because the number of letters is limited (i.e., 26), the number of such cases is also limited. So, by observing a large corpus, we generate a hash map from each letter to the words with a similar IPA pronunciation. Example: b -> be, c -> see, s -> ass, etc.

Many letters have multiple mappings, for example:

n -> an, and
p -> pee, pea
q -> question, queue
t -> tea, tee
v -> we, wee
x -> axe, times

To resolve this ambiguity, we build a language model and select the word that best fits the surrounding context.
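The idea can be sketched roughly as follows. This is only a minimal illustration, not the proposed implementation: the letter map is copied from the examples in the text, while the toy corpus, the add-one smoothed bigram model, the greedy left-to-right choice, and all function names are assumptions made for the example. A real system would train the language model on a large corpus.

```python
from collections import Counter

# Letter -> candidate words, taken from the examples in the text.
LETTER_TO_WORDS = {
    "b": ["be"],
    "c": ["see"],
    "n": ["an", "and"],
    "p": ["pee", "pea"],
    "q": ["question", "queue"],
    "t": ["tea", "tee"],
    "u": ["you"],
    "x": ["axe", "times"],
}

# Hypothetical toy corpus, standing in for a large training corpus.
TOY_CORPUS = [
    "i see you",
    "i want tea and cake",
    "you and i see an axe",
    "two times two",
]


def train_bigrams(sentences):
    """Count unigrams and bigrams, with a sentence-start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_score(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def normalize(text, unigrams, bigrams):
    """Expand single-letter tokens, choosing the candidate that best
    follows the previously chosen word (greedy, left to right)."""
    output = []
    prev = "<s>"
    for token in text.lower().split():
        if len(token) == 1:
            # Letters without a mapping (e.g. "i", "a") are kept as they are.
            candidates = LETTER_TO_WORDS.get(token, [token])
        else:
            candidates = [token]
        best = max(candidates,
                   key=lambda w: bigram_score(prev, w, unigrams, bigrams))
        output.append(best)
        prev = best
    return " ".join(output)


unigrams, bigrams = train_bigrams(TOY_CORPUS)
print(normalize("I c u", unigrams, bigrams))
print(normalize("tea n cake", unigrams, bigrams))
print(normalize("two x two", unigrams, bigrams))
```

With this toy corpus, "n" after "tea" resolves to "and" rather than "an", and "x" after "two" resolves to "times" rather than "axe", because those bigrams were seen in training. The sketch lowercases its input and only looks at the previous word; using the following word as well would likely help in practice.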





2) Smileys

Smileys, also known as "emoticons," are glyphs used to convey emotions in writing.

3) Extended Words

4) Abbreviations

4.1) Frequent [non-dictionary words]

4.2) Train a linear regression model from “words” to abbreviations that have a single-word mapping, and use it to extend the dictionary into a new dictionary. (Not a good idea.) This works well for deletion but not very well for substitution. Example: but -> bt
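To illustrate why the deletion case is the easier one, here is a small sketch. It is not the regression model mentioned above but a much simpler stand-in: an abbreviation formed purely by deleting characters (such as but -> bt) is a subsequence of the full word, so candidates can be found with a subsequence check. The function names and the tiny word list are hypothetical.

```python
def is_deletion_abbreviation(abbr, word):
    """Return True if abbr can be produced from word by deleting
    one or more characters, without reordering or substituting."""
    abbr, word = abbr.lower(), word.lower()
    if len(abbr) >= len(word):
        return False
    remaining = iter(word)
    # "ch in remaining" advances the iterator, so characters must
    # appear in the same order as in the word.
    return all(ch in remaining for ch in abbr)


def deletion_candidates(abbr, dictionary):
    """List the dictionary words that abbr could abbreviate by deletion."""
    return [w for w in dictionary if is_deletion_abbreviation(abbr, w)]


# Hypothetical mini-dictionary for the example.
words = ["but", "bat", "boat", "because"]
print(deletion_candidates("bt", words))
print(is_deletion_abbreviation("bcz", "because"))
```

Here "bt" matches "but", "bat" and "boat", so a ranking step is still needed to pick among them, while a substitution-style abbreviation such as "bcz" for "because" is not matched at all, which reflects the limitation noted above.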


5) Vowels dropp 6)string with numbers 7) special symboles (ike @) 8) Hastags 9) Sh*t 10) years 80s