User:Ksingla025/Application

Normalisation of Non-Standard Text Input

Do we need to add a tokenizer to the normalizer? It could include basic cleaning such as collapsing extra punctuation marks ( ... ) and stripping unwanted characters (#karan). A sketch of what that first pass might look like follows.
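A minimal sketch of such a first pass; the specific rules shown (collapsing repeated punctuation, dropping the '#' of hashtags) are illustrative guesses rather than a settled spec:

 import re

 def pre_clean(text):
     """Basic cleaning before normalization proper."""
     text = re.sub(r"([.!?,])\1+", r"\1", text)  # "..." -> ".", "!!!" -> "!"
     text = re.sub(r"#(\w+)", r"\1", text)       # "#karan" -> "karan"
     return text

 print(pre_clean("wow... #karan!!!"))  # -> wow. karan!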

1) Single Character Words

Problem:

Many English words that sound like a letter of the alphabet are often replaced by that letter to reduce the character length of a text.

Example: Non-standard text:  I c u.
         Standard form:      I see you.

Solution proposed: As the letters are limited in number (i.e. 26), the number of such cases is also limited. So, by observing a large corpus, a hash map from each letter to the words with a similar IPA pronunciation can be generated. Example: b -> be, c -> see, s -> ass, etc.

Many letters will have multiple mappings, e.g. n -> an, and; p -> pee, pea; q -> question, queue; t -> tea, tee; v -> we, wee; x -> axe, times.

So, to escape ambiguity, we build a language model and select the most suitable word accordingly (a sketch follows the doubt below).


Doubt: The language model will just act like a text file and will be loaded into the database to make the call. Is that possible?
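A minimal sketch of the map-plus-language-model idea, assuming a toy bigram table in place of the real model; the map entries and counts here are illustrative, not the final resource:

 # Illustrative letter -> candidate words map (would be built from corpus observation)
 LETTER_MAP = {
     "b": ["be"],
     "c": ["see", "sea"],
     "n": ["an", "and"],
     "u": ["you"],
 }

 # Toy bigram counts standing in for a real language model (made-up numbers)
 BIGRAMS = {("i", "see"): 10, ("see", "you"): 12, ("i", "sea"): 1}

 def score(prev, word):
     """Add-one-smoothed bigram count of word given the previous word."""
     return BIGRAMS.get((prev, word), 0) + 1

 def expand_single_letters(tokens):
     out = []
     for tok in tokens:
         low = tok.lower()
         if low in LETTER_MAP:
             prev = out[-1].lower() if out else "<s>"
             # let the language model pick the best candidate in context
             out.append(max(LETTER_MAP[low], key=lambda w: score(prev, w)))
         else:
             out.append(tok)
     return out

 print(" ".join(expand_single_letters("I c u".split())))  # -> I see you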

2) Smileys 

Smileys, also known as "emoticons", are glyphs used to convey emotions in writing.

Each symbol can simply be mapped to an emotion, or a single regular expression can be added to match them all (sketched below).
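A quick sketch of the regular-expression option; the emoticon table and the "(smile)"-style output tags are assumptions for illustration:

 import re

 # Illustrative emoticon -> emotion map
 SMILEYS = {":-)": "(smile)", ":)": "(smile)", ":(": "(sad)", ":D": "(laugh)", ";)": "(wink)"}

 # One alternation matching any known emoticon, longest first so ":-)" wins over ":)"
 PATTERN = re.compile("|".join(re.escape(s) for s in sorted(SMILEYS, key=len, reverse=True)))

 def normalize_smileys(text):
     return PATTERN.sub(lambda m: SMILEYS[m.group(0)], text)

 print(normalize_smileys("c u soon :-)"))  # -> c u soon (smile)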

3) Extended Words

Already handled in the coding challenge (lllooooovvveee, nasssaaaa); a minimal version is sketched below.
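For reference, a minimal version of that squashing, with a stub word set standing in for the real dictionary:

 import re

 WORDS = {"love", "nasa", "cool"}  # stub; the real check would use an Apertium dictionary

 def squash(word):
     """Collapse runs of repeated letters, preferring the variant found in the dictionary."""
     two = re.sub(r"(.)\1{2,}", r"\1\1", word)  # keep at most two of each repeated letter
     one = re.sub(r"(.)\1+", r"\1", word)       # keep exactly one of each repeated letter
     for cand in (two, one):
         if cand in WORDS:
             return cand
     return one

 print(squash("lllooooovvveee"))  # -> love
 print(squash("nasssaaaa"))       # -> nasa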

4) Abbreviations (needs effort)

4.1) Frequent non-dictionary words: abbreviations which are available online and don't have a dictionary entry.


       4.2) Train a linear regression model from words to the abbreviations that have a single-word mapping, and use it to extend the dictionary (probably not a good idea).
            It will handle two kinds of deletion well:
                a) Delete vowels and possibly sonorant consonants (hdwd for hardwood)
                b) Delete all but the first syllable (ceil for ceiling)

            It works well for deletion (e.g. but -> bt) but not very well for substitution.

            Doubt: Already implemented with a small training set; it works fine. What do you think about this?
           
       4.3) Predicting words from an abbreviation will work for some cases (a sketch follows this list).
            Similar to the implementation in: Sproat, Richard, et al. "Normalization of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.

NOTE: For 4.2 and 4.3, the data collected in 4.1 will be used.

         Suggestions on this are needed.
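A minimal sketch of the 4.3 direction under the pure-deletion assumption: candidate expansions are dictionary words that contain the abbreviation as a letter subsequence. The word list and ranking are placeholders; a language model would choose among the candidates:

 WORDS = ["hardwood", "ceiling", "but", "bat", "boat"]  # placeholder dictionary

 def is_subsequence(abbr, word):
     """True if abbr's letters appear in word, in order (pure-deletion model)."""
     it = iter(word)
     return all(ch in it for ch in abbr)

 def expand(abbr):
     """Candidate expansions that start with the same letter as the abbreviation."""
     return [w for w in WORDS if w and w[0] == abbr[0] and is_subsequence(abbr, w)]

 print(expand("hdwd"))  # -> ['hardwood']
 print(expand("bt"))    # -> ['but', 'bat', 'boat']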

Other ideas:

      Words like "gr8" could be solved using a recursive phonetic representation of words: "8" has the mapping "eyt", and the word "ate" also has the mapping "eyt", so a digit can be replaced by a word with the same pronunciation (sketched below).
       Suggestions are needed.
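A rough sketch of that idea, with stub pronunciations standing in for a real lexicon such as CMUdict, and an assumed digit-to-phone table:

 # Stub pronunciations; a real system would load a lexicon such as CMUdict
 PRON = {"great": "g r ey t", "late": "l ey t", "ate": "ey t"}
 DIGIT_PHONES = {"8": "ey t", "2": "t uw", "4": "f ao r"}  # assumed digit -> phone table

 def phones(token):
     """Phone string for a token, expanding embedded digits via DIGIT_PHONES."""
     # crude: treats each letter as its own phone, which happens to work for g and r
     return " ".join(DIGIT_PHONES[ch] if ch.isdigit() else ch for ch in token)

 def candidates(token):
     """Dictionary words whose pronunciation matches the token's phone string."""
     target = phones(token)
     return [w for w, p in PRON.items() if p == target]

 print(phones("gr8"))      # -> g r ey t
 print(candidates("gr8"))  # -> ['great']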


Other Noted Problems in Apertium

 5) Special symbols (like @; a mapping needs to be made)
 6) Hashtags (easily handled using a regex; see the sketch below)

 7) Sh*t (works like 3)
 8) Years: 80s, 1980s, #1980 (can be handled)
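A small sketch of the regex handling for 6) and 8); the exact normalised output format is an assumption:

 import re

 HASHTAG = re.compile(r"#(\w+)")
 YEAR = re.compile(r"\b(?:\d{4}|\d{2})s?\b")  # matches 1980, 1980s, 80s

 def strip_hashtags(text):
     """Keep the word of a hashtag: #karan -> karan, #1980 -> 1980."""
     return HASHTAG.sub(r"\1", text)

 def find_years(text):
     """Return year-like tokens so later stages can treat them as dates, not OOV words."""
     return YEAR.findall(text)

 s = strip_hashtags("the #80s were gr8, see #1980")
 print(s)              # -> the 80s were gr8, see 1980
 print(find_years(s))  # -> ['80s', '1980']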