User:Ksingla025/Application
Latest revision as of 21:23, 14 March 2014

Draft Proposal: Normalisation of Non-Standard Text Input

This project needs serious attention: most of the time, MT is applied to text that contains non-standard input, and the use of non-standard text is increasing in social interactions. Apertium therefore needs to handle such disturbances in the text well.

Coding Challenge Given :

1. Collected a sample of 2,000 tweets and some chat data to analyse the common patterns, and made a literature survey of the types of non-standard input.

Sample data : https://docs.google.com/document/d/1fGFO6V-lKcvqgzaQRfxEfLWGXF6AxqTIODKdap9IL1c/edit
Translation from Apertium : https://docs.google.com/document/d/1fGFO6V-lKcvqgzaQRfxEfLWGXF6AxqTIODKdap9IL1c/edit

2. Built a normaliser that mainly handles the problem of extended words (like smiiiilleeee, llovveeee). It was divided into two phases:

Phase 1 : Generate possible candidates for an elongated word.
   input : I ammmm goingg tooo Lonndoon :)
   output : ^I/I$ ^ammmm/ammmm/amm/am$ ^goingg/goingg/goingg/going$ ^tooo/tooo/too/to$ ^Lonndoon/Lonndoon/Lonndoon/Lonndon/Londoon/London$ ^:)/{emotion}$ ^:p/{emotion}$
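A minimal sketch of the phase 1 idea (this is not the repository implementation; the function name and the cap of two repeats per letter are my assumptions):

 import re
 from itertools import product
 
 def candidates(token):
     # Shrink every run of repeated letters to length 1 and 2 (English
     # rarely triples a letter); keep the original token as a candidate.
     runs = re.findall(r'((.)\2*)', token)
     options = [[ch * n for n in range(1, min(len(run), 2) + 1)]
                for run, ch in runs]
     return {token} | {''.join(parts) for parts in product(*options)}
 
 print(sorted(candidates('tooo')))    # ['to', 'too', 'tooo']
 print(sorted(candidates('ammmm')))   # ['am', 'amm', 'ammmm']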

The code can be found in the GitHub repository: https://github.com/ksingla025/Normalizer

Phase 2 : Phase 1 reduces each token to a set of possible candidates, so that they can be matched against a word in the morphological dictionary.
     ===> Do a dictionary lookup and take the first candidate it accepts.
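A sketch of that lookup, assuming phase 1 hands over its candidate list and a plain word set stands in for the morphological dictionary (the shortest-first preference is my assumption; the proposal only says to take the first match):

 def pick(cands, dictionary):
     # Return the first candidate the dictionary accepts, shortest
     # form first; fall back to the raw token if nothing matches.
     for cand in sorted(cands, key=len):
         if cand in dictionary:
             return cand
     return cands[0]
 
 words = {'am', 'going', 'to', 'too', 'london'}
 print(pick(['ammmm', 'amm', 'am'], words))   # am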

There are various other problems for Apertium, which I noted using the data; I discuss them in detail below, each with a proposed solution:

1. Elongated Words

   Problem : words like loooooveeeee, mussttt, etc.
   Solution : discussed in the coding challenge above.

2. Smileys or emoticons

   There is a lot more disturbance one can expect.
    2.1. Generally Used (”:) :p :* <3 :|”) :

      Solution : Make a list of them and write a regex; a sample was added in the coding challenge.
      Sample list : check for “smileys_list” in the GitHub repository.

    2.2. Unusual Disturbances : tokens that carry unnecessary punctuation, possibly mixed with digits or letters. Such disturbances can be mapped to {emoticon}.

I collected some common smileys used on Twitter; see “twitter_smileys” in the GitHub repository for the list.
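A minimal sketch of the regex approach (the list below is a placeholder; the real one is “smileys_list” in the repository):

 import re
 
 SMILEYS = [':)', ':p', ':*', '<3', ':|']   # placeholder for smileys_list
 smiley_re = re.compile('|'.join(re.escape(s) for s in SMILEYS))
 
 def mark_emoticons(text):
     # Replace every listed smiley with the {emotion} token.
     return smiley_re.sub('{emotion}', text)
 
 print(mark_emoticons('I ammmm goingg tooo Lonndoon :)'))
 # I ammmm goingg tooo Lonndoon {emotion}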

3. Abbreviations

       3.1) Frequent non-dictionary words: abbreviations that are available online and do not have a dictionary entry.

Sample : the “abbreviations” file in the GitHub repository.

        3.2) Train a linear regression model from words to abbreviations that have a single-word mapping, and use it to extend the dictionary into a new dictionary. (Probably not a good idea.)
           It will handle the following types of cases well:
               a) Delete vowels and possibly sonorant consonants (hdwd for hardwood)
               b) Delete all but the first syllable (ceil for ceiling)
               * It works well for deletion, but not very well for substitution.

Example: but -> bt

Sample code : available in the GitHub repository as “sample_linear_regression.py”. It works; try it, or modify it to train on larger abbreviated parallel data for the best results.

       3.3) Predicting words from an abbreviation with a decision-tree-based search over a WFST (weighted finite-state transducer); this will work for some cases.
          Similar to the implementation in: Sproat, Richard, et al. "Normalisation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.

The WFST part is quite similar to the one I used in my paper “Enhancing ASR by MT using Hindi WordNet” at ICON 2013, with Aniruddha Tammewar, Srinivas Bangalore and Michael Carl.
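I have not sketched the decision-tree/WFST machinery itself; as a crude stand-in, the search can be approximated by ranking every dictionary word that contains the abbreviation as a subsequence, weighted by unigram counts (the names and counts below are toy assumptions):

 def is_subsequence(abbr, word):
     # True if abbr can be read off word left to right.
     it = iter(word)
     return all(ch in it for ch in abbr)
 
 def expand(abbr, freq):
     # Rank dictionary words containing the abbreviation as a
     # subsequence, most frequent first; a crude stand-in for the
     # weighted-FST search.
     hits = [w for w in freq if is_subsequence(abbr, w)]
     return sorted(hits, key=freq.get, reverse=True)
 
 freq = {'hardwood': 120, 'had': 9000, 'hayward': 15}   # toy counts
 print(expand('hdwd', freq))   # ['hardwood']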

Training data for 3.2 and 3.3 : there are long lists of abbreviations available on the web, but they need to be filtered to find the ones that satisfy conditions (a) and (b) of 3.2; a sketch of that filtering follows.
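This assumes the web list comes as (abbreviation, full form) pairs; the deletable-letter set is my reading of “vowels and possibly sonorant consonants”:

 DELETABLE = set('aeiourlmnwy')   # vowels plus sonorants (assumption)
 
 def by_deletion(abbr, word):
     # Condition (a): abbr is word with only vowels/sonorants removed,
     # e.g. hdwd / hardwood; greedy left-to-right match.
     i = 0
     for c in word:
         if i < len(abbr) and c == abbr[i]:
             i += 1
         elif c not in DELETABLE:
             return False
     return i == len(abbr)
 
 def by_truncation(abbr, word):
     # Condition (b), roughly: abbr is a proper prefix of the word,
     # e.g. ceil / ceiling (a real first-syllable test needs more).
     return word.startswith(abbr) and abbr != word
 
 pairs = [('hdwd', 'hardwood'), ('ceil', 'ceiling'), ('gr8', 'great')]
 print([p for p in pairs if by_deletion(*p) or by_truncation(*p)])
 # [('hdwd', 'hardwood'), ('ceil', 'ceiling')]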

4. Hashtags (easily handled using a regex)

    Problem : they come in two types, e.g. #YoAreSocute (capitalised word boundaries) and #Yoaresocute (no internal capitals).
    Solution : the first type is easy to handle (split on capital letters); for the second, we need to generate the possible words, so a dictionary lookup for possible words, or an FST built from a dictionary, can be used (see the sketch below).
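A sketch of both cases (the word set is a toy; the greedy longest-match segmentation stands in for the dictionary FST):

 import re
 
 def split_hashtag(tag, dictionary):
     # CamelCase tags split on capital letters; all-lowercase tags fall
     # back to greedy longest-match segmentation against a word list.
     body = tag.lstrip('#')
     if re.search(r'[a-z][A-Z]', body):
         return re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', body)
     words, rest = [], body.lower()
     while rest:
         for end in range(len(rest), 0, -1):   # longest match first
             if rest[:end] in dictionary:
                 words.append(rest[:end])
                 rest = rest[end:]
                 break
         else:
             return [body]   # give up: unsegmentable
     return words
 
 d = {'yo', 'are', 'so', 'cute'}
 print(split_hashtag('#YoAreSocute', d))   # ['Yo', 'Are', 'Socute']
 print(split_hashtag('#Yoaresocute', d))   # ['yo', 'are', 'so', 'cute']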
 

5. Unknown Characters in Words (handled like elongated words)

   Problem : Sh**, f**k, ki**
   Solution : solve them the way elongated words were handled, or keep a domain of the words in which people most often use such maskings.
                  Further disambiguation can be done on the basis of a language model, built from the monolingual corpus of each language in Apertium.
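A sketch of the dictionary match for starred tokens (the word set is a toy stand-in):

 import re
 
 def uncensor(token, dictionary):
     # Each * may be any letter, so f**k matches every four-letter
     # dictionary word with that frame.
     pat = re.compile('^' + re.escape(token).replace(r'\*', '[a-z]') + '$')
     return sorted(w for w in dictionary if pat.match(w))
 
 print(uncensor('f**k', {'folk', 'fork', 'funk', 'fee'}))
 # ['folk', 'fork', 'funk']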
 

6. Years

   Problem : Apertium doesn’t handle 1980s, 80s, #1980.
   Solution : add a simple rule to handle such cases, e.g. the sketch below.
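Treating year-shaped tokens as pass-through, with #1980 dropping its hash, is my assumption about what the rule should do:

 import re
 
 YEAR = re.compile(r'^#?(\d{2,4}s?)$')   # 1980s, 80s, #1980
 
 def normalise_year(token):
     # Year-shaped tokens skip normalisation; #1980 just loses its hash.
     m = YEAR.match(token)
     return m.group(1) if m else token
 
 for t in ['1980s', '80s', '#1980', 'gr8']:
     print(t, '->', normalise_year(t))
 # 1980s -> 1980s, 80s -> 80s, #1980 -> 1980, gr8 -> gr8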


7. Other Research Ideas Involved :

If one can use Moses to train a system, then it can be trained on parallel data that looks like:

  Source Text (abbreviations) : h d w d
  Target Text (full forms)    : h a r d w o o d

It will learn the alignments, and then character-level language modelling will narrow down the options for the output words. On top of that, we can use a word-based language model to disambiguate further.
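A sketch of preparing that character-level parallel data for Moses (the file names and the pair list are placeholders):

 def to_char_corpus(pairs, src_path, tgt_path):
     # Write (abbreviation, full form) pairs one per line as
     # space-separated characters, the usual input format for
     # character-level Moses models.
     with open(src_path, 'w') as src, open(tgt_path, 'w') as tgt:
         for abbr, word in pairs:
             src.write(' '.join(abbr) + '\n')   # h d w d
             tgt.write(' '.join(word) + '\n')   # h a r d w o o d
 
 to_char_corpus([('hdwd', 'hardwood')], 'train.abbr', 'train.full')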

ALTERNATE : instead of language modelling, take the output that is in the dictionary and is most widely used in the language; this is possible in Apertium.


8. End Goal

Handle the problems above and solve them with regard to Apertium; along with this, there is the possibility of working on idea 7, and of reporting results for the improvement in Apertium’s output, and also for other MT systems trained on Europarl or the standard datasets reported at WMT’14.

[[Category:GSoC 2014 Student proposals|Ksingla025]]