User:Ksingla025/Application
Normalisation of Non-Standard Text Input (Ksingla025)
[[Category:GSoC 2014 Student proposals|Ksingla025]]
Latest revision as of 21:23, 14 March 2014
Draft Proposal
This project needs serious attention: much of the time, MT is applied to text that contains non-standard input, and the use of non-standard text is increasing in social interactions. Apertium therefore needs to handle such disturbances in the text well.
Coding Challenge Given:

1. Collected a sample of 2000 tweets, along with some chat data, to analyse the common patterns, and carried out a literature survey of the types of non-standard input.
Sample data: https://docs.google.com/document/d/1fGFO6V-lKcvqgzaQRfxEfLWGXF6AxqTIODKdap9IL1c/edit

Translation from Apertium: https://docs.google.com/document/d/1fGFO6V-lKcvqgzaQRfxEfLWGXF6AxqTIODKdap9IL1c/edit
2. Built a normaliser that mainly handles the problem of extended words (like smiiiilleeee, llovveeee). It was divided into two phases:
Phase 1: generate possible candidates for an elongated word.

Input: I ammmm goingg tooo Lonndoon :)

Output: ^I/I$ ^ammmm/ammmm/amm/am$ ^goingg/goingg/goingg/going$ ^tooo/tooo/too/to$ ^Lonndoon/Lonndoon/Lonndoon/Lonndon/Londoon/London$ ^:)/{emotion}$ ^:p/{emotion}$
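The Phase 1 idea above can be sketched as follows. This is a minimal illustration, not the repository code: the function name is hypothetical, and it simply tries each run of repeated characters reduced to two copies and to one copy, combining all the options.

```python
import re
from itertools import product

def elongation_candidates(token):
    """Phase 1 sketch: for each run of a repeated character, also try the
    run reduced to two copies and to one copy, then combine the options."""
    runs = re.findall(r'((.)\2*)', token)   # maximal runs, e.g. 'ooo'
    options = []
    for run, ch in runs:
        alts = {run}
        if len(run) > 1:
            alts.update({ch * 2, ch})
        options.append(sorted(alts))
    return sorted({''.join(combo) for combo in product(*options)})

print(elongation_candidates("tooo"))   # ['to', 'too', 'tooo']
```

The candidate set stays small because only runs longer than one character branch, and English words rarely contain more than a few such runs.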
The code can be found in the GitHub repository: https://github.com/ksingla025/Normalizer
Phase 2: Phase 1 reduces each token to a set of possible candidates so that they can be matched against a word in the morphological dictionary; then perform a dictionary lookup and take the first word found in it.
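A minimal sketch of the Phase 2 lookup, with a toy set standing in for the morphological dictionary (the function name and fallback behaviour are assumptions for illustration):

```python
def pick_first_in_dictionary(candidates, dictionary):
    """Phase 2 sketch: return the first candidate found in the
    morphological dictionary, falling back to the original form."""
    for cand in candidates:
        if cand.lower() in dictionary:
            return cand
    return candidates[0]

toy_dict = {"to", "too", "am", "going", "london"}   # toy stand-in
print(pick_first_in_dictionary(["tooo", "too", "to"], toy_dict))   # too
```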
There are various other problems with Apertium, which I noted using the data; I will discuss them in detail below, each with a proposed solution:
1. Elongated Words
Problem: words like loooooveeeee, mussttt, etc.

Solution: discussed in the coding challenge.
2. Smileys or emoticons
There is a lot more disturbance one can expect.

2.1. Generally used (”:) :p :* <3 :|”):

Solution: make a list of them and write a regex; a sample was added in the coding challenge. Sample list: check “smileys_list” in the GitHub repository.

2.2. Unusual disturbances: tokens that have unnecessary punctuation or may contain numbers or stray letters. Such disturbances can be mapped to {emoticon}.
I collected some smileys commonly used on Twitter; see “twitter_smileys” in the GitHub repository for the list.
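The list-plus-regex idea for 2.1 can be sketched like this; the list here is a toy subset (the full list would live in the “smileys_list” file), and the placeholder name follows the {emotion} tag used in the coding-challenge output:

```python
import re

# Toy subset; the full list would live in the "smileys_list" file.
SMILEYS = [":)", ":(", ":p", ":*", "<3", ":|", ":D"]
smiley_re = re.compile("|".join(re.escape(s) for s in SMILEYS))

def tag_smileys(text):
    """Replace every listed emoticon with the {emotion} placeholder."""
    return smiley_re.sub("{emotion}", text)

print(tag_smileys("I c u :) :p"))   # I c u {emotion} {emotion}
```

Escaping each smiley with re.escape matters, since characters like ) and * are regex metacharacters.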
3. Abbreviations
3.1) Frequent non-dictionary words: abbreviations that are available online and don't have a dictionary entry.
Sample: the “abbreviations” file in the GitHub repository.
3.2) Train a linear regression model from words to the abbreviations that have a single-word mapping, and extend the dictionary with the results (possibly not a good idea). It handles cases like these well:

a) Deletion of vowels and possibly sonorant consonants (hdwd for hardwood)

b) Deletion of all but the first syllable (ceil for ceiling)

* It works well for deletion, but not very well for substitution.
Example: but → bt
Sample code: available in the GitHub repository as “sample_linear_regression.py”. It works; try it, or modify it to train on a larger abbreviated parallel dataset for best results.
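The deletion pattern of case (a) can be illustrated deterministically, independent of the trained model. This sketch does plain vowel deletion only; the example hdwd for hardwood additionally drops the sonorant r, which this simple rule does not attempt:

```python
VOWELS = set("aeiou")

def vowel_deletion_abbrev(word):
    """Deletion sketch (case a): keep the first letter and drop the
    vowels from the rest of the word."""
    if not word:
        return word
    return word[0] + "".join(c for c in word[1:] if c.lower() not in VOWELS)

print(vowel_deletion_abbrev("but"))        # bt
print(vowel_deletion_abbrev("hardwood"))   # hrdwd
```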
3.3) Predict words from an abbreviation using a decision-tree-based search with a WFST (weighted FST); this will work for some cases. Similar to the implementation in: Sproat, Richard, et al. "Normalisation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.
The WFST part is quite similar to the one I used in my paper “Enhancing ASR by MT using Hindi WordNet” at ICON 2013, with Aniruddha Tammewar, Srinivas Bangalore and Michael Carl.
Training data for 3.2 and 3.3: there are long lists of abbreviations available on the web, but they need to be filtered to find the ones that satisfy the conditions in 3.2 (a and b).
4. Hashtags (easily handled using a regex)
Problem: they can be of two types:

#YoAreSocute

#Yoaresocute

Solution: the first type is easy to handle; for the second, we need to generate the possible words, so a dictionary lookup for possible words, or an FST built from a dictionary, can be used.
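Both hashtag cases can be sketched in one function. This is an illustration with assumed names and a toy word list: camel-cased tags split on capital letters, while all-lowercase tags are segmented greedily against the dictionary (a real system could use an FST built from the dictionary instead):

```python
import re

def split_hashtag(tag, dictionary):
    """Hashtag sketch: camel-cased tags are split on capital letters;
    all-lowercase tags are segmented greedily against a word list."""
    body = tag.lstrip("#")
    parts = re.findall(r"[A-Z][a-z]*|[a-z]+|\d+", body)
    if len(parts) > 1:                       # the capitals give the split
        return [p.lower() for p in parts]
    words, i = [], 0
    while i < len(body):                     # greedy longest match
        for j in range(len(body), i, -1):
            if body[i:j] in dictionary:
                words.append(body[i:j])
                i = j
                break
        else:                                # no match: give up, keep whole
            return [body]
    return words

toy_dict = {"yo", "you", "are", "so", "cute"}
print(split_hashtag("#YoAreSocute", toy_dict))   # ['yo', 'are', 'socute']
print(split_hashtag("#yoaresocute", toy_dict))   # ['yo', 'are', 'so', 'cute']
```

Greedy longest-match can mis-segment; scoring segmentations with a language model, as proposed elsewhere in this draft, would be the natural refinement.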
5. Unknown Characters in Words (handled like elongated words)
Problem: Sh**, f**k, ki**

Solution: solve them the way elongated words were handled, or keep a domain of words in which people use these forms most often. Further disambiguation can be done on the basis of a language model, built from the monolingual corpus of each language in Apertium.
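A candidate-generation sketch for the masked-character case, analogous to the elongated-word handling (toy dictionary; in practice the language model would pick among the surviving candidates):

```python
import string
from itertools import product

def unmask_candidates(token, dictionary):
    """Masked-word sketch: try every letter in each '*' slot and keep
    the expansions that are dictionary words."""
    slots = [string.ascii_lowercase if c == "*" else c for c in token.lower()]
    return sorted("".join(combo) for combo in product(*slots)
                  if "".join(combo) in dictionary)

toy_dict = {"ship", "shot", "shut"}
print(unmask_candidates("sh**", toy_dict))   # ['ship', 'shot', 'shut']
```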
6. Years
Problem: Apertium doesn’t handle 1980s, 80s, #1980.

Solution: write a simple rule to handle such cases.
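One possible shape for such a rule, covering the three examples above (the exact normalised form Apertium expects is an assumption here; this sketch only recognises the token and strips the leading #):

```python
import re

# Matches 80s, 1980s, #1980, 1980 and similar year/decade tokens.
YEAR_RE = re.compile(r"^#?(\d{2}|\d{4})s?$")

def normalise_year(token):
    """Year-rule sketch: recognise year/decade tokens and strip a
    leading '#', leaving the rest for regular number handling."""
    if YEAR_RE.match(token):
        return token.lstrip("#")
    return token

for t in ["1980s", "80s", "#1980"]:
    print(normalise_year(t))   # 1980s / 80s / 1980
```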
7. Other Research Ideas
If one can use Moses to train a system, then one can train it on parallel data that looks like:
Source text (abbreviations): h d w d

Target text (full forms): h a r d w o o d
It will learn the alignments, and character-level language modelling will then narrow down the options for the output words. On top of that, we can use a word-based language model to disambiguate further.
Alternative: instead of language modelling, take the output that is in the dictionary and is most widely used in the language; this is possible in Apertium.
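The alternative above amounts to picking the most frequent in-dictionary candidate; a sketch with a toy frequency table (the function name and fallback are assumptions):

```python
def pick_most_frequent(candidates, freq):
    """Alternative sketch: among candidates that are dictionary words,
    pick the one most widely used (highest corpus frequency)."""
    in_dict = [c for c in candidates if c in freq]
    return max(in_dict, key=freq.get) if in_dict else candidates[0]

freq = {"to": 1000, "too": 200}   # toy frequency counts
print(pick_most_frequent(["tooo", "too", "to"], freq))   # to
```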
8. End Goal
Handle the problems above and solve them with regard to Apertium. Along with this, there is the possibility of working on idea 7, and of reporting results for the improvement in Apertium’s output, as well as for other MT systems trained on Europarl or the standard datasets reported in WMT’14.