Revision as of 03:22, 20 March 2014 by Ksnmi (talk | contribs)

Some details about myself

  • Name : Akshay Minocha
  • E-mail address : |
  • Other information that may be useful to contact you: nick on the #apertium channel: ksnmi
  • Why is it you are interested in machine translation?
    • I'm interested in language, and machine translation is one way of handling language change. I have been working to understand the methods involved in the translation process, both theoretically and by building MT systems.
  • Why is it that you are interested in the Apertium project?
    • The current project on "non-standard text input" has everything I love to work on: informal data from Twitter, IRC, etc., noise removal, building FSTs, analysing data and, in the end, building machine translation systems. I believe the approach I have in mind can be standardized for many source languages. For the time being I'm sticking to English, but switching languages is easy as well as interesting. The translation quality should also stay intact when we give this work back to the community; that is an important requirement. This is also one of those projects whose implementation will eventually help translation for all the language pairs in Apertium.
  • Which of the published tasks are you interested in? What do you plan to do?
    • I initially want to work with English and Spanish (Español) as the source languages, since plenty of informal social media data is available for them. After completing this task we can include the pair and build a standard to improve translation quality for other languages too.
  • Include a proposal, including
    • a title,
    • reasons why Google and Apertium should sponsor it - I'd love to work on the current project with this set of mentors. The project is important because the open MT community should also welcome the change in how people use language, in the form of popular non-standard text. This will extend our reach to more people and, in practice, increase the efficiency of the translation task.
  • And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.
  1. Draft version at the moment ( 13th March, 2014 )

Coding Task

  • Points and my progress on the Coding Task that was posted on the Ideas page of this project ->
    • A test corpus has been built from tweets collected earlier. Some general trends were seen in the non-standard input. The most frequent sample set is available at Link
    • At the above link you will find details on the authenticity of the tweets (collected for an earlier project, hence the year 2011), the translation by Apertium, and my comment on each translation.

Corpus Creation

Separate task on Corpus Creation ->

  • I created several types of non-standard corpora for analysis, and took the above set of 50 tweets from random parts of them.
    • With special symbols. The number of such tweets was high, and they yielded a considerable list of emoticons. I ended up finding around 545 of the most frequently used emoticons (the list of emoticons from the Twitter dataset can be found here: Emoticons_NON_Standard)
      Number of Posts -> 475,179
      Link -> Emoticon_dataset
    • Abbreviations are words which are not in the dictionary but which are used on social platforms, especially ones like Twitter where users face a tight character limit.
      Around 100 of the most common abbreviations from tweets collected over a period of time are listed at the following link -> Abbreviations_english
      Number of Posts -> 94,290
      Link -> abbreviations_english_dataset
    • Repetitive or extended words and punctuators -> Using a simple algorithm, I separated these occurrences. Generating a word list shows how these words tend to be used, and it also helps us standardize them for further processing.
      Number of Posts -> 411,404
      Link -> Extended_words_dataset

I analysed the most common categories of non-standard text occurrences and have summarized them below. If handled in the order in which they are listed, they would produce the most effective standard text ->

  • Use of content specific terms ->
    • Such as RT (ReTweet), @<referral> and hashtags in the case of Twitter. These have to be ignored, and ignoring them does not affect the translation quality much. However, their appearance at arbitrary positions affects the machine translation components further along the text-processing pipeline. Links are also present in most of the tweets.
  • Handling Links (Imp) ->
    • This needs to be taken into account in Apertium at the moment, not only for non-standard but also for normal standard input.
      Suggestion -> Links are currently not being ignored. They are marked with a *(unknown). This should be noted and corrected, as machine-translating a link defeats its purpose.
      For example, say en->es translation of -> )
      Current translation by Apertium ->
      which is incorrect. The above example would re-direct us to an undesirable page.
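
One possible way to keep links, @mentions and the RT marker away from the translator is to swap them for placeholders before translation and restore them afterwards. The sketch below is only an illustration under assumptions: the function names `mask` and `unmask`, the `<<n>>` placeholder format and the example URL are all hypothetical, and it is not how Apertium itself handles such tokens.

```python
import re

# Assumed patterns for the tokens that should be left untranslated:
# links, @mentions and the standalone RT marker.
PROTECTED = re.compile(r"https?://\S+|www\.\S+|@\w+|\bRT\b")


def mask(text):
    """Replace protected tokens with numbered placeholders.

    Returns the masked text and the list of original tokens.
    """
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return "<<%d>>" % (len(saved) - 1)

    return PROTECTED.sub(stash, text), saved


def unmask(text, saved):
    """Put the original tokens back in place of the placeholders."""
    return re.sub(r"<<(\d+)>>", lambda m: saved[int(m.group(1))], text)


if __name__ == "__main__":
    tweet = "RT @user check http://example.com/page now"  # hypothetical tweet
    masked, saved = mask(tweet)
    print(masked)                 # <<0>> <<1>> check <<2>> now
    print(unmask(masked, saved))  # the original tweet again
```

Whether a translator leaves such placeholders untouched would have to be checked for each pipeline; the placeholder format here was chosen only for readability.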
  • Use of Emoticons ->
    • People use emoticons very frequently in posts. These have to be ignored.
      Analysing the symbols present in the set of tweets, I found that the most commonly occurring emoticons are the following -> Emoticons most commonly used (546) (already mentioned above)
      Solution ->
      If we do not want the expression to be lost in translation, these can be kept as they are. Otherwise, if Apertium treats them as punctuators, we should remove them.
      Since the popular ones include characters and words as well, we WON’T be using regular expressions, which would limit our reach.
  • Use of Repetitive or Extended Words ->
    • This is the most commonly occurring issue in non-standard text.
      The task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the word list generated from the English dictionary.
      The dictionary can be replaced by any other word list and the output will adapt accordingly.
      Sample Input1 ->
      Output2 (at the end of the processing)
      Our final aim is to reduce these words in a similar fashion as described above and then match them.
      Note that abbreviations and acronyms should also be added to the dictionary externally. In many cases a repetition such as “uuuu” is given, which would standardize to “you”, so “uuuu”->“u”->“you”.
      Hence abbreviation processing should always come after this step, preferably at the end.
    • Punctuation repetition is not a problem for us, since Apertium handles !!! similarly to !
  • Handling of Hashtags ->
    • Cases in Hashtags ->
      • Words are separated by Capitals
        For example, #ForLife -> For Life
      • Words are not separated by Capitals
        For example, #Fridayafterthenext
    • Solution -
      Hashtag disambiguation can be done easily in either of two ways: breaking the hashtag into separate words through repeated dictionary lookups, or using FSTs. I think the latter will be much easier.
      It is important to separate the words in hashtags. Hashtags are meant to convey the emotion or the summary of the tweet, so they are most often not tied to the surrounding grammatical context.
    • So words in hashtags should be represented as a ‘lone sentence’.
      Example, “Today comes monday again, #whereismyextrasunday” ->
      Today comes monday again. “Where is my extra Sunday”
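
The two hashtag cases above can be sketched as follows: a capital-letter split for tags like #ForLife, and a dictionary-driven split for all-lowercase tags. This is a minimal sketch of the dictionary variant only (not the FST variant); the tiny `WORDS` set and all function names are assumptions made for illustration, and a real run would use a full word list.

```python
import re

# Hypothetical miniature word list; a real system would load a dictionary.
WORDS = {"where", "is", "my", "extra", "sunday",
         "for", "life", "friday", "after", "the", "next"}


def split_on_capitals(tag):
    """Split a mixed-case tag such as 'ForLife' at its capital letters."""
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", tag)


def segment(text, words=WORDS):
    """Split text into the fewest dictionary words, or return None."""
    best = [None] * (len(text) + 1)  # best[i] = segmentation of text[:i]
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


def hashtag_to_words(hashtag):
    """Return the words of a hashtag, leaving it whole if nothing matches."""
    tag = hashtag.lstrip("#")
    if tag != tag.lower() and tag != tag.upper():
        return split_on_capitals(tag)
    return segment(tag.lower()) or [tag]


if __name__ == "__main__":
    print(hashtag_to_words("#ForLife"))               # ['For', 'Life']
    print(hashtag_to_words("#whereismyextrasunday"))  # ['where', 'is', 'my', 'extra', 'sunday']
```

Preferring the segmentation with the fewest words is one simple tie-breaking choice; a frequency-weighted choice would likely behave better on a large dictionary.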
  • Abbreviation and Acronyms ->
    • By matching the most frequently occurring non-dictionary words in the tweets, I came up with a list of a few abbreviations.
      These are -> English_abbreviations_list_non_standard
      The solution to improve translation when these occur is simple.
      Once we know their full forms, we can simply substitute them as the final step of the processing towards standard input.
      Single-character abbreviations such as r->are, u->you, 2->to are also included. This list can be extended by further analysing the data.
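
The ordering described above (first reduce extended words against the dictionary, then expand abbreviations as the last step) can be sketched as below. This is an illustrative sketch under assumptions, not the script from the coding task: `WORDS` and `ABBREVIATIONS` are tiny hypothetical stand-ins for the real word list and abbreviation list, and the function names are made up for the example.

```python
import itertools

# Hypothetical stand-ins for the dictionary word list and the abbreviation list.
WORDS = {"so", "good", "cool", "hello", "you", "are", "to", "see"}
ABBREVIATIONS = {"u": "you", "r": "are", "2": "to"}


def candidates(token):
    """Yield forms of token with every repeated-character run cut to 1 or 2."""
    runs = [(ch, len(list(group))) for ch, group in itertools.groupby(token)]
    options = [[ch] if n == 1 else [ch, ch * 2] for ch, n in runs]
    for combo in itertools.product(*options):
        yield "".join(combo)


def normalize_token(token):
    """Reduce an extended word to a known form, then expand abbreviations."""
    lower = token.lower()  # note: this sketch discards the original casing
    known = WORDS | set(ABBREVIATIONS)
    if lower not in known:
        matches = [c for c in candidates(lower) if c in known]
        if matches:
            # Prefer the known candidate closest in length to the input.
            lower = max(matches, key=len)
    # Abbreviation expansion comes last, so "uuuu" -> "u" -> "you".
    return ABBREVIATIONS.get(lower, lower)


def normalize(text):
    return " ".join(normalize_token(t) for t in text.split())


if __name__ == "__main__":
    print(normalize("sooo goooood 2 see uuuu"))  # so good to see you
```

Tokens that match nothing in either list are passed through unchanged, so later steps such as spelling correction can still act on them.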

  • Spelling mistakes ->
    • These include deliberate misspellings as well as errors that arise from vowel dropping.
      Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. Although this algorithm worked well, and is also implemented by the PyEnchant library for Python
      >>> import enchant
      >>> d = enchant.request_dict("en_US")
      >>> d.suggest("Helo")
      ['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
      it was somewhat inaccurate, as it did not consider the "transposition" operation, which is defined in the following link. Peter Norvig, in the spell correct link, shows us how easily we can build a spelling correction script using a large standard corpus for a particular language.
      Building a spelling corrector for a language becomes easy with either of the above approaches, and it solves both problems.
      Alternatively, from a large bag of words we can also probabilistically find the most likely spelling for a word.
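
To make the role of the "transposition" operation concrete, here is a small sketch of the optimal string alignment distance, a restricted form of Damerau-Levenshtein distance that adds the swap of two adjacent characters to the three Levenshtein edits. It is not the algorithm used inside PyEnchant or in Norvig's script; the function names and the sample vocabulary are assumptions for illustration.

```python
def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution and
    transposition of two adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # Adjacent characters swapped: count it as a single edit.
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def closest(word, vocabulary):
    """Return the vocabulary entry with the smallest distance to word."""
    return min(vocabulary, key=lambda entry: osa_distance(word, entry))


if __name__ == "__main__":
    print(osa_distance("teh", "the"))                    # 1 (plain Levenshtein gives 2)
    print(closest("freind", ["friend", "fried", "find"]))  # friend
```

With plain Levenshtein distance a swapped pair costs two edits, which is why common typing errors such as "teh" can rank below unrelated one-edit words.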
  • Apostrophe correction ->
    • There are some words where we can easily predict whether the apostrophe belongs or not,
      for example - theyll -> they’ll
      or im -> i’m
      but ambiguity exists in words like ->
      hell -> he’ll or hell ?
      shell -> she’ll or shell ?
      Here the apostrophe changes the whole sense, as these are two completely different words.
      This can be improved by using the trigram-based prediction mechanism, where the trigram probabilities of the text are compared against those from the standard corpus and the results are reported.
      List of apostrophe occurrences from a standard corpus that I collected earlier -> List of apostrophe occurrences_standard_English
  • Spacing and hyphen variation & optional hyphen ->
      • We are proposing a proper mechanism for this. One way is to create a reference corpus (either what Apertium is currently using, or something we can build quickly using the technique described in my paper) -> ( Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. )
        With this we can use a trigram-based (or higher n-gram) model to predict the most probable word. We can also train on the reference corpus to predict the word.
        After creating the standard text, the only way to verify our level of success would be to compare our system against other available machine translation systems such as Moses, train them on different sets and check our accuracy.
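
The trigram idea can be illustrated with raw counts from a reference corpus: for an ambiguous token, each candidate is scored by how often the three trigrams covering it occur in the corpus. This is a toy sketch under assumptions, using unsmoothed counts rather than real probabilities; `REFERENCE_CORPUS`, `AMBIGUOUS` and the function names are hypothetical, and a usable model would need a large corpus and smoothing.

```python
from collections import Counter

# Hypothetical five-sentence stand-in for a large reference corpus.
REFERENCE_CORPUS = [
    "i think he'll come tomorrow",
    "he'll be here soon",
    "she said he'll call later",
    "that was a day from hell",
    "go to hell and back",
]

# Tokens whose apostrophe cannot be decided without context.
AMBIGUOUS = {"hell": ["hell", "he'll"], "shell": ["shell", "she'll"]}


def trigrams(tokens):
    """Return all trigrams of a token list, with sentence padding."""
    padded = ["<s>", "<s>"] + tokens + ["</s>", "</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


COUNTS = Counter(t for line in REFERENCE_CORPUS for t in trigrams(line.split()))


def score(tokens, position):
    """Sum the corpus counts of the three trigrams covering tokens[position]."""
    grams = trigrams(tokens)
    return sum(COUNTS[g] for g in grams[position:position + 3])


def restore_apostrophes(sentence):
    """Pick, for each ambiguous token, the candidate with the best score.

    Ties keep the first candidate, which is the unchanged form.
    """
    tokens = sentence.lower().split()
    for i, token in enumerate(tokens):
        options = AMBIGUOUS.get(token)
        if options:
            tokens[i] = max(
                options,
                key=lambda w: score(tokens[:i] + [w] + tokens[i + 1:], i))
    return " ".join(tokens)


if __name__ == "__main__":
    print(restore_apostrophes("i think hell come tomorrow"))  # i think he'll come tomorrow
    print(restore_apostrophes("go to hell and back"))         # go to hell and back
```

The same scoring could in principle choose between spacing and hyphen variants by listing them as candidates for one slot, though multi-token variants would need a slightly different substitution step.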


The project is important because non-standard text is not handled by many MT systems, and because we have to follow the trend of the language used today in order to convey the meaning intact to a speaker of a different native language.