From Apertium
Revision as of 04:48, 20 March 2014 by Ksnmi


This section contains some points of introduction from my side.

  • Name : Akshay Minocha
  • E-mail address : |
  • Other information that may be useful to contact you: nick on the #apertium channel: ksnmi
  • Why is it you are interested in machine translation?
    • I'm interested in language, and machine translation is one way of handling language change. I have been studying the methods involved in the translation process both theoretically and by building MT systems.
  • Why is it that you are interested in the Apertium project?
    • The current project on "non-standard text input" has everything I love to work on: informal data from Twitter, IRC, etc., noise removal, building FSTs, analysing data and, at the end, building machine translation systems. I believe the approach I have in mind can be standardized for many source languages. For the time being I'm sticking to English, but switching to another language should be easy as well as interesting. The translation quality we achieve should also stay intact when we give the work back to the community; this is an important step. It is also the kind of project whose implementation will eventually help translation for all the language pairs in Apertium.
  • Which of the published tasks are you interested in? What do you plan to do?
    • I initially want to work with English and Spanish as the source languages, since plenty of informal data from social media is available for them. After completing this task we can include the pair and build a standard for improving translation quality for other languages too.
  • Include a proposal, including
    • Reasons why Google and Apertium should sponsor it - I'd love to work on this project with the current set of mentors. The project is important because the open MT community should also embrace the change in how people use language, in the form of popular non-standard text. This will extend our reach to more people and should, in practice, increase the efficiency of the translation task.
  • And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.

Link to the workplan

Coding Challenges

Analysing the issues in non-standard data

I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong during translation.
Details of the analysis can be found at the following link - Link
The link also gives details on the authenticity of the tweets (they were collected for an earlier project, hence the year 2011).

The Extended word reduction task (Mailing list)

At the moment this works for English, using a word list generated from the English dictionary.
The dictionary can be replaced by any other word list and the output will change accordingly.
Sample Input1 ->
Output2 (at the end of the processing)
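
The linked input and output are not reproduced here, so the following is only a minimal sketch of how a word-list-based reduction of extended words could look. The function name `reduce_extended` and the rule "shrink every run of repeated letters to one or two letters, then keep the forms found in the word list" are assumptions for illustration, not necessarily what the submitted script does.

```python
import itertools
import re


def reduce_extended(token, wordlist):
    """Return the word-list entries reachable by shortening repeated letters."""
    lowered = token.lower()
    if lowered in wordlist:
        return [lowered]
    # Split into runs of identical characters: "coool" -> ["c", "ooo", "l"].
    runs = [m.group(0) for m in re.finditer(r"(.)\1*", lowered)]
    # A single character stays as it is; a longer run may shrink to 1 or 2 characters.
    options = [[r] if len(r) == 1 else [r[0], r[0] * 2] for r in runs]
    candidates = {"".join(parts) for parts in itertools.product(*options)}
    return sorted(c for c in candidates if c in wordlist)


if __name__ == "__main__":
    words = {"cool", "hello", "so", "god", "good"}
    for tok in ["coooool", "hellooo", "sooo", "goooood", "xyzzz"]:
        print(tok, "->", reduce_extended(tok, words))
```

A token such as "goooood" stays ambiguous between "god" and "good" under this rule, which is why the sketch returns a list of candidates instead of a single word. Swapping in a word list for another language only changes the `wordlist` argument.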

Corpus Creation

Separate task on Corpus Creation for English ->

  • With special symbols. The number of tweets was high and the list of emoticons extracted from them was considerable. I ended up with around 545 of the most frequently used emoticons (the list of emoticons from the Twitter dataset can be found here: Emoticons_NON_Standard).
    Number of Posts -> 475,179
    Link -> Emoticon_dataset
  • Abbreviations are words which are not in the dictionary but which are used on social platforms, especially ones like Twitter where users face a tight character limit.
    Around 100 of the most common abbreviations from tweets collected over a period of time are listed in the following link -> Abbreviations_english
    Number of Posts -> 94,290
    Link -> abbreviations_english_dataset
  • Repetitive or extended words and punctuation -> Using a simple algorithm, I separated these occurrences. Generating a word list shows how these words are actually used, and it also helps us standardize them for further processing.
    Number of Posts -> 411,404
    Link -> Extended_words_dataset
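
The "simple algorithm" for separating extended words and punctuation is not spelled out above. One possible version is sketched below, assuming that "extended" means three or more identical non-space characters in a row; the name `split_posts` and that threshold are illustrative choices, not taken from the actual script.

```python
import re
from collections import Counter

# Three or more identical characters in a row: letters ("soooo") or punctuation ("!!!").
EXTENDED = re.compile(r"(\S)\1{2,}")


def split_posts(posts):
    """Separate posts containing an extended token and count those tokens."""
    extended_posts, other_posts = [], []
    counts = Counter()
    for post in posts:
        hits = [tok.lower() for tok in post.split() if EXTENDED.search(tok)]
        if hits:
            extended_posts.append(post)
            counts.update(hits)
        else:
            other_posts.append(post)
    return extended_posts, other_posts, counts


if __name__ == "__main__":
    sample = ["I am sooooo happy !!!", "plain text here", "Noooo way"]
    ext, rest, counts = split_posts(sample)
    print(len(ext), "extended posts;", counts.most_common())
```

The counter gives the kind of word list mentioned above. A pattern this simple also flags tokens such as "www", so URLs would need to be filtered out before counting.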

Non Standard features in the Text

I analysed the most common categories of non-standard text occurrences and have summed them up below. The prototype will later describe how I plan to use these modules. For some of them the order won't be important, as we aim to make the whole structure independent of the input language.

Literature Review

There are many sites [1], [2], [3] on the internet that offer SMS English to English translation services. However, the technology behind these sites is simple: it uses straight dictionary substitution, with no language model or any other approach to help disambiguate between possible word substitutions. [4]

  • There have been a few attempts to improve machine translation of non-standard data. One piece of preliminary research is [5], which compares the linguistic characteristics of Europarl data and Twitter data. The suggested methodology relies heavily on in-domain data to improve quality in later steps. The evaluation shows an improvement of 0.57% in BLEU score on a set of 600 sentences. The major suggestion from this research is that hashtags, @usernames and URLs should not be treated like regular words. This was the mistake we were making earlier, and it did not help the translation task. They also follow the technique of putting XML markup in the source text so that it works like superblanks.
    Issues in this case -
    • It requires building a bilingual resource from in-domain data.
    • Other sources of non-standard data don't seem to get a significant improvement.
    • The BLEU score improvement is marginal.
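
The suggestion from [5] to keep hashtags, @usernames and URLs away from the translator could be sketched as below. This is a rough illustration, assuming the Apertium stream-format convention that material inside square brackets is passed through as a superblank; the pattern and the function names `protect` and `unprotect` are hypothetical, and escaping of literal brackets in the input is not handled.

```python
import re

# URLs, @usernames and #hashtags; a rough pattern for illustration only.
PROTECTED = re.compile(r"https?://\S+|[@#]\w+")


def protect(text):
    """Wrap URLs, @usernames and #hashtags in square brackets."""
    return PROTECTED.sub(lambda m: "[" + m.group(0) + "]", text)


def unprotect(text):
    """Remove the brackets that protect() added."""
    return re.sub(r"\[(https?://\S+?|[@#]\w+)\]", r"\1", text)


if __name__ == "__main__":
    tweet = "thx @bob see http://t.co/abc #mt"
    wrapped = protect(tweet)
    print(wrapped)
    print(unprotect(wrapped))
```

Wrapping before translation and unwrapping afterwards keeps these tokens intact without any bilingual in-domain resource, which is the first issue listed for [5].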

  • This is a standard piece of research [6] on non-standard words (NSWs). It suggests that non-standard words are more ambiguous than ordinary words with respect to pronunciation and interpretation. In many applications it is desirable to "normalize" text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They broadly categorize numbers, abbreviations, other markup, URLs, capitalisation, etc. The research suggests a very interesting tree-based abbreviation model, which can give us ideas for improving our current abbreviation model or for adding to it. This includes suggestions for vowel dropping, shortened words and first-syllable usage.
    The issue with most of this research is that it is limited to a particular language, in this case English. They have standardized the most common points of English, leaving scope for a lot of improvement. In our processing we are not considering any specific markup techniques within the pipeline at the moment, but this paper shows some promising work on them which can be useful for developers and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.
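
To make the vowel-dropping and shortening ideas concrete, here is a small sketch that generates abbreviation candidates from a word list and indexes them for lookup. It is not the tree-based model of [6]: it is a plain lookup table, prefix truncation stands in for "first syllable" because no syllabification is attempted, and all function names are hypothetical.

```python
VOWELS = set("aeiou")


def drop_vowels(word):
    """Keep the first letter and drop every later vowel: 'tomorrow' -> 'tmrrw'."""
    return word[:1] + "".join(ch for ch in word[1:] if ch not in VOWELS)


def prefixes(word, min_len=3):
    """Truncations of the word: 'tomorrow' -> 'tom', 'tomo', 'tomor', ..."""
    return [word[:n] for n in range(min_len, len(word))]


def build_abbreviation_index(wordlist):
    """Map each candidate abbreviation to the set of words that could produce it."""
    index = {}
    for word in wordlist:
        for abbr in {drop_vowels(word), *prefixes(word)}:
            if abbr != word:
                index.setdefault(abbr, set()).add(word)
    return index


if __name__ == "__main__":
    index = build_abbreviation_index({"text", "texas", "please", "tomorrow"})
    for abbr in ["txt", "pls", "tex", "tmrrw"]:
        print(abbr, "->", sorted(index.get(abbr, set())))
```

An entry such as "tex" maps to more than one word, which illustrates the ambiguity of NSWs noted above: choosing between candidates still needs context.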

  • This [7] is a completely different approach, where the authors try to solve the problem with character-level machine translation. The issue here is accuracy: they used the Jazzy spell checker [8] as a baseline and then compared their results with previous research of this kind. A further issue is the large amount of resources used up in training and tuning the MT system; such a system would also be complicated to include at run time with Apertium.
  • This research [9] is more idea-centric: the authors say that MT research is far from complete and that we face many challenges. With our project we aim to target some of these problems specifically. Translation of informal text and translation of low-resource language pairs are the ones which concern Apertium and us the most.
  • This research [10] identifies the difficulties that web forum data and other informal genres pose, not only for the translation community but also for people working on semantic role labelling, and probably for many more who rely on data analytics, etc.
    They report that systems tuned on MEANT performed significantly better than systems tuned on BLEU and TER. With our module, the errors identified in their analysis should decrease: because of a significant rise in the number of known words and improvements in grammar and word sense, the semantic parser used there would perform better.
  • This research [11] takes an approach very similar to ours, but it focusses mainly on modelling and expanding abbreviated words by implementing a character-based translation model.
  • Inspired by this research [12], in which Srinivas Bangalore and colleagues suggest a method of bootstrapping from data on chat forums and other informal sources, we can build up abbreviation resources for a particular language. The way to proceed with this task in Apertium, as I suggested before, is to first take a short list of abbreviations and then use them to suggest which other frequent words in the data might also count as abbreviations. This resource can be verified and then included when building the system for that language.


  4. Raghunathan, Karthik, and Stefan Krawczyk. "CS224N: Investigating SMS text normalization using statistical machine translation." Technical Report, 2009.
  5. Jehl, Laura Elisabeth. "Machine translation for Twitter." (2010).
  6. Sproat, Richard, et al. "Normalization of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.
  7. Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalization of SMS Abbreviations." IJCNLP. 2011.
  8. Idzelis, Mindaugas. "Jazzy: The Java open source spell checker." 2005.
  9. Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation."
  10. Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013).
  11. Pennell, Deana L., and Yang Liu. "Normalization of informal text." Computer Speech & Language 28.1 (2014): 256-277.
  12. Bangalore, S., V. Murdock, and G. Riccardi. "Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system." 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7.