
Draft Proposal

This project needs serious attention: most of the time MT is applied to text which contains non-standard input, and the use of non-standard text is increasing in social interactions. So Apertium needs to handle these disturbances in the text well.

Coding Challenge : 1. Collected a sample of 2000 tweets to analyse the common patterns, plus some chat data, and also made a literature survey to check for the types of non-standard input.

Sample data and its translation from Apertium :

2. It mainly handled the problem of elongated words (like smiiiilleeee, llovveeee) and was divided into two phases:

Phase 1 : Generate possible candidates for an elongated word
   input : I ammmm goingg tooo Lonndoon :) :p
   output : ^I/I$ ^ammmm/ammmm/amm/am$ ^goingg/goingg/goingg/going$ ^tooo/tooo/too/to$ ^Lonndoon/Lonndoon/Lonndoon/Lonndon/Londoon/London$ ^:)/{emotion}$ ^:p/{emotion}$
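For illustration, here is a minimal sketch of the Phase 1 idea (not the actual challenge code; that is in the GitHub repository). It squeezes every run of repeated letters down to two copies and one copy, and emits all combinations as candidates:

 import re
 from itertools import product

 def candidates(token):
     """Generate possible normalisations of an elongated token."""
     # Split the token into runs of identical characters, e.g. "tooo" -> ["t", "ooo"].
     runs = [m.group(0) for m in re.finditer(r'(.)\1*', token)]
     # For every run longer than one letter, keep the original run
     # and also try two copies and one copy of the letter.
     options = [{run, run[0] * 2, run[0]} if len(run) > 1 else {run}
                for run in runs]
     return {''.join(combo) for combo in product(*options)}

 print(sorted(candidates("tooo")))   # ['to', 'too', 'tooo']
 print(sorted(candidates("ammmm")))  # ['am', 'amm', 'ammmm']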

The code can be found in the GitHub repository:

Phase 2 : Phase 1 reduced each token to a set of possible candidates, so that it can be matched to a word in the morphological dictionary.
     ===> Do a dictionary look-up and take the first candidate found in it.
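A rough sketch of that look-up, assuming the dictionary is available as a plain set of surface forms (the real implementation would query Apertium's morphological analyser instead; the word list below is hypothetical):

 def pick_candidate(cands, dictionary):
     """Return the first candidate that has a dictionary entry,
     falling back to the original token (the first candidate)."""
     for cand in cands:
         if cand.lower() in dictionary:
             return cand
     return cands[0]

 dictionary = {"i", "am", "going", "to", "london"}
 print(pick_candidate(["tooo", "too", "to"], dictionary))  # to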

There are various other problems with Apertium which I noted using the data; I will discuss them in detail below, each with a proposed solution :

1. Elongated Words

   Problem : words like loooooveeeee, mussttt, etc.
   Solution : discussed in the coding challenge (see the sketch above)

2. Smileys or emoticons

   There is a lot more disturbance one can expect.
    2.1. Generally used (”:) :p :* <3 :|”) :
      Solution : Make a list of them and write a regex; a sample was added in the coding challenge, and a sketch follows after this section.
      Sample list : check for “smileys_list” in the GitHub repository.
    2.2. Unusual disturbances : generally tokens which contain unnecessary punctuation, possibly mixed with numbers or letters. Such disturbances can be tagged as {emoticon}.

I collected some general smileys used on Twitter; look for “twitter_smileys” in the GitHub repository for the list.
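A minimal sketch of the regex idea from 2.1; the list here is only a stand-in for the smileys_list file:

 import re

 # Stand-in for the "smileys_list" file in the repository.
 SMILEYS = [":)", ":(", ":p", ":*", "<3", ":|", ":D"]
 SMILEY_RE = re.compile("|".join(re.escape(s) for s in SMILEYS))

 def tag_emoticons(text):
     """Replace every known smiley with the {emoticon} tag."""
     return SMILEY_RE.sub("{emoticon}", text)

 print(tag_emoticons("I ammmm goingg tooo Lonndoon :)"))
 # I ammmm goingg tooo Lonndoon {emoticon}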

3. Abbreviations

       3.1) Frequent abbreviations (non-dictionary words): abbreviations which are available online and don't have a dictionary entry.

Sample : the “abbreviations” file in the GitHub repository

        3.2)  Train a linear regression model from words to the abbreviations that have a single-word mapping, and extend the dictionary with the new entries. ( probably not a good idea )
           It will handle these types of cases well:
               a) Delete vowels and possibly sonorant consonants (hdwd for hardwood)
               b) Delete all but the first syllable (ceil for ceiling)
           It works well for deletion (but not very well for substitution).

Example : but -> bt

Sample code : available in the GitHub repository. It works fine; try it, or modify it to train on larger abbreviated parallel data to get the best results.
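As a hedged illustration of the deletion case 3.2(a), one can also run it in reverse: precompute the skeletons of every dictionary word obtained by deleting vowels (always) and sonorant consonants (optionally), then expand an abbreviation by looking up its skeleton. The word list is a hypothetical stand-in for the dictionary:

 from collections import defaultdict
 from itertools import product

 VOWELS = set("aeiou")
 SONORANTS = set("rlmnwy")

 def skeletons(word):
     """All forms reachable by deleting vowels (always) and
     sonorant consonants (optionally); the first letter is kept."""
     choices = []
     for i, ch in enumerate(word):
         if i == 0:
             choices.append([ch])        # always keep the first letter
         elif ch in VOWELS:
             choices.append([""])        # vowels are always deleted
         elif ch in SONORANTS:
             choices.append([ch, ""])    # sonorants may be deleted
         else:
             choices.append([ch])
     return {"".join(c) for c in product(*choices)}

 index = defaultdict(list)
 for w in ["hardwood", "but", "love"]:
     for s in skeletons(w):
         index[s].append(w)

 print(index["hdwd"])  # ['hardwood']
 print(index["bt"])    # ['but']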

       3.3) Predicting words from an abbreviation using a decision-tree-based search over a WFST (weighted FST); this will work for some cases.
          Similar to the implementation of : Sproat, Richard, et al. "Normalization of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.

The WFST part is quite similar to the one I used in my paper “Enhancing ASR by MT using Hindi WordNet” at ICON-2013, with Aniruddha Tammewar, Srinivas Bangalore and Michael Carl.

Training data for 3.2 and 3.3 : there are long lists of abbreviations available on the web, but they need to be filtered to find the ones which satisfy the conditions of 3.2 (a & b).

4. Hashtags ( easily handled using a regex )

    Problem : they can be of two types : plain single-word tags, and concatenated multi-word tags like #YoAreSocute
    Solution : The first type is easy to handle; for the second, we need to generate the possible words, so a dictionary look-up for possible words, or an FST built from a dictionary, can be used.
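A sketch of both hashtag cases, assuming a plain word list (the names split_camel and split_dict are mine, not existing Apertium code):

 import re

 def split_camel(tag):
     """#YoAreSocute -> ['Yo', 'Are', 'Socute'], using capitalisation."""
     return re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", tag.lstrip("#"))

 def split_dict(tag, dictionary):
     """Greedy longest-match segmentation against a word list."""
     s, out, i = tag.lstrip("#").lower(), [], 0
     while i < len(s):
         for j in range(len(s), i, -1):   # longest match first
             if s[i:j] in dictionary:
                 out.append(s[i:j])
                 i = j
                 break
         else:
             return None                  # no segmentation found
     return out

 dictionary = {"yo", "are", "so", "cute"}
 print(split_camel("#YoAreSocute"))             # ['Yo', 'Are', 'Socute']
 print(split_dict("#yoaresocute", dictionary))  # ['yo', 'are', 'so', 'cute']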

5. Unknown Characters in Words ( handled like 3 )

   Problem : Sh**, f**k, ki**
   Solution : Solve them the way elongated words were handled, or keep a list of the words people most often censor this way.
                  Further disambiguation can be done on the basis of a language model built from the monolingual corpus of each language in Apertium.
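A sketch of the wildcard matching, treating each '*' as one hidden letter; ranking the returned candidates would then be the job of the language model (the word list is hypothetical):

 import re

 def uncensor(token, dictionary):
     """Return all dictionary words that fit the censored pattern,
     with each '*' standing for exactly one letter."""
     pattern = re.compile(re.escape(token.lower()).replace(r"\*", "[a-z]"))
     return [w for w in dictionary if pattern.fullmatch(w)]

 dictionary = ["ship", "shop", "shot", "funk"]
 print(uncensor("sh**", dictionary))  # ['ship', 'shop', 'shot']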

6. Years

   Problem : Apertium doesn't handle 1980s, 80s, #1980
   Solution : make a simple rule to handle such cases.
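For example, a rule of roughly this shape (the {year} tag is only illustrative; the real rule would map such tokens into Apertium's own lexical-unit format):

 import re

 # Optional '#', then a 2- or 4-digit number, then an optional plural 's'.
 YEAR_RE = re.compile(r"#?(\d{4}|\d{2})s?")

 def tag_year(token):
     """Tag tokens like 1980s, 80s and #1980."""
     return "{year}" if YEAR_RE.fullmatch(token) else token

 for t in ["1980s", "80s", "#1980", "word"]:
     print(t, "->", tag_year(t))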

7. Other Research Ideas :

If one can use Moses to train a system, then one can train it on parallel data that looks like this:

  Source text (abbreviations) : h d w d
  Target text (full forms)    : h a r d w o o d

It will learn the alignments, and then character-level language modelling will narrow down the options for the output words. On top of that we can use a word-based language model to disambiguate further.
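Preparing that character-level parallel data is straightforward; a sketch, with a hypothetical pair list:

 def to_char_corpus(pairs):
     """Turn (abbreviation, full form) pairs into the space-separated
     character format used for training Moses at the character level."""
     src = [" ".join(abbr) for abbr, full in pairs]
     tgt = [" ".join(full) for abbr, full in pairs]
     return src, tgt

 src, tgt = to_char_corpus([("hdwd", "hardwood"), ("bt", "but")])
 print(src[0])  # h d w d
 print(tgt[0])  # h a r d w o o d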

ALTERNATE : Instead of language modelling, take the output which is present in the dictionary and is most widely used in the language; this is possible in Apertium.

8. End Goal

Handle the problems above and solve them with regard to Apertium. Along with this there is the possibility to work on 7., and to report results for the improvement in Apertium's output and also in other MT systems trained on Europarl or the standard datasets reported in WMT'14.