Difference between revisions of "User:Srbhr/GSOC 2020 Proposal: Automatic PostEditing"

Revision as of 22:15, 24 March 2020

Title: Automatic Post-Editing/Improving Language Pairs by Mining Post-Edits

Contact Information

Name: Saurabh Rai

IRC Nick: srbhr

Location: New Delhi, India

Time Zone: UTC+5:30 (IST)

Email: srbh077@gmail.com

GitHub: https://github.com/srbhr

LinkedIn: https://www.linkedin.com/in/saurabh-rai-9370a0194/

Who am I?: I'm an undergraduate computer science student, in my 3rd year at GGS Indraprastha University, New Delhi. I'm interested in machine learning and natural language processing, and I am always looking for ways to improve things with them. I love talking about technology, AI, and Cyberpunk 2077.

FOSS software I have used: My work always involves FOSS software and frameworks, from Python to TensorFlow and from Ubuntu to Arch Linux. I've used many of them and tried to tweak the software that I use, I have contributed to some FOSS frameworks, and I have also taken part in making some open-source projects of my own.

Languages I know: Hindi (native), English

Skills and Knowledge

I'm a computer science student in my 3rd year of college.

Languages that I know:

   1. Python
   2. C/C++
   3. Java
   4. JavaScript

Languages that I'm familiar with:

   1. Julia
   2. Bash

Machine Learning, Deep Learning, and Natural Language Processing.

   ML frameworks that I've worked with: TensorFlow, PyTorch, spaCy, Gensim, NLTK, Flair, AllenNLP, OpenCV, etc.
   Data visualisation and stats libraries: pandas, NumPy, seaborn, Plotly, Matplotlib, SciPy, etc.
   I've taken online as well as offline courses to hone my skills in the field.
   Not only this, I've worked on projects regarding the same, and I'm currently writing a research paper on Information Retrieval and Extraction
   (stalled due to the sudden college closure caused by the coronavirus pandemic).

Courses that I've taken during my college time (only those relevant to the project):

   1. Mathematics (I-IV, it includes stats, calculus(both), and linear algebra)
   2. Computer Programming
   3. AI
   4. Data Structures
   5. Algorithm Design and Analysis

Why is it that you are interested in Machine Translation?

Machine translation is a field that shows how a task like translating between languages can serve people and be done easily, and it also hints at what will be possible in the future. I'm interested in machine translation because I want to discover ways to improve it further and give it more features, and the project I'm interested in does exactly that: improve machine translation by mining post-edits and learning from them.

Why is it that you are interested in Apertium?

Apertium was one of the first machine translation tools, created in the early 2000s in Spain. Since then it has helped a lot of people translate languages, and it is one of the few machine translation engines that works offline and works with low-resource languages. It is also free and open-source software, which allows students to research and work with it. Working with the Apertium team (mentors and community) is a great opportunity to learn and to take a great tool towards better accuracy and new features. Apertium also provides resources to learn and research with, such as morphological dictionaries, transfer rules, and the stream parser, which can be used to build further tools, e.g. an automatic post-editing tool that creates dictionary entries from what it learns.

My Proposal

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in the project Automatic Post-Editing: Mining Post-Editing Dumps (Parallel Corpora) to Improve Translation, under the mentor Mikel L. Forcada (more mentors to be added). The project aims to find the differences between Apertium's machine translation and the human post-edited version, apply appropriate measures and learning algorithms to identify the mistranslations, and turn them into information that can be inserted into that Apertium language pair. This information can be:

  • Dictionaries
  • Constraint Grammar Rules
  • Lexical Selection Rules

Task Description

The main goal of the project Automatic Post-Editing: Mining Post-Editing Dumps (Parallel Corpora) to Improve Translation is to create dictionary entries (monodix, bidix) automatically and as completely as possible by mining post-editing dumps (human-verified parallel corpora), improving the translation quality and performance of an Apertium language pair. A longer-term goal is to automate the process of creating/enriching language pairs that are still in incubation.

The project consists of two phases:

First Phase

In this phase, the data needs to be gathered and converted into a specific format, expressed with these post-editing operators:

  • S : Source Text
  • MT(S): Machine Translation of S
  • PE(S) or PE(MT(S)): The Post-Edited Sentence (Considered to be accurate)

From this, structured data needs to be created, possibly a pandas DataFrame (CSV, JSON) for different languages.
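A minimal sketch of that structured-data step, using the stdlib csv module and one hypothetical triplet taken from the example later in this proposal (a pandas DataFrame could read the same CSV directly with pandas.read_csv):

```python
import csv
import io

# Hypothetical post-editing triplet: source, machine translation, post-edit.
triplets = [
    {"S": "Los marineros oteaban el horizonte.",
     "MT": "The sailors *oteaban the horizon.",
     "PE": "The sailors scanned the horizon."},
]

# Serialise the triplets to CSV so they can be stored per language pair.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["S", "MT", "PE"])
writer.writeheader()
writer.writerows(triplets)
csv_text = buf.getvalue()
```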


Second Phase

In this phase, the morphological data generated by Apertium's stream parser will be used to get the tags/operations for each text. Comparison of PE(S) and MT(S) will then yield information that can be used to improve the Apertium language pair by creating dictionary entries, grammar rules, and lexical selection rules.
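As a rough illustration, reading lexical units out of Apertium's stream format can be sketched with a small regex-based parser; this is only a stand-in for Apertium's real streamparser library, and the example units below are hypothetical:

```python
import re

# A lexical unit in Apertium's stream format looks like
# ^surface/lemma<tag1><tag2>$ (unknown words are marked with '*').
LU = re.compile(r"\^([^/$]+)/([^<$]+)((?:<[^>]+>)*)\$")

def parse_stream(text):
    """Return (surface, lemma, [tags]) for each lexical unit in the stream."""
    return [(surface, lemma, re.findall(r"<([^>]+)>", tags))
            for surface, lemma, tags in LU.findall(text)]

stream = "^sailors/sailor<n><pl>$ ^scanned/scan<vblex><past>$"
units = parse_stream(stream)
```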

Process

The rationale for this process is described at [Rationale[3]].

  • The first step is to get the language pair data and make it available in the required format (S, MT(S), PE(MT(S))).
  • Use the data with the appropriate edit-distance algorithms (explained later) to find where we need to improve, and get the set of triplets where there is need to improve.
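As a sketch of that filtering step, the stdlib's difflib similarity ratio can serve as a quick stand-in for the dedicated edit-distance algorithms discussed later; the threshold value is an assumption:

```python
from difflib import SequenceMatcher

def needs_improvement(mt, pe, threshold=0.95):
    """Flag a triplet whose MT output differs noticeably from its post-edit.

    SequenceMatcher.ratio() returns 1.0 for identical strings, so any
    pair scoring below the threshold is kept for mining.
    """
    return SequenceMatcher(None, mt, pe).ratio() < threshold

triplets = [
    ("Los marineros oteaban el horizonte.",
     "The sailors *oteaban the horizon.",
     "The sailors scanned the horizon."),
]

flagged = [t for t in triplets if needs_improvement(t[1], t[2])]
```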

  • Then, using the streamlined data, our first approach will be to expand the sentences'/words' morphological tags using Apertium's streamparser and convert the data into Apertium's stream format. This will reveal the operators that we are looking for to improve upon. For further ease of use we can store the whole stream, or break down the words and store them in the DataFrame.

  For example, consider the case of the English-Galician pair:
  S: Never engage in action for the sake of reward.
  MT(S): Nunca comprometer en acción para o *sake de recompensa.
  Here "sake" is not translated.
  We can adopt an approach here to find the missing translation for *sake from the parallel corpus of post-edits and try to create an entry for it.
  Consider another Scenario:
  S: Never engage in action for the purpose of reward.
  MT(S): Nunca comprometer en acción para o propósito de recompensa.
  Here, for the same sentence, we have the word "sake" and its synonym "purpose". "sake" doesn't have a translation pair, whereas "purpose -> propósito" does.
  We can use this data to create an entry mapping "sake" to Galician "ben"; take the example "for the sake of" -> "por ben de" in English-Galician.
  But it is highly recommended that a human post-edited parallel corpus be available to verify the data.
  This example was explained by Mikel:
  S: Los marineros oteaban el horizonte.
  MT(S): The sailors *oteaban the horizon.
  PE(S): The sailors scanned the horizon.
  Here we can do scan.v -> otear.v [or scanned.v -> oteaban.v]
  A morphological guess will make it like Spanish verbs with lemmas ending in "ar" have "aban" as their imperfect 3rd person plural.
  Allowing us to build an entry from here itself.
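A rough sketch of that morphological guess, using this same example; the '*' marker for untranslated words is Apertium's convention, while the tag names in the guess are illustrative rather than Apertium's exact tagset:

```python
import re

def untranslated(mt_sentence):
    """Words the MT engine left untranslated are marked with '*'."""
    return re.findall(r"\*(\w+)", mt_sentence)

def guess_ar_lemma(form):
    """Morphological guess for Spanish -ar verbs: a form ending in 'aban'
    is the imperfect 3rd person plural, so the lemma replaces 'aban' with 'ar'.
    Returns (lemma, tags) or None if the guess does not apply."""
    if form.endswith("aban"):
        return form[:-4] + "ar", ("vblex", "pii", "p3", "pl")
    return None

mt = "The sailors *oteaban the horizon."
guesses = {form: guess_ar_lemma(form) for form in untranslated(mt)}
```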


Edit-Distance

In computational linguistics and computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question.
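For instance, the classic Levenshtein distance can be computed with a short two-row dynamic program:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```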

Edit-Distance Algorithms that can be used for this task:

   Levenshtein Distance
   Damerau-Levenshtein Distance
   Jaro Distance
   Jaro-Winkler Distance
   Match Rating Approach Comparison
   Hamming Distance

For Token Comparison:

   Jaccard Index
   Tversky Index
   Overlap Coefficient
   Cosine Similarity, etc.
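A minimal sketch of the set-based token metrics (binary term weights; whitespace tokenisation is an assumption made for the example):

```python
import math

def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B| over token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def overlap(a, b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def cosine(a, b):
    """Cosine similarity over token sets (binary weights)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

mt = "the sailors *oteaban the horizon .".split()
pe = "the sailors scanned the horizon .".split()
```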

All these algorithms are available, in optimised form, through two Python libraries:

  • Jellyfish: https://pypi.org/project/jellyfish/
  • Textdistance: https://pypi.org/project/textdistance/