User:Srbhr/GSOC 2020 Proposal: Automatic PostEditing
Title: Automatic Post-Editing/Improving Language Pairs by Mining Post-Edits
Contact Information
Name: Saurabh Rai
IRC Nick: srbhr
Location: New Delhi, India
Time Zone: UTC+5:30 (IST)
Email: srbh077@gmail.com
GitHub: https://github.com/srbhr
LinkedIn: https://www.linkedin.com/in/saurabh-rai-9370a0194/
Who am I?: I'm a third-year undergraduate Computer Science student at GGS Indraprastha University, New Delhi. I'm interested in Machine Learning and Natural Language Processing, and I'm always looking for ways to use them to improve things. I love talking about technology, AI, and Cyberpunk 2077.
FOSS software I have used: My work always involves FOSS software and frameworks, from Python to TensorFlow and from Ubuntu to Arch Linux. I've used many of them and tried to tweak the software that I use. I have also tried to contribute to some FOSS frameworks, and I have taken part in building some as well.
Languages I know: Hindi (native), English
Skills and Knowledge
I'm a Computer Science student in my 3rd year of college.
Languages that I know:
1. Python
2. C/C++
3. Java
4. JavaScript
Languages that I have worked with and am familiar with:
1. Julia
2. Bash
Machine Learning, Deep Learning, and Natural Language Processing.
ML frameworks that I've worked with: TensorFlow, PyTorch, spaCy, Gensim, NLTK, Flair, AllenNLP, OpenCV, etc. Data visualisation and statistics libraries: pandas, NumPy, seaborn, Plotly, Matplotlib, SciPy, etc. I've taken online as well as offline courses to hone my skills in the field. I've also worked on projects in this area, and I'm currently writing a research paper on Information Retrieval and Extraction (stalled due to the sudden college closure caused by the coronavirus pandemic).
Courses that I've taken during college (only those relevant to the project):
1. Mathematics (I-IV; includes statistics, calculus (both), and linear algebra)
2. Computer Programming
3. AI
4. Data Structures
5. Algorithm Design and Analysis
Why is it that you are interested in Machine Translation?
Machine translation shows how tasks like translating between languages, and serving people by doing so, can be made easy, and it hints at how much more will be possible in the future. I'm interested in machine translation because I want to discover ways to improve it further and add more features, and the project I want to work on does exactly that: it improves machine translation by mining post-edits and learning from them.
Why is it that you are interested in Apertium?
Apertium was one of the first machine translation tools, created in the early 2000s in Spain. Since then it has helped a lot of people translate between languages, and it is one of the few machine translation engines that works offline and also supports low-resource languages. It is also free and open-source software, which lets students research and work with it. Working with the Apertium team (mentors and contributors) is a great opportunity to learn and to help bring such a great tool to better accuracy and new features. Apertium also provides resources to learn from and research with, such as the morphological dictionaries, transfer rules, and stream parser, which can be used to build other tools as well, e.g. an automatic post-editing tool that creates dictionary entries after learning from post-edits.
My Proposal
Which of the published tasks are you interested in? What do you plan to do?
I'm interested in the project Automatic Post-Editing: mining post-editing dumps (parallel corpora) to improve translation, under the mentor Mikel L. Forcada (more mentors to be added). The project aims to find the differences between the translations produced by Apertium and their human post-edited versions, use appropriate measures and learning algorithms to identify the problems/mistranslations behind those differences, and create the dictionary entries the MT tool needs to improve its translations.
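As a rough illustration of the first step (finding where a machine translation and its post-edited version differ), here is a minimal sketch in Python. It is not the final design of the project: it assumes whitespace tokenisation and uses the standard-library difflib module for alignment, and the function names extract_edits and count_replacements as well as the toy sentence pairs are hypothetical, chosen only for this example.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_edits(mt_output, post_edit):
    """Return (operation, mt_tokens, pe_tokens) triples where the sentences differ.

    Assumption: plain whitespace tokenisation is good enough for a sketch.
    """
    mt_tokens = mt_output.split()
    pe_tokens = post_edit.split()
    matcher = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only the spans that the post-editor changed.
        if tag != "equal":
            edits.append((tag, tuple(mt_tokens[i1:i2]), tuple(pe_tokens[j1:j2])))
    return edits


def count_replacements(pairs):
    """Count how often each MT span was replaced by each post-edited span."""
    counts = Counter()
    for mt_output, post_edit in pairs:
        for tag, mt_span, pe_span in extract_edits(mt_output, post_edit):
            if tag == "replace":
                counts[(mt_span, pe_span)] += 1
    return counts


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    pairs = [
        ("the house is big", "the home is big"),
        ("a house near the river", "a home near the river"),
    ]
    for (mt_span, pe_span), n in count_replacements(pairs).most_common():
        print(" ".join(mt_span), "->", " ".join(pe_span), f"({n} times)")
```

Replacements that recur across many sentence pairs would be the candidates to inspect when proposing new dictionary entries; how those candidates are filtered and turned into actual entries is left open here.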

