User:SilentFlame

1 Contact Information
2 Why am I interested in machine translation?
3 Why am I interested in the Apertium project?
4 Which of the published tasks are you interested in?
5 Reasons why Google and Apertium should sponsor it?
6 How and who will it benefit in the society?
7 Detailed work plan
8 Timeline
9 Short self-introduction
10 Coding challenges:
11 Non-Summer of code plans
12 Documentation of the work done in summer

Contact Information

Name: Vinay Kumar Singh
E-mail address: csvinay.d@gmail.com
IRC nick: SilentFlame
Link to Github: https://github.com/SilentFlame

Why am I interested in machine translation?

I have always been fascinated by the idea of helping people from one background understand the culture and heritage of others. That requires understanding their literature, which in turn means translating it into the reader's native language, and this is the area where I have always wanted to work.
Machine translation is one of the most important fields of Natural Language Processing (NLP) and draws on almost every other area of NLP. At the same time it is a task with very practical and perceivable results that benefit everyone. It is because of this interest that I chose to do my Masters in Computational Linguistics, and I am working on a semester project on machine translation and a Masters project on semantic similarity between sentences.

Why am I interested in the Apertium project?

As I said above, I love playing with language, and I value any opportunity to help society at the same time. While looking for projects that combine linguistics and its applications with my programming knowledge, I ended up here at “Automatic blank handling”. Producing quality machine translation is very important, but so is how well we present the computed output, including the cleanness of the output stream. Post-processing of the text is therefore an important part of the process, and that is what I like about this project and about an organisation that gives it importance.

Which of the published tasks are you interested in?

Automatic blank handling

Reasons why Google and Apertium should sponsor it?

Currently, Apertium's text-to-text translation works well, but blank handling works less well when the text is surrounded by tags (both inline and non-inline), as in HTML/XML files. At present, tags keep their position during translation even if the words are rearranged. This is a problem because a tag may end up highlighting or marking the wrong words, which defeats the purpose of tagging and highlighting words in the first place.

Currently, transfer rule writers need to ensure that all and only the input blanks are output in each rule, in the correct order (a three-pattern rule needs to output both, and only, the two blanks between its patterns).

Even if rule writers do everything right, the fact that chunks-containing-blanks can move means they can still end up with invalid formatting (e.g. an inter-chunk rule swapping the order of ^chunk1{^hi<ij>$[<div>]^ho<ij>$}$ ^chunk2{^hi<ij>$[</div>]^ho<ij>$}$ )

Improper assignment of tags to the translated words may sometimes change the meaning of the text, which is a very serious problem.

These issues need to be resolved properly.
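The chunk-swapping problem described above can be checked mechanically. The following is a minimal sketch in Python, not Apertium code: the `tags_balanced` helper and its regular expression are hypothetical, and it assumes a simplified stream in which each superblank holds exactly one tag.

```python
import re

# Matches a superblank holding a single tag, e.g. [<div>] or [</div>].
# Assumption: one tag per superblank, no attributes.
SUPERBLANK_TAG = re.compile(r"\[<(/?)(\w+)>\]")


def tags_balanced(stream: str) -> bool:
    """Return True if the tags inside superblanks open and close in a valid order."""
    stack = []
    for closing, name in SUPERBLANK_TAG.findall(stream):
        if not closing:
            stack.append(name)          # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False                # closing tag with no matching opener
    return not stack                    # leftover openers are also invalid


chunk1 = "^chunk1{^hi<ij>$[<div>]^ho<ij>$}$"
chunk2 = "^chunk2{^hi<ij>$[</div>]^ho<ij>$}$"

original = chunk1 + " " + chunk2
swapped = chunk2 + " " + chunk1   # what an inter-chunk reordering rule would emit

print(tags_balanced(original))  # True
print(tags_balanced(swapped))   # False
```

After the swap, the closing `</div>` appears before its opening `<div>`, which is exactly the kind of invalid output the project aims to prevent.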

How and who will it benefit in the society?
This project is about improving the quality of translation. As mentioned above, inline tags applied to keywords in a sentence are currently not handled properly, and this project will fix that. It will also make sure that the translated file does not end up being invalid HTML/XML.
This project will make Apertium capable of tracking which words were reordered during translation: because words are linked with their inline tags, the reordering will be reflected in the translated version too.
In the end, this project will make Apertium a more powerful machine translation tool and make the translation more efficient, since every word will now store its own inline tags independently. It will help people learn more languages, get good-quality translations, and build a good and correct vocabulary just by observing the inputs and outputs.
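Currently:
“Perro blanco”, with the formatting on “Perro”, becomes “White dog” with the formatting on “White”.
After the project:
“Perro blanco”, with the formatting on “Perro”, becomes “White dog” with the formatting on “dog”.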
It is important for me to learn that “perro” in Spanish means “dog” in English and not “white”, so in this way the project will help people build a correct vocabulary.
The current behaviour rules out accurately finding out which words were reordered during translation. This kind of reordering information would be useful for systems such as MediaWiki's Content Translation, one of the public translation platforms, so we need to update our system accordingly.
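The “Perro blanco” example can be sketched as a toy model in which each word carries its own inline tags, so the tags follow the word when it is reordered. Everything here is hypothetical and only illustrates the idea: the `Word` class, the two-entry dictionary, the reordering function and the choice of a bold tag are not part of Apertium.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    inline_tags: tuple = ()   # tags stored on the word itself


# Toy bilingual dictionary (hypothetical, for illustration only).
TOY_DICTIONARY = {"perro": "dog", "blanco": "white"}


def translate_noun_adj(noun: Word, adj: Word) -> list:
    """Translate Spanish noun+adjective into English adjective+noun order.

    Each translated word keeps the inline tags of its source word.
    """
    return [
        Word(TOY_DICTIONARY[adj.text], adj.inline_tags),
        Word(TOY_DICTIONARY[noun.text], noun.inline_tags),
    ]


# "perro" is bold in the source; "blanco" has no formatting.
source = [Word("perro", ("b",)), Word("blanco")]
target = translate_noun_adj(*source)

tagged = [w.text for w in target if "b" in w.inline_tags]
print([w.text for w in target])  # ['white', 'dog']
print(tagged)                    # ['dog']
```

Because the tag travels with the word, comparing tagged words before and after translation also reveals which words were moved.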
Detailed work plan
Make deformatters include a list of inline tags, and disperse these to the words covered by them.

Make pretransfer disperse tags when splitting lexical units.

Make transfer output the non-inline blanks before the rule output.

Make transfer handle inline-blanks, and ignore 

Make reformat turn inline-blanks back into real tags
[{}]foo [{}]bar should become foo bar

Ensure all other modules are fine with the new format for inline blanks.
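The deformatter and reformatter steps of this plan can be sketched as a round trip. This is only an illustration under stated assumptions: the tag inside `[{}]` is not visible in the example above, so an `<i>` tag written as `[{<i>}]` is assumed here, nested tags are not handled, and the function names `disperse` and `reformat` are hypothetical rather than existing Apertium APIs.

```python
import re
from itertools import groupby

# A non-nested inline span such as <i>foo bar</i>, or a plain word.
TOKEN = re.compile(r"<(\w+)>([^<]*)</\1>|([^<\s]+)")
# A word optionally prefixed with an inline blank such as [{<i>}].
WORD = re.compile(r"(?:\[\{<(\w+)>\}\])?(\S+)")


def disperse(html: str) -> str:
    """Deformatter side: copy each inline tag onto every word it covers."""
    out = []
    for tag, inner, plain in TOKEN.findall(html):
        if tag:
            out.extend(f"[{{<{tag}>}}]{word}" for word in inner.split())
        else:
            out.append(plain)
    return " ".join(out)


def reformat(stream: str) -> str:
    """Reformatter side: merge adjacent words with the same tag into one span."""
    words = [WORD.fullmatch(tok).groups() for tok in stream.split()]
    out = []
    for tag, group in groupby(words, key=lambda pair: pair[0]):
        text = " ".join(word for _, word in group)
        out.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return " ".join(out)


dispersed = disperse("<i>foo bar</i> baz")
print(dispersed)                                  # [{<i>}]foo [{<i>}]bar baz
print(reformat(dispersed))                        # <i>foo bar</i> baz
print(reformat("[{<i>}]bar baz [{<i>}]foo"))      # <i>bar</i> baz <i>foo</i>
```

The last call shows why dispersing helps: even if a tagged word is moved away from its neighbour, each word still produces a well-formed tag pair of its own.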
Timeline
Community bonding period: Understand the entire module dependency hierarchy and the workflow of the modules, and study the resources required for the project. Play around with Apertium.
Week 1-4
Getting the transfer module done

Make transfer output the non-inline blanks before the rule output.

transfer handle inline-blanks, and ignore .
and merge the work from transfer.cc into interchunk.cc and postchunk.cc.
Deliverable-1
Completed implementation of the above, including the modifications to the transfer rules, with all testing done.
Week 5-6
Improving the deformatter algorithm written for the coding challenge, using insights from the existing prototype, and adding tests so that improvement and testing proceed in parallel.
Integrating the implemented algorithm into the reformatter/deformatter and moving on along the translation chain.
Week 7-8

Working on lt-proc to make it correctly disperse inline blanks onto each lexical unit until the next “[”.

Working on lt-proc generation, which converts a lexical form into the corresponding surface form, so that it correctly attaches the inline blanks to the lexical units they belong to, adding “[]” so that the reformatter knows where each inline tag needs to be closed.
For example:
$ [{}]^beautiful<adj>$ [{}]^mind<n><sg>$ -> [{}]beautiful[] [{}]mind[]
Desired output after lt-proc generation:
[{}]beautiful[] [{}]mind[]
Working on the analysis part of lt-proc, which converts surface forms into the set of possible lexical forms, so that it correctly disperses each inline blank onto every lexical unit it covers.
For example:
Input: one, two and three
Desired output after deshtml and lt-proc analysis:
[][]^one/one<num><sg>/one<prn><tn><mf><sg>$^,/,<cm>$ [{}]^two/two<num><sp>$ [{}]^and/and<cnjcoo>$ ^three/three<num><sp>$[][<\/p> ]
Deliverable-2

Complete the work on the formatter and lt-proc modules to bring them to a decent, integrable level.

Start writing tests and running them.
Week 9-13

Final deliverable.

Cleaning up the code and fixing minor and major bugs.

Complete integration of the modules.

Writing all the tests and running them.

Ensuring all the non-modified modules are working fine with the new/edited ones.

Documenting the work done in the project.
Short self-introduction
I am a third-year student in the Dual Degree (B.Tech + MS) program in Computer Science and Engineering at IIIT-Hyderabad, India. I am doing my Masters in Computational Linguistics on the topic “semantic similarity between sentences”, and I am currently also doing a semester project in machine translation.
I am fluent in C/C++, Python, HTML, CSS and MySQL, and know a bit of JavaScript.
I am highly interested in Natural Language Processing and its sub-fields, which is why I am pursuing my Masters in this field. I am also fascinated by how practical NLP is in day-to-day life.
Open source contributions

I have contributed fixes for some minor bugs to sympy

added some language text data to cltk org

some contributions to FOSSASIA for their open-event web app and loklak-server

sagemath for some additions in their website
Merged open-source contributions.
College projects (Non-open source contributions)

News popularity predictions: predicting how popular an online article (news or story) would be before its publication by analysing several statistical characteristics extracted from it.

Fact checker: a Python script that checks whether a given fact is "True" or "False".

Client-Server FTP application

Ultimate Tic-Tac-Toe (Artificial Intelligence)

Donkey_Kong (a Pac-Man-like game) in Python

C-shell

Elektoo (A choice/search engine developed in MEAN framework)
I will be staying at the college over the summer for my MS project. Apart from that I have no prior commitments, so I will be able to devote most of my time (6-8 hours a day, 6 days a week) to the project. I will also be taking a one-week vacation in July.
Coding challenges:
Deformatting

Challenge-1

Challenge-2
Pretransfer

Code-cleanup
Transfer
Build the first proof-of-concept
Non-Summer of code plans
I will be taking a one-week vacation in July.
Documentation of the work done in summer
Progress

Wiki of the work done
