Difference between revisions of "User:SilentFlame/proposal"

Revision as of 15:01, 2 April 2017

Contact Information

Name: Vinay Kumar Singh
E-mail address: csvinay.d@gmail.com
IRC nick: SilentFlame
Link to Github: https://github.com/SilentFlame

Why am I interested in machine translation?

I have always been fascinated by the idea of helping people from one background understand the culture and heritage of others. That requires understanding their literature, which in turn means translating it into the reader's native language, and this is the area where I have always wanted to work.
Machine translation is one of the most important fields of Natural Language Processing (NLP) and draws on almost all other areas of NLP. At the same time it is a task with very practical and perceivable results that benefit everyone. It is because of this interest that I am pursuing my Masters in Computational Linguistics and have taken up a semester project on machine translation and a masters project on semantic similarity between sentences.

Why am I interested in the Apertium project?

As I said above, I love working with language, and I value any opportunity to help society at the same time. While looking at projects that combine linguistics and its applications with my programming knowledge, I ended up here at “Automatic blank handling”. Producing quality machine translation is very important, but so is how well we present the computed output, including the cleanliness of the output stream. Post-processing of the text is therefore an important part of the process, and that is what I like about this project and about an organisation that gives it importance as well.

Which of the published tasks are you interested in?

Automatic blank handling

Reasons why Google and Apertium should sponsor it?

Currently, Apertium's text-to-text translation works well, but it handles blanks poorly when the text is surrounded by tags (both inline and non-inline), as in HTML/XML files. At present, tags keep their position after translation even if the words are rearranged. This is a problem because the tags may end up highlighting or marking the wrong words, which defeats the purpose of tagging those words in the first place.

  • Currently, transfer rule writers need to ensure that all and only the input blanks are output in each rule, in the correct order (a three-pattern rule needs to output both, and only, the two blanks between its patterns)
  • Even if rule writers do everything right, the fact that chunks-containing-blanks can move means they can still end up with invalid formatting (e.g. an inter-chunk rule swapping the order of ^chunk1{^hi<ij>$[<div>]^ho<ij>$}$ ^chunk2{^hi<ij>$[</div>]^ho<ij>$}$ )
  • Improper assignment of tags to the translated words may sometimes change the meaning of the text, which is a very serious problem.

These issues need to be resolved properly.
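The invalid-formatting case from the second point can be reproduced with a small check. This is only an illustrative sketch, not part of Apertium: the helper name `blanks_are_balanced` is hypothetical, and it assumes blanks hold simple attribute-free tags such as `[<div>]`.

```python
import re

# Assumption: formatting tags sit inside superblanks like [<div>] or [</div>],
# with no attributes. Morphological tags such as <ij> are not inside [...],
# so the pattern does not match them.
TAG_RE = re.compile(r"\[<(/?)(\w+)>\]")

def blanks_are_balanced(stream: str) -> bool:
    """Return True if the tags found in the blanks open and close in a valid order."""
    stack = []
    for closing, name in TAG_RE.findall(stream):
        if not closing:
            stack.append(name)          # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False                # closing tag with no matching open tag
    return not stack                    # every opened tag must have been closed

chunk1 = "^chunk1{^hi<ij>$[<div>]^ho<ij>$}$"
chunk2 = "^chunk2{^hi<ij>$[</div>]^ho<ij>$}$"

original = chunk1 + " " + chunk2   # <div> ... </div>
swapped = chunk2 + " " + chunk1    # </div> ... <div>, as after an inter-chunk swap

print(blanks_are_balanced(original))
print(blanks_are_balanced(swapped))
```

In the swapped order the closing tag arrives before the opening one, so the check fails even though each rule output exactly the blanks it received.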


How and who will it benefit in the society?

This project improves the quality of translation output. As mentioned above, inline tags applied to keywords in a sentence are currently not handled properly, and this project fixes that. It will also make sure that the translated file does not end up being invalid HTML/XML.

This project will make Apertium capable of tracking which words were reordered during translation: because words are linked with their inline tags, the reordering will be reflected in the translated version too.

In the end, this project will make Apertium a more powerful machine translation tool that is more reliable on formatted text, since every word will carry its own inline tags independently. It will help people learn more languages, get good-quality translations, and build a good and correct vocabulary just by observing the inputs and outputs.

Currently:

<i>Perro</i> <b>blanco</b> becomes <i>White</i> <b>dog</b>

After the project:

<i>Perro</i> <b>blanco</b> becomes <b>White</b> <i>dog</i>

It’s important for me to learn that “perro” in Spanish means “dog” in English and not “white”, so in this way the project will help people build a correct vocabulary.

Currently there is no way to accurately find out which words were reordered during translation. This kind of reordering information would be useful for systems like MediaWiki's Content Translation, one of the public translation forums. Hence we need to update our system to provide it.
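The idea behind the example above can be sketched as follows. This is a toy model, not Apertium code: the `Word` class, the two-entry `LEXICON`, and the "reverse the words" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Word:
    surface: str
    tags: tuple   # inline tags wrapping this word, outermost first
    origin: int   # position of the word in the source sentence

def render(words):
    """Wrap every word in its own inline tags and join the words with spaces."""
    out = []
    for w in words:
        opening = "".join(f"<{t}>" for t in w.tags)
        closing = "".join(f"</{t}>" for t in reversed(w.tags))
        out.append(f"{opening}{w.surface}{closing}")
    return " ".join(out)

# Hypothetical two-word lexicon, capitalised to match the example above.
LEXICON = {"Perro": "dog", "blanco": "White"}

source = [Word("Perro", ("i",), 0), Word("blanco", ("b",), 1)]

# Toy "noun adjective -> adjective noun" rule: the tags travel with the words.
translated = [Word(LEXICON[w.surface], w.tags, w.origin) for w in reversed(source)]

print(render(translated))               # <b>White</b> <i>dog</i>
print([w.origin for w in translated])   # [1, 0]
```

Because each word keeps its tags and its source position, the output carries both the correct formatting and the reordering information a tool like Content Translation could use.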

Detailed work plan

  • Make deformatters include a list of inline tags, and disperse these to the words covered by them.
  • Make pretransfer disperse tags when splitting lexical units.
  • Make transfer output the non-inline blanks before the rule output.
  • Make transfer handle inline-blanks, and ignore <b pos="N">
  • Make reformat turn inline-blanks back into real tags
    • [{<i>}]foo [{<i><b>}]bar should become <i>foo</i> <i><b>bar</b></i>
  • Ensure all other modules are fine with the new format for inline blanks.
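The reformatting step in the list above could look roughly like this. It is a minimal sketch under stated assumptions: the function names are hypothetical, inline tags have no attributes, each inline blank applies to the single whitespace-delimited word that follows it, and an optional `[]` closing marker is simply consumed.

```python
import re

# An inline blank [{<i><b>}], the word it covers, and an optional [] closing marker.
INLINE_BLANK_RE = re.compile(r"\[\{((?:<\w+>)+)\}\]([^\s\[]+)(?:\[\])?")

def closing_tags(opening: str) -> str:
    """Build the closing tags for a run of opening tags, innermost first."""
    names = re.findall(r"<(\w+)>", opening)
    return "".join(f"</{n}>" for n in reversed(names))

def reformat_inline(text: str) -> str:
    """Turn every inline blank back into real opening and closing tags."""
    return INLINE_BLANK_RE.sub(
        lambda m: m.group(1) + m.group(2) + closing_tags(m.group(1)),
        text,
    )

print(reformat_inline("[{<i>}]foo [{<i><b>}]bar"))
# <i>foo</i> <i><b>bar</b></i>
```

A real reformatter would probably also merge adjacent words that share the same tags back into one element; this sketch leaves each word wrapped separately, as in the example above.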

Timeline

Community bonding period: Understand the entire module dependency hierarchy and the workflow of the modules, and study the resources required for the project. Play around with Apertium.

Week 1-4
Getting the transfer module done

  • Make transfer output the non-inline blanks before the rule output.
  • Make transfer handle inline-blanks, and ignore <b pos="N">.

Then merge the work from transfer.cc into interchunk.cc and postchunk.cc.
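The intended transfer behaviour can be sketched on a toy rule. This is not how transfer.cc is written; the function name `apply_swap_rule`, the regular expressions, and the simplified stream syntax are assumptions for illustration only.

```python
import re

# A lexical unit together with the inline blank attached to it, if any.
UNIT_RE = re.compile(r"(?:\[\{[^}]*\}\])?\^[^$]*\$")
# A non-inline blank: [...] whose content does not start with "{".
NON_INLINE_RE = re.compile(r"\[(?!\{)[^\]]*\]")

def apply_swap_rule(matched: str) -> str:
    """Toy rule: reverse the matched units, emitting non-inline blanks first."""
    blanks = NON_INLINE_RE.findall(matched)   # output once, ahead of the rule output
    units = UNIT_RE.findall(matched)          # inline blanks move with their unit
    return "".join(blanks) + " ".join(reversed(units))

stream = "[{<i>}]^perro<n><sg>$[<br>] [{<b>}]^blanco<adj>$"
print(apply_swap_rule(stream))
# [<br>][{<b>}]^blanco<adj>$ [{<i>}]^perro<n><sg>$
```

With this split, rule writers would no longer have to place every blank by hand: non-inline blanks are flushed before the output, and inline blanks follow their words wherever the rule puts them.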

Deliverable-1
A compiling implementation of the changes above, the corresponding modifications to the transfer rules, and completed tests.

Week 5-6

Improve the deformatter algorithm written for the coding challenge, drawing on insights from the existing prototype, and add tests so that improvement and testing proceed in parallel.

Integrate the implemented algorithm into the re/deformatter and move on to the rest of the translation chain.

Week 7-8

  • Work on lt-proc so that it correctly disperses inline blanks onto each lexical unit until the next “[”.
  • Work on lt-proc generation, which converts a lexical form into the corresponding surface form, so that it correctly attaches inline blanks to the lexical units they belong to and adds “[]” to tell the reformatter where each inline tag needs to be closed.

For example:

Input: [{<b><em>}]^beautiful<adj>$ [{<em><i>}]^mind<n><sg>$

Desired output after lt-proc generation:

[{<b><em>}]beautiful[] [{<em><i>}]mind[]
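The generation example above can be mimicked with a toy generator. This is only a sketch: the `GENERATOR` dictionary stands in for a compiled dictionary, the function name is hypothetical, and prefixing unknown forms with `#` is a simplification chosen for this sketch.

```python
import re

# Hypothetical stand-in for a generation dictionary: lexical form -> surface form.
GENERATOR = {"beautiful<adj>": "beautiful", "mind<n><sg>": "mind"}

# An optional inline blank followed by one lexical unit ^...$.
LU_RE = re.compile(r"(\[\{[^}]*\}\])?\^([^$]*)\$")

def generate(stream: str) -> str:
    """Replace each lexical unit by its surface form, keeping its inline blank."""
    def repl(match):
        blank = match.group(1) or ""
        lexical = match.group(2)
        surface = GENERATOR.get(lexical, "#" + lexical)  # unknown forms are only marked
        closer = "[]" if blank else ""   # tells the reformatter where the tags end
        return f"{blank}{surface}{closer}"
    return LU_RE.sub(repl, stream)

print(generate("[{<b><em>}]^beautiful<adj>$ [{<em><i>}]^mind<n><sg>$"))
# [{<b><em>}]beautiful[] [{<em><i>}]mind[]
```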
  • Work on the analysis part of lt-proc, which converts surface forms into the set of possible lexical forms, so that it correctly disperses each inline blank onto every lexical unit it covers.

For example:

Input: <p>one, <i>two and</i> three</p>

Desired output after deshtml and lt-proc analysis:

[][<p>]^one/one<num><sg>/one<prn><tn><mf><sg>$^,/,<cm>$ [{<i>}]^two/two<num><sp>$ [{<i>}]^and/and<cnjcoo>$ ^three/three<num><sp>$[][<\/p>
]
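The dispersal half of this example (without the morphological analysis) can be sketched as below. It is a simplified illustration, not the deshtml implementation: the `INLINE` tag set, the function name, and the tokenizer are assumptions, tags are assumed to be attribute-free and well nested, and the escaping of `/` and the empty `[]` blanks seen in the real output are left out.

```python
import re

# Assumption: which tag names count as inline; a real deformatter would read
# this list from its format description.
INLINE = {"i", "b", "em", "u", "a", "span"}

# A tag, a run of whitespace, or a word (anything else up to the next tag or space).
TOKEN_RE = re.compile(r"<(/?)(\w+)>|(\s+)|([^<\s]+)")

def disperse(html: str) -> str:
    """Copy the currently open inline tags onto every word they cover."""
    open_inline = []
    out = []
    for closing, name, space, word in TOKEN_RE.findall(html):
        if name:
            if name in INLINE:
                if not closing:
                    open_inline.append(name)
                elif open_inline and open_inline[-1] == name:
                    open_inline.pop()
            else:
                out.append(f"[<{closing}{name}>]")   # non-inline tag: ordinary blank
        elif space:
            out.append(space)
        else:
            if open_inline:
                tags = "".join(f"<{t}>" for t in open_inline)
                out.append("[{" + tags + "}]")       # inline blank for this word
            out.append(word)
    return "".join(out)

print(disperse("<p>one, <i>two and</i> three</p>"))
# [<p>]one, [{<i>}]two [{<i>}]and three[</p>]
```

Each word inside `<i>...</i>` gets its own copy of the inline blank, so later modules can reorder the words without losing track of which ones were italic.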

Deliverable-2

  • Complete the work on the formatter and lt-proc modules to bring them to a decent, integrable level.
  • Start writing tests and running them.

Week 9-13

  • Final deliverable.
  • Cleaning up the code and fixing minor/major bugs.
  • Completing the integration of the modules.
  • Writing all the tests and running them.
  • Ensuring all the non-modified modules work fine with the new/edited ones.
  • Documenting the work done in the project.

Short self-introduction

I am a 3rd-year dual-degree (B.Tech + MS) student in Computer Science and Engineering at IIIT-Hyderabad, India. I’m doing my Masters in Computational Linguistics on the topic “semantic similarity between sentences” and am currently also working on a semester project in machine translation.

I am fluent in C/C++, Python, HTML, CSS, MySQL and a bit of JavaScript.

I am highly interested in Natural Language Processing and its sub-fields, which is why I’m pursuing my Masters in this field. What fascinates me is the very practical use of NLP in day-to-day life.

Open source contributions

  • Fixed some minor bugs in SymPy
  • Added some language text data to the CLTK organisation
  • Made some contributions to FOSSASIA for their open-event web app and loklak-server
  • Made some additions to the SageMath website

Merged open-source contributions.

College projects (Non-open source contributions)

I will be staying at college over the summer for my MS project. Apart from that I have no prior commitments, so I will be able to devote most of my time (6-8 hours a day, 6 days a week) to the project. I will also be taking a one-week vacation in July.

Coding challenges:

Deformatting

Pretransfer

Transfer

Non-Summer of code plans

I will be taking a one-week vacation in July.