User:Rahul/GSOCApplication



Introduction

Name: Rahul Agarwal
Email: rahul.agarwal.90@gmail.com, rahul_agarwal@students.iiit.ac.in
Phone: +91-9963145450
IRC : rahul1@freenode

Why is it you are interested in machine translation?

I am fascinated by how languages differ in their syntactic and semantic properties. I was not particularly excited about machine translation until two years ago, when I had to choose a field for my major. I talked to many people from different areas and finally found my interest in machine translation. Some languages are much more complicated than others because they are very flexible. I have also worked on a machine translation project at my research lab. Machine translation tools are also very helpful for people facing language barriers, so working on them is a way to give something back to society.


Why is it that you are interested in the Apertium project?

I find Apertium a perfect platform for working on machine translation problems, as it offers a wide variety of challenges that arise during machine translation, for example lexical transfer and reordering. Also, when I talked to people who have been working on Apertium for a long time, I found them to be very knowledgeable and helpful, so it would be a great opportunity for me to work with them and learn more in this area.


Which of the published tasks are you interested in? What do you plan to do?

Title

Tree-Based reordering

Abstract

There are dependency parsers based on constraint grammar for a few languages which Apertium would like to treat (e.g. the Sámi languages and Faroese). It might be a nice idea to be able to do reordering before transfer (or during transfer) based on the dependency tree. This would not do lexical transfer, concordance or anything else, just LU (lexical unit) reordering.

Reasons why Google and Apertium should sponsor?

Apertium currently has the Apertium-transfer module, which also performs reordering, but only on the basis of syntactic tags. This might not be enough when reordering a language with long-distance dependencies. Using dependency relations along with syntactic tags might therefore prove beneficial for reordering, as Apertium already has dependency parsers for some languages (e.g. the Sámi languages and Faroese). This module can be plugged in before Apertium-transfer to improve the reordering.

How and who will it benefit in society?

This module can be very useful in improving the quality of machine translation, especially for people using languages with long-distance dependencies. Machine translation is very helpful for society, and improving the quality of translation would improve the translated text, which in turn makes it easier for users to understand.

List of skills

I am proficient mostly in high-level programming languages such as C, C++ and Java, as well as scripting languages such as Python, Perl and PHP.

Evidence of your qualifications

I am an MS by Research student at the International Institute of Information Technology, Hyderabad (IIIT-H). I am currently in the final semester of my B.Tech. and plan to pursue my MS in Language Technologies. I work as a Research Assistant at LTRC (Language Technology Research Centre), based at IIIT. I have worked on several research projects over the last two years; the projects I undertook are listed below. I also worked as a summer intern at LTRC in summer 2009 and summer 2010, and I attended a conference on NLP (ICON 2010). My interest has generally been in machine translation. For one project I worked on statistical machine translation using discriminative approaches, and I have already tried tree-based reordering using a statistical approach. It will be interesting to use a rule-based approach as well, since rules are promising at least in terms of precision.

Current field of study and major

Currently I am working in the field of language technology.

Some of the projects which I have undertaken are:
1) Automatic Identification of Clause Boundary
2) Automatic selection of preposition senses for source preposition in their target counter-part for English-Hindi MT
3) Machine Translation using Discriminative approach
4) Rule based and Automatic validation of Hindi Treebank


Courses undertaken:
1) Natural language Processing
2) NLP Applications: (Machine Translation and Information Extraction)
3) Artificial Intelligence
4) Computational Linguistics
5) Information Retrieval
6) Data structures
7) Computer Programming
8) Software Engineering
9) Statistical Methods in Artificial Intelligence

Experience in open-source projects

I have worked on Drupal, which I found very exciting. My interest then shifted towards language technology, which is now my major. I think working on open-source projects is very helpful for increasing your knowledge, and you also get to talk to a lot of intelligent people in the area. GSoC provides a great opportunity to achieve these things. I became interested in Apertium because it is not limited to one language pair but is flexible enough to incorporate as many language pairs as we want. If I get a chance to work on Apertium and everything goes well, I would also like to continue working on Apertium in the future.

Non-Summer-of-Code plans you have for the summer, especially employment and class-taking

I might not be available from 11 May to 22 May as I have some urgent work, but I am willing to make up for that time by increasing my work hours in other weeks. I am also attending a summer school on advanced natural language processing at my college from 23 May 2011 to 5 June 2011, but that won't affect my work for GSoC. Apart from this, GSoC will be my only commitment for the summer.


Work plan (including, if possible, a brief schedule with milestones and deliverable)

In this project, I will first have to modify cg-proc (the constraint grammar processor) so that it can output dependency analyses. cg-proc brings Constraint Grammar (CG) into the Apertium MT platform, but currently it does not output the dependency analysis. The dependency analysis is a must, since it will be used along with the syntactic tags for reordering.

Secondly, I will create a separate module which, given the tags and the dependency analysis (obtained from the first step), builds a tree. Once we have a tree and manually written rules (according to the syntax of the target language), I will use features such as dependency relations and the order of syntactic tags to reorder the source dependency tree. Reordering will also involve prioritizing: a rule is applied only if it is the closest match to the current pattern. For example, if we have rules for 'det', 'noun' and 'det noun' and need to match them against the pattern 'det noun', the third rule is activated because it matches the pattern more closely than the other two.
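The prioritizing step could be sketched roughly as follows. This is only an illustration: it assumes that "closest match" means the longest rule whose tags match the start of the current pattern, and the name select_rule is hypothetical rather than part of Apertium.

```python
# Hypothetical rule patterns, taken from the 'det' / 'noun' / 'det noun' example.
RULES = [("det",), ("noun",), ("det", "noun")]

def select_rule(rules, tags):
    """Return the longest rule that matches the start of `tags`, or None.

    Assumption: 'closest match' is read as 'longest matching rule'.
    """
    tags = tuple(tags)
    candidates = [rule for rule in rules if tags[:len(rule)] == tuple(rule)]
    return max(candidates, key=len, default=None)

print(select_rule(RULES, ["det", "noun"]))
```

With the pattern 'det noun', both 'det' and 'det noun' match, and the longer rule is chosen.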

Some important things to which I will pay special attention are superblanks (formatting in Apertium), partial sentences/parses, and defining an XML format for the rules which makes sense and also draws on experience from other modules in Apertium. Partial parses are not complete or reliable, so reordering can only be performed when the input is sensible; the main idea is to keep precision high, even at the cost of recall. I already have some ideas about handling superblanks, which I shared with one of my possible project mentors, Francis Tyers.

I have prepared the work plan assuming the maximum time each task may take. If a week's task finishes before the end of its assigned interval, I will start working on the next week's task.


Week | Date | Task
NA | 01/04/23-01/05/23 | Community Bonding Period
1 | 01/05/31 | Try out how exactly Apertium works, especially the vislcg3 module (understand the code and find out where the change is to be made); also do some analysis of the sme-nob language pair.
2 | 01/06/07 | Make changes in cg-proc so that it can also provide the dependency analysis.
3 | 01/06/14 | Test whether the above changes provide the same dependency analysis; also do some regression testing to ensure cg-proc still functions the same as before.
4 | 01/06/21 | Come up with an XML syntax to represent the rules, and adopt a method for handling superblanks, about which I have already thought.
5 | 01/06/28 | Decide how exactly partial parses are to be handled; to do this I will perform some analysis of partial parses.
6, 7 | 01/07/05 | Start building the module that builds trees from the above dependency analysis and tags.
8, 9 | 01/07/19 | Plan how exactly the reordering will take place in the trees. Once that is done, implement it.
10 | 01/08/02 | Test the above module and fix any bugs found.
11 | 01/08/09 | Clean up my work and thoroughly test the whole module.
12 | 01/08/16 | Evaluation, module release


Example of how the module will be useful

Many Bengali poets have sung songs in praise of this land.

Dependency Parse Tree

amod (poets-3, Many-1)
nn (poets-3, Bengali-2)
nsubj (sung-5, poets-3)
aux (sung-5, have-4)
dobj (sung-5, songs-6)
prep_in (sung-5, praise-8)
det (land-11, this-10)
prep_of (praise-8, land-11)
vmain (NULL-0, sung-5)


[Figure: Tree.jpg]

The module for making the tree

Once I get the dependency analysis from cg-proc, I will use it to build the dependency tree. Note that a child node is placed further left the earlier it occurs in the sentence. For example, among the children of sung-5 in the dependency tree, poets-3 is at the leftmost position because it has the lowest sentence position of all the children of sung-5, and praise-8 is at the rightmost position; all the child nodes are ordered accordingly. This way, rules can be formulated with reference to the order in which nodes with particular syntactic tags occur under a particular parent in the target language. Example of reordering a particular node: Tree2.jpg
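A minimal sketch of this tree-building step, assuming the dependency analysis arrives as lines of the form label (head-index, dependent-index) as in the example parse. The actual output format of the modified cg-proc is not fixed yet, and build_tree is a hypothetical name.

```python
import re
from collections import defaultdict

# Relations copied from the example parse: label (head-index, dependent-index).
RELATIONS = """\
amod (poets-3, Many-1)
nn (poets-3, Bengali-2)
nsubj (sung-5, poets-3)
aux (sung-5, have-4)
dobj (sung-5, songs-6)
prep_in (sung-5, praise-8)
det (land-11, this-10)
prep_of (praise-8, land-11)
vmain (NULL-0, sung-5)
"""

# Assumed line format; tolerant of extra spaces around brackets and commas.
LINE = re.compile(r"(\w+)\s*\(\s*(\w+)-(\d+)\s*,\s*(\w+)-(\d+)\s*\)")

def build_tree(text):
    """Return (children, label, word) maps keyed by sentence position."""
    children = defaultdict(list)
    label, word = {}, {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a relation line
        rel, _head_word, head, dep_word, dep = match.groups()
        children[int(head)].append(int(dep))
        label[int(dep)] = rel
        word[int(dep)] = dep_word
    for kids in children.values():
        kids.sort()  # leftmost child = lowest sentence position
    return children, label, word

children, label, word = build_tree(RELATIONS)
print([f"{word[k]}-{k}" for k in children[5]])
```

For sung-5 this lists poets-3 first and praise-8 last, matching the ordering convention described above.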

For this reordering, a rule (Rule#4) has been formulated, which is given below. If we analyze the nodes after reordering, they still follow the same convention I used to build the tree in the first place, so printing the tree also becomes an easy task.

Note

1. In Hindi, the head usually comes at the rightmost position, after its children. The exception is when a verb is the head and has an 'aux' as one of its children; in that case the 'aux' is at the rightmost position.
2. When we reorder a node, its whole subtree is moved with it.
3. prep_* can be prep_in, prep_of or any other prepositional phrase relation.
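The second note can be illustrated with a small sketch that collects everything that has to move together with a node. The CHILDREN map is written out by hand from the example parse, and subtree is a hypothetical helper name.

```python
# Children of each head, by sentence position (taken from the example parse).
CHILDREN = {0: [5], 5: [3, 4, 6, 8], 3: [1, 2], 8: [11], 11: [10]}

def subtree(node):
    """Return every sentence position in the subtree rooted at `node`, sorted."""
    found = [node]
    for child in CHILDREN.get(node, []):
        found.extend(subtree(child))
    return sorted(found)

print(subtree(11))  # land-11 together with this-10
print(subtree(8))   # praise-8 together with land-11 and this-10
```

Moving land-11 therefore carries this-10 along, and moving praise-8 carries both.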

Example Rules

For now, I think this is a very simple and good way to represent the rules. They can easily be applied to the tree which I will build using the module. In the future, the rule format might change if needed.

Rule#1

<reorder>
  <pattern>
    <child><pattern-item n="amod"/></child>
    <child><pattern-item n="nn"/></child>
    <head><pattern-item n="nsubj"/></head>
  </pattern>
  <out>
    <clip pos="1"/>  <!-- amod -->
    <clip pos="2"/>  <!-- nn -->
    <clip pos="3"/>  <!-- nsubj -->
  </out>
</reorder>

This rule doesn't reorder the sentence. It is not actually required; it is given just as an example.

Rule#2

<reorder>
  <pattern>
    <child><pattern-item n="det"/></child>   
    <head><pattern-item n="prep_*"/></head>  
  </pattern>
  <out>
    <clip pos="1"/>  <!-- det -->
    <clip pos="2"/>  <!-- prep_* -->
  </out>
</reorder>

This rule also doesn't change the order of the sentence. Neither Rule#1 nor Rule#2 is actually required; they are given just as examples.

Rule#3

<reorder>
  <pattern>
    <head><pattern-item n="prep_*"/></head>  
    <child><pattern-item n="prep_*"/></child> 
  </pattern>
  <out>
    <clip pos="2"/>  <!-- prep_* -->
    <clip pos="1"/>  <!-- prep_* -->
  </out>
</reorder>

This rule reorders the sentence: the nodes "praise-8" and "land-11" exchange their positions. The subtree of land-11, i.e. this-10, is moved along with it.
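As a rough sketch of how a rule in this draft format could be read, the snippet below parses Rule#3 with Python's standard xml.etree.ElementTree module. The helper name parse_rule is hypothetical, and the rule format itself may still change.

```python
import xml.etree.ElementTree as ET

# Rule#3 in the draft format proposed here.
RULE_3 = """
<reorder>
  <pattern>
    <head><pattern-item n="prep_*"/></head>
    <child><pattern-item n="prep_*"/></child>
  </pattern>
  <out>
    <clip pos="2"/>
    <clip pos="1"/>
  </out>
</reorder>
"""

def parse_rule(xml_text):
    """Return (pattern, order): the (role, tag) slots and the 1-based output order."""
    root = ET.fromstring(xml_text.strip())
    pattern = [(slot.tag, slot.find("pattern-item").get("n"))
               for slot in root.find("pattern")]
    order = [int(clip.get("pos")) for clip in root.find("out").findall("clip")]
    return pattern, order

pattern, order = parse_rule(RULE_3)
reordered = [pattern[pos - 1] for pos in order]
print(reordered)
```

The reordered list puts the child slot before the head slot, which is the exchange of praise-8 and land-11 described above.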

Rule#4

<reorder>
  <pattern>
    <child><pattern-item n="nsubj"/></child>
    <child><pattern-item n="aux"/></child> 
    <head><pattern-item n="vmain"/></head>   
    <child><pattern-item n="dobj"/></child>
    <child><pattern-item n="prep_*"/></child>
  </pattern>
  <out>
    <clip pos="1"/>  <!-- nsubj -->
    <clip pos="5"/>  <!-- prep_* -->
    <clip pos="4"/>  <!-- dobj -->
    <clip pos="3"/>  <!-- vmain -->
    <clip pos="2"/>  <!-- aux -->
  </out>
</reorder>

This rule reorders the remaining part of the tree.
Final sentence after reordering, in Hindi syntax:

Many Bengali poets this land of praise in songs sung have.
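The whole example can be reproduced with a short sketch. It hard-codes the example tree, applies the clip order of Rule#4 (1, 5, 4, 3, 2) to the children of the verb, and puts children before the head everywhere else, as in Rule#1 to Rule#3. Two things here are assumptions made only for illustration: the names TREE, VMAIN_ORDER, slot and linearize, and the choice to print the preposition collapsed into a prep_X label after its noun, which is inferred from the final sentence above.

```python
# Example tree: position -> (word, relation to head, children in sentence order).
TREE = {
    5: ("sung", "vmain", [3, 4, 6, 8]),
    3: ("poets", "nsubj", [1, 2]),
    1: ("Many", "amod", []),
    2: ("Bengali", "nn", []),
    4: ("have", "aux", []),
    6: ("songs", "dobj", []),
    8: ("praise", "prep_in", [11]),
    11: ("land", "prep_of", [10]),
    10: ("this", "det", []),
}

# Output order under a vmain head, following the clip order of Rule#4;
# "HEAD" marks where the verb itself is emitted.
VMAIN_ORDER = ["nsubj", "prep_*", "dobj", "HEAD", "aux"]

def slot(rel):
    """Map prep_in, prep_of, ... onto the wildcard tag used in the rules."""
    return "prep_*" if rel.startswith("prep_") else rel

def linearize(pos):
    """Return the words of the subtree at `pos` in target order."""
    word, rel, kids = TREE[pos]
    # Assumption: the preposition collapsed into prep_X follows its noun.
    own = [word, rel[len("prep_"):]] if rel.startswith("prep_") else [word]
    out = []
    if rel == "vmain":
        for tag in VMAIN_ORDER:
            if tag == "HEAD":
                out += own
                continue
            for kid in kids:
                if slot(TREE[kid][1]) == tag:
                    out += linearize(kid)  # the whole subtree moves as a unit
        return out
    for kid in kids:  # Rule#1 to Rule#3: children first, then the head
        out += linearize(kid)
    return out + own

print(" ".join(linearize(5)))
```

The printed word order matches the final sentence given above.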


Advantages of using dependency relations with syntactic tags

If we look at Rule#3, both nodes have the same kind of tag (prep_*), but the parent node should be placed to the right of its child node.
Here, 'prep_in', which is the parent of 'prep_of', occurs before 'prep_of' in the source sentence, but in Hindi it is positioned after its child. So 'prep_of' is placed before 'prep_in', as we can see in the rules and in the final sentence after reordering. This reordering wouldn't have been possible if the rules were based only on syntactic tags.