User:Sl33k/Application
Name: Ahsan Bagwan
IRC: sl33k_
Email: ahsanbagwan@gmail.com
Skype: ahsanbagwan
Why is it you are interested in machine translation?
I am quite impressed by what machine translation tools and techniques have achieved for most of the Indo-European languages. Although comparatively little work in this subfield of computational linguistics has been done on the languages of the Indian subcontinent, recent trends suggest that this picture is likely to change [1]. As a student interested in natural language processing, I see MT as a great platform for working closely with linguistics and with the interaction between natural languages and computer languages.
Why is it that you are interested in the Apertium project?
Apertium fascinated me when I first came across it, largely because of its appeal as an open-source MT engine and because its tools process linguistic data in a way that seemed fairly comprehensive even at first glance. It couples this with some of the easiest-to-follow documentation. I had been looking to gain detailed knowledge of how it works, and the documentation walked me through the basic concepts and linked to further resources on the wiki, which has given me a good base to build upon.
Which of the published tasks are you interested in?
Apertium-ur-hi: Adopting the Urdu-Hindi language pair.
Why should Google and Apertium sponsor it?
Some initial work has already been done on the Urdu and Hindi morphological analysers for the ur-hi pair. Currently, there is no stable Indo-Aryan language pair in the repository. A stable pair with release-quality results would encourage the development of further pairs between Hindi and other related languages of the subcontinent.
Community Bonding Period
- Lots of reading on machine translation.
- Set up the Apertium environment and get more familiar with the tools.
- Plan for conflicts that could arise when writing structural transfer rules for the two languages.
- Work out more detailed scripts for testing.
Work Plan:
Week 1: Make sure the tagsets are consistent between M. Humayoun, IIIT and Apertium.
Week 2: Convert M. Humayoun's Urdu Morphology to lttoolbox, probably using the speling tools and a full-form list.
Week 3:
Week 4:
Deliverable #1: Wide-coverage monolingual dictionaries and a bilingual dictionary for ur-hi.
Week 5: Train part-of-speech taggers for both Urdu and Hindi.
Week 6:
Week 7:
Week 8:
Deliverable #2:
Week 9: Write transfer rules, if any are needed.
Week 10: Retrain the part-of-speech tagger with target-language training.
Week 11:
Week 12: Update the ur-hi documentation to reflect the changes. Write up some brief summary about
Deliverable #3: Completed project. Ready for release.
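As an illustration of the Week 2 conversion step, the following is a minimal sketch of how a full-form list might be turned into lttoolbox dictionary entries. The tab-separated input layout (surface form, lemma, dot-separated tags), the function names and the sample tags are assumptions made for this example only; they are not the format of M. Humayoun's resource or of the speling tools.

```python
from xml.sax.saxutils import escape


def to_lttoolbox_entry(surface, lemma, tags):
    """Build one full-form <e> entry: surface form on the left,
    lemma followed by its tag symbols on the right."""
    def text(s):
        # lttoolbox writes spaces inside an entry as <b/>
        return escape(s).replace(" ", "<b/>")

    symbols = "".join(
        '<s n="%s"/>' % escape(t, {'"': "&quot;"}) for t in tags
    )
    return "<e><p><l>%s</l><r>%s%s</r></p></e>" % (
        text(surface), text(lemma), symbols
    )


def convert(lines):
    """Convert lines of the hypothetical form
    'surface<TAB>lemma<TAB>tag.tag.tag' into <e> entries."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        surface, lemma, tag_field = line.split("\t")
        entries.append(
            to_lttoolbox_entry(surface, lemma, tag_field.split("."))
        )
    return entries


if __name__ == "__main__":
    # Illustrative sample data; the tags are placeholders.
    sample = [
        "کتاب\tکتاب\tn.f.sg",
        "کتابیں\tکتاب\tn.f.pl",
    ]
    for entry in convert(sample):
        print(entry)
```

The generated entries would still need to be placed inside a section of a monolingual dictionary file, with every tag declared in its symbol definitions. Grouping forms into paradigms, rather than listing full forms, is left out of this sketch.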
Research Done:
Bio
I am a final-year undergraduate student of Information Technology at Sinhagad Academy of Engineering (India). My coursework included finite-state transducers, along with computer laboratory assignments on XML and C++.
Experience
Programming languages: Competency in Python, C, Java. Limited experience with C++.
Markup languages: HTML, XML, reST
Notes
http://wiki.apertium.org/wiki/Contributing_to_an_existing_pair
http://secure.wikimedia.org/wikipedia/en/wiki/Hindi-Urdu_grammar
The Indo-Aryan Languages - Colin Masica [1]
References:
[1] Use of Machine Translation in India: Current Status http://www.mt-archive.info/MTS-2005-Naskar-2.pdf