User:Sl33k/Application

From Apertium

Name: Ahsan Bagwan

IRC: sl33k_

Email: ahsanbagwan@gmail.com

Skype: ahsanbagwan


Why is it you are interested in machine translation?

I am impressed by what machine translation tools and techniques have achieved for most Indo-European languages. Work in this sub-field of computational linguistics on the languages of the Indian subcontinent is still relatively scarce, but recent trends suggest that this picture is likely to change [1]. As a student interested in natural language processing, I see MT as a great platform for working closely with linguistics and with the interaction between natural languages and computer languages.


Why is it that you are interested in the Apertium project?

Apertium fascinated me when I first came across it, largely because of its appeal as a great open-source MT engine and because, even at first glance, its tools process linguistic data in a fairly comprehensive way. It couples this with some of the easiest documentation to follow. I was looking to gain detailed knowledge of how it works, and the documentation walked me through the basic concepts and linked to further resources on the wiki, which has given me a good base to build upon.

Which of the published tasks are you interested in?

Apertium-ur-hi: Adopting the Urdu-Hindi language pair.

Why should Google and Apertium sponsor it?

The ur-hi language pair already has some initial work done on its Urdu and Hindi morphological analysers. Currently, there is no stable Indo-Aryan language pair in the repository. A stable language pair with release-quality results would trigger the development of pairs between Hindi and other related languages of the subcontinent.


Community Bonding Period

- Read extensively on machine translation, Urdu-Hindi grammar, and the book The Indo-Aryan Languages.

- Set up the Apertium environment and become more familiar with the tools.

- Plan for conflicts that could arise when writing structural transfer rules for the two languages.

- Work out more detailed scripts for testing.

Work Plan:

Week 1: The tagsets need to have the same form across the Urdu and Hindi monolingual dictionaries and the bilingual dictionary. Make the Hindi tags consistent with the Urdu ones by following the Apertium tag conventions. Also, test the dictionaries manually.

Week 2: Add more words to the ur-hi bilingual dictionary.

Week 3: M. Humayoun has contributed a large amount of Urdu data, but it needs to be converted to the lttoolbox format. There is a module with Hindi paradigms chopped using the speling tools. Bring the Urdu data into similar accord with the speling tools.

Week 4: Test the work done so far. If any conflicts arise, resolve them.

Deliverable #1: Monolingual dictionaries and a bilingual dictionary for ur-hi with wide coverage.
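
The tagset check planned for Week 1 could be partly automated. The following is a minimal sketch, assuming that each dictionary declares its symbols in an `<sdefs>` section as lttoolbox dictionaries do. The two dictionary fragments, the tag names in them, and the function names are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed dictionary fragments (not the real ur-hi data).
URDU_DIX = """<dictionary>
  <sdefs>
    <sdef n="n"/>
    <sdef n="m"/>
    <sdef n="f"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
</dictionary>"""

HINDI_DIX = """<dictionary>
  <sdefs>
    <sdef n="n"/>
    <sdef n="mas"/>
    <sdef n="f"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
</dictionary>"""


def declared_tags(dix_xml):
    """Return the set of symbol names declared with <sdef n="..."/>."""
    root = ET.fromstring(dix_xml)
    return {sdef.get("n") for sdef in root.iter("sdef")}


def tagset_diff(first_xml, second_xml):
    """Return (only_in_first, only_in_second) as sorted lists of tag names."""
    first = declared_tags(first_xml)
    second = declared_tags(second_xml)
    return sorted(first - second), sorted(second - first)


if __name__ == "__main__":
    only_ur, only_hi = tagset_diff(URDU_DIX, HINDI_DIX)
    print("Declared only in Urdu: ", only_ur)
    print("Declared only in Hindi:", only_hi)
```

Tags reported on only one side are candidates for renaming. The check covers declarations only, so the manual dictionary tests mentioned for Week 1 are still needed.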

Week 5: The transfer rules file (.t1x) already has a few rules to work with. Add more rules to it.

Week 6: Train the part-of-speech tagger. No tagger exists yet, so a .tsx file would be created for both languages.

Week 7: Retrain the part-of-speech tagger. In this approach, a part-of-speech tagger for the source language (SL) is trained using information from the target language (TL).

Week 8: Continue retraining the part-of-speech tagger. Finish off some testing.

Deliverable #2: Completed transfer rules, plus part-of-speech tagger training and retraining.

Week 9: Carry out some tests to make sure nothing breaks.

Week 10: Begin writing a shell script to test the vocabulary (testvoc). This will be supported by continued manual testing.

Week 11: Carry out corpus testing.

Week 12: Update the ur-hi documentation to reflect the changes, including a brief summary of the work.

Deliverable #3: Completed project. Ready for release.
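
The testvoc and corpus testing in Weeks 10 and 11 come down to looking for the markers Apertium puts in front of words it could not handle: `*` for an unknown word, `@` for a lemma missing from the bilingual dictionary, and `#` for a form the generator could not produce. Below is a minimal sketch of such a check. The sample lines are invented placeholders, not real ur-hi output, and the function name is hypothetical.

```python
import re
from collections import Counter

# Markers Apertium prefixes to a word that some stage could not handle.
MARKERS = {
    "*": "unknown word",
    "@": "missing bilingual entry",
    "#": "generation failure",
}

# A marker counts only at the start of a whitespace-separated token.
MARKED_WORD = re.compile(r"(?<!\S)([*@#])\S+")


def count_markers(lines):
    """Count marked words per error type in translated output lines."""
    counts = Counter()
    for line in lines:
        for match in MARKED_WORD.finditer(line):
            counts[MARKERS[match.group(1)]] += 1
    return counts


# Invented placeholder output, used only to exercise the function.
SAMPLE = [
    "kitab achchhi hai",
    "@ghar bada hai",
    "#likhna *foo",
    "larka #jana",
]

if __name__ == "__main__":
    for kind, total in sorted(count_markers(SAMPLE).items()):
        print(f"{kind}: {total}")
```

In practice the lines would come from running the dictionary expansion or a corpus through the translation pipeline. A clean testvoc run would leave every count at zero.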


List your non-SoC activities

None except university exams from May 25 till June 10.

Bio


I am a final-year undergraduate student of Information Technology at Sinhagad Academy of Engineering (India). Our coursework included finite-state transducers, along with computer laboratory assignments on XML and C++.

Experience

Programming languages: Competency in Python, C, Java. Limited experience with C++.

Markup languages: HTML, XML, reST


Notes


http://wiki.apertium.org/wiki/Contributing_to_an_existing_pair

http://secure.wikimedia.org/wikipedia/en/wiki/Hindi-Urdu_grammar

The Indo-Aryan Languages - Colin Masica [1]


References:


[1] Use of Machine Translation in India: Current Status http://www.mt-archive.info/MTS-2005-Naskar-2.pdf