User:Sanam/Application 2012

Name: Sanam Ali

Email Address: eyeofthoughts@gmail.com

Contact Information: san_ on IRC, #Apertium

Why is it you are interested in machine translation?

I first encountered the concept of machine translation at my university, where I took some basic courses on it. Linguistics is a field that has always attracted me; besides learning languages, I also aim to do translation work with them. Machine translation combined with linguistics is the quickest help for language learners in this growing e-world.

Why is it that you are interested in the Apertium project?

I am eager to put my effort into language translation, because translation tools help people of different nations get to know each other and greatly reduce the communication barriers among them. The platform I consider most appropriate for this task is Apertium. The project idea published by Apertium caught my attention, and I find it well suited to me. I want to contribute to the orphaned language pair Urdu-Hindi. As a native Urdu speaker and a good Hindi speaker, I aim to make the best contribution I can to machine translation.

Which of the published tasks are you interested in?

I am interested in the orphaned language pair Urdu-Hindi, under "adopt a new language pair".

What do you plan to do?

My plan includes the following tasks:

1) As the Urdu-Hindi language pair is at an early stage, the first need is to develop and extend the monolingual and bilingual dictionaries to a satisfactory level.

2) Add words to the Urdu monolingual dictionary from M. Humayoun's Urdu Morphology and from other sources, especially Wikipedia.

3) Develop the bilingual dictionary using words from Wikipedia, M. Humayoun's Urdu Morphology and the IIIT Hindi analyzer.

4) Complete the pending tasks for the Hindi monolingual dictionary (i.e. converting verbs etc.) and extend it using other sources.

5) Train the POS taggers for both Urdu and Hindi.

6) Write transfer rules as needed.

7) Run quality controls.

Why should Google and Apertium sponsor it?

The language pair I have adopted is Urdu-Hindi. Urdu and Hindi are among the most widely spoken languages in the world and are considered major languages around the globe. Together they have more than 490 million speakers: 150 million of Urdu and around 340 million of Hindi. Contributing to this pair will benefit not only Urdu Wikipedians and speakers but Hindi ones as well.

How and who will it benefit in society?

Urdu and Hindi are closely related languages and are the national languages of Pakistan and India. Both countries are looking for progress in computer science and IT development. This Urdu-Hindi machine translation system will help people of these countries contribute to the development of technology not only within their own territory but also abroad. In future, it will also help in building translation systems for other language pairs involving Hindi or Urdu: when I went through the language pairs that need a good contribution, I found pairs such as Urdu-Punjabi, Urdu-Iranian Persian and Assamese-Hindi.

Work Plan:

Work already done (coding challenge and research):

I am a regular user of the #Apertium IRC channel. I have discussed my various queries with the Apertium developers. I received much guidance from mentor Francis Tyers and accomplished the coding challenge.

1) Installed Ubuntu in VirtualBox.

2) Completed installation of Apertium.

3) Completed installation of lttoolbox.

4) Installed the en-es language pair and practiced with it.

5) Went through the HOWTO (the coding challenge for ur-hi is in progress).

6) Completed the translation of the given story into both Hindi and Urdu; it was uploaded to the dev/ section of apertium-ur-hi by mentor Francis Tyers. Here is the link to my work: https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi/dev/ (files hi.txt and ur.txt).

7) I am studying the MT course given in the coding challenge and have completed almost half of it.

I went through the pending tasks for both Hindi and Urdu, gathered many resources and did some research, which will help me greatly in my GSoC project. These resources cover the following topics for Urdu and Hindi:

Grammatical

Morphological, lexicon and orthography resources

Dictionaries

Community Bonding Period:

Before the coding period of GSoC begins, my primary focus will be on the following tasks:

1) Familiarization with the community.

2) Better understanding of the Apertium development environment and its tools.

3) Go through the MT course in more depth.

4) Thorough inspection and use of the Apertium system and its related tools.

5) Enhance my knowledge of the adopted language pair.

6) Use the previously gathered resources and find more good ones.

The coding period:

Week 1 (May 21 – May 27)

  • Working on the monolingual dictionary of Urdu.
  • Converting nouns from Humayoun's morphology
  • Implementation of morphological paradigms
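
The noun conversion step could be sketched roughly as follows. This is only an illustration: the source does not describe the format of Humayoun's data, so the (lemma, stem, paradigm) tuples, the function names and the paradigm name `کتاب__n` are all assumptions, and the entry layout follows the usual lttoolbox dictionary shape (`<e lm="...">`, `<i>`, `<par n="..."/>`).

```python
from xml.sax.saxutils import escape, quoteattr


def monodix_entry(lemma, stem, paradigm):
    """Return one monolingual dictionary entry as an XML string."""
    return (
        f"<e lm={quoteattr(lemma)}>"
        f"<i>{escape(stem)}</i>"
        f"<par n={quoteattr(paradigm)}/>"
        "</e>"
    )


def convert_nouns(rows):
    """Convert (lemma, stem, paradigm) tuples, skipping duplicate lemmas."""
    seen = set()
    entries = []
    for lemma, stem, paradigm in rows:
        if lemma in seen:
            continue
        seen.add(lemma)
        entries.append(monodix_entry(lemma, stem, paradigm))
    return entries


if __name__ == "__main__":
    # Hypothetical sample rows; the paradigm name is made up for illustration.
    sample = [
        ("کتاب", "کتاب", "کتاب__n"),
        ("کتاب", "کتاب", "کتاب__n"),  # duplicate, will be skipped
    ]
    for line in convert_nouns(sample):
        print(line)
```

Skipping duplicate lemmas here is a simplification; words that legitimately need several entries would require a finer key than the lemma alone.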


Deliverable # 1

Week 2 + Week 3

  • Continue working on the monolingual dictionary of Urdu.
  • Addition of more necessary words
  • Working on morphological generator

Deliverable # 2

Week 4, 5 and 6

  • Start developing the bilingual ur-hi dictionary
  • Converting words of the Urdu dictionary into their Hindi equivalents
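
As a rough sketch of what this step produces, the snippet below formats Urdu-Hindi word pairs as bilingual dictionary entries. The function names and the sample pair are illustrative assumptions, not part of the existing apertium-ur-hi data; the entry layout follows the usual lttoolbox bilingual shape (`<e><p><l>...</l><r>...</r></p></e>` with `<s n="..."/>` tags).

```python
from xml.sax.saxutils import escape, quoteattr


def bidix_entry(urdu, hindi, pos):
    """Return one bilingual entry with the same POS tag on both sides."""
    tag = f"<s n={quoteattr(pos)}/>"
    return (
        "<e><p>"
        f"<l>{escape(urdu)}{tag}</l>"
        f"<r>{escape(hindi)}{tag}</r>"
        "</p></e>"
    )


def build_bidix(pairs):
    """Format (urdu, hindi, pos) tuples, dropping rows with an empty side."""
    return [bidix_entry(u, h, pos) for u, h, pos in pairs if u and h]


if __name__ == "__main__":
    # Hypothetical sample: "book" in Urdu and Hindi, tagged as a noun.
    for line in build_bidix([("کتاب", "किताब", "n")]):
        print(line)
```

Rows with a missing translation are dropped rather than guessed, so they can be reviewed by hand later.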

Deliverable # 3 and 4

Midterm Evaluation

Week 7, 8 and 9

  • Working on the monolingual dictionary of Hindi
  • Writing morphological paradigms
  • Addition of necessary words, specifically verbs
  • Checking of POS and tagger training

Deliverable # 5 and 6

Week 10, 11 and 12

  • Writing transfer rules where necessary
  • Running quality controls.
  • Wrapping up any necessary details

Deliverable # 7 and 8

Final submission and evaluation (August 18 – August 24)

List your skills and give evidence of your qualifications:

I am currently a final-year student in the Master in Computer Science program at Virtual University of Pakistan. I have studied different programming languages and courses during my studies at university and have worked on different projects and assignments. I also hold a one-year diploma in Computer Science and Information Technology.

I have good skills in C/C++, Java, SQL, HTML, HTML5, JavaScript, XML, PHP, Python and the basics of bash. I have working experience with MySQL 2003, WampServer, Tomcat, NetBeans, MS Access, Visual Studio and different Java platforms. Using the above programming languages and tools, I have developed an electronic card system, various web pages (dynamic and static), a voter registration system and a mini database project at my university. I am currently developing a multipurpose viva exam system with both web and desktop applications, using C++, Java, J2ME, Android and various .NET technologies.

Besides programming languages, I also have a good understanding of various natural languages. I am a native Urdu speaker, quite good at Urdu, Hindi and English, with intermediate skills in Spanish and Punjabi and a basic understanding of Persian. I took a one-year Spanish language course at the Escuela de Medicina run under ELAM in Cuba and earned a 6-month English language certificate locally. I am new to open source development, but I am confident that I will accomplish my task with my best effort and understanding.


List any non-Summer-of-Code plans you have for the Summer:

Besides my academic activities and GSoC, I have no other work on hand, so I can comfortably give 30 hours a week to GSoC. I will take my final semester exams in mid-July, which will take about 10 days; during this time I will remain committed to my GSoC project, though with less time than usual. There is a month of vacation in August, so the whole of August will be free to prepare for the final evaluation and complete all the tasks.

References:

1. Humayoun's Urdu Morphology [1]

2. IIIT Hindi and Urdu Shallow Parser [2]

3. Hindi and Urdu Grammar [3]

4. Hindi and Urdu Modules [4]

5. Hindi Parser [5]

6. Urdu and Hindi: lexicon, phonology, morphology and syntax [6]

7. Urdu-Hindi bilingual dictionary [7]

8. Urdu-Hindi transliteration [8]