Revision as of 21:10, 7 April 2011

Name: Hajder Shouhani
E-mail address: hajderr@gmail.com
IRC: #apertium@Freenode, haydarekarrar

About me and why I am interested in MT

I hope to finish my MSc degree in Computer Science this semester, and I'm currently doing my MSc degree project in Arabic NLP. The title of the project is “Arabic Language Analysis Toolkit”, and my aim is to investigate existing tools and resources for the Arabic language in order to build a small toolkit. By the end of the project the toolkit will consist of a morphological analyzer and a PoS tagger. These tools are fundamental to any NLP application, for example machine translation, which interests me because of the challenges involved in producing accurate translations. Better platforms and tools for translation would help overcome language barriers and make information easily available to all people from different sources. It’s these challenges that interest me and drive me to make better NLP tools.

Why am I interested in the Apertium project?

With the above in mind and being new to the Apertium project, I've chosen an Entry Level project of adopting a new language pair: Norwegian - Swedish. Probably Norwegian Bokmål and not Nynorsk - feedback on this please!

Apertium will allow me to work on what I consider to be one of the most interesting fields within NLP. Secondly, I’ve wanted to contribute to the open-source community for a while, but until now I have not found an organization that interested me, nor considered myself competent enough. With this year’s GSoC I hope to change that by being accepted to the programme.

Who will this benefit and why should it be sponsored?

Generally, translation between any language pair A-B lets speakers of language B understand texts in language A. That is the main purpose of making information in language A available in B. It can be used as a learning tool as well: for example, someone with limited knowledge of language A who is fluent in language B can use such a translation to learn language A better. I know that's what I've used Google Translate for with certain language pairs.

Although Norwegian is closely related to Swedish (think Spanish – Catalan maybe?), it's not necessarily the case that two native speakers understand each other fully. There are enough differences to justify translation, and more linguistic data for NLP applications is never really redundant, especially at a time when information is available globally and translation serves to bridge the gap between two languages. It’s hard to predict beforehand all the benefits that adopting this pair will bring to end-users, but there is clearly an interest in covering the Scandinavian languages in order to offer the community a better toolbox.

Work plan

What I had in mind is to divide the three months into three parts: background reading and documentation of the languages, building/coding the language pair in Apertium in parallel, and finally testing, gathering feedback and finalizing the build.

I want to follow a similar approach to the one in the New Language Pair Howto by building the language pair in phases. Specifically I thought of two groups that I'll finish in two phases.

Word group A – nouns, verbs and prepositions
Word group B – adjectives, pronouns and articles

(obviously including cases, numbers etc)

Week 1: Background reading: revise wiki articles New Language Pair and Contribute to an existing pair, terminology and Apertium documentation.
Week 2: Background reading: Swedish and Norwegian grammar.
Week 3: Start working on the dictionaries and transfer rules for word group A.
Week 4: Continue and finish the dictionaries and transfer rules for the language pair (still, group A), make ready for first deliverable.
Deliverable #1 – Document of the linguistic theory of word group A and the work done on the language pair so far (dictionaries and transfer files). The background reading and grammar notes about the languages can be published on the Apertium Wiki as a future reference for developers.
Week 5: Start working on the dictionaries and transfer rules for word group B.
Week 6: Continue and finish the dictionaries for the language pair, make ready for second deliverable.
Week 7: Test the dictionaries using testvoc.
Week 8: Fix bugs/errors in both word groups.
Deliverable #2 – The dictionaries and transfer files, document of word group B and its grammar. Results from the first round of testing.
Week 9: Test and evaluate.
Week 10: Implement feedback, fix bugs and add missing words found in week 9.
Week 11: Repeat the process from weeks 9-10: test and evaluate.
Week 12: Implement feedback received in previous week, finalize build and make ready for release.
Project completed

Other commitments

I try to set aside 2-3 hours per week for solving programming contest problems. Other than that I'll be free during the summer and it should not interfere with my work for Apertium.