User:Aboelhamd/proposal

GSoC 2019: Extend weighted transfer rules[1]

Personal Details

General Summary

I am Aboelhamd Aly, a 24-year-old Egyptian computer engineer. My first language is Arabic, not hieroglyphic :) . I currently live in Alexandria, Egypt, and I intend to pursue a master's degree abroad after finishing my undergraduate studies. I love languages and AI, and hence NLP. I have some research and industry experience in NLP, machine learning, parallel programming and optimization. I have been working alongside Sevilay Bayatli (piraye) on introducing a new module (weighted transfer rules) to Apertium, which encouraged me to choose the idea "Extend weighted transfer rules" so that we can continue our work and extend, integrate and deploy the full module.

Contacts

Email : aboelhamd.abotreka@gmail.com
Facebook : https://www.facebook.com/aboelhamd
LinkedIn : https://www.linkedin.com/in/aboelhamd-aly-76384a102/
IRC : aboelhamd
Github : https://github.com/aboelhamd
Time zone : GMT+2

Education

I am a senior bachelor's student at Alexandria University in Egypt. I was recently granted a scholarship to study for a master's degree in data science at Innopolis University in Russia.
My undergraduate major is computer engineering, which exposed me to almost everything in computing, from the lowest level of zeros and ones to the highest level of HCI (human-computer interaction, which mainly deals with user interfaces).
The subjects I loved most were artificial intelligence, machine learning, data mining and deep learning, because of the great potential of the AI field, which has already solved, and could still solve, many of the problems humans face today.

Languages Love

I love languages very much, especially Arabic, because it is a very beautiful language and, of course, because it is the language of our holy scripture (the Quraan), more than half of which I have memorized. I also love Arabic literature and have written several Arabic poems and short stories. All of that has given me a very good knowledge of classical and modern Arabic morphology, syntax and derivation. After Arabic comes English, which I also love very much, though I am certainly not as proficient in it as I am in Arabic.
So my love of languages and AI led me to work in the field of natural language processing, where I can combine my passion and my knowledge.

Last Year GSoC

Last year I tried to contribute to Apertium by introducing a new pair (Arabic-Syriac), but I failed because I was not at all familiar with Syriac or with Apertium, and because I started late, which made me rush. I needed a less overwhelming project, so I applied to the Classical Language Toolkit project (cltk)[2] to enhance some of its Classical Arabic functionality, and that was my proposal[3]. Unfortunately I was not accepted into the program, though my mentor told me at the time that Google had given them fewer slots than they asked for :( , and that the other 3 accepted applicants were postgraduate students with more experience than me in the field and in open-source projects :( .
After that I decided to contribute to an open-source project to gain both kinds of experience and to try again the next year, and here I am now :) .

Experience

Apertium

Sevilay and I have been working on introducing weighted transfer rules for months now. We re-implemented them as a new module that handles ambiguous transfer rules: it parses, matches and applies transfer rules, then trains maximum entropy models that choose the best ambiguous rule for any given pattern. Finally, it uses these models to get the best possible target sentence.

Industry

Last summer I was hired as a software engineering intern at Brightskies Technology, and after the internship I was hired as a part-time software engineer; I am still working there.
Our team works on parallel programming, optimization and machine learning projects. The 2 biggest companies we work with are Intel and Aramco.
My role is to understand, implement and optimize seismic algorithms and kernels, besides doing research on machine learning algorithms and topics.

Online courses

I have taken many online courses in many computer engineering tracks. One that I am very proud of is Udacity's machine learning nanodegree[4], a six-month program consisting of many courses and practical projects on machine learning.

Why interested in Apertium?

- I am very interested in NLP in general.
- Apertium has a very noble goal: bringing languages with scarce data to life by linking them to other languages through machine translation.
- I have previously contributed to Apertium and I am willing to build on that work.

Project Idea

Weighted transfer rules

When more than one transfer rule can be applied to a given pattern, we call the pattern ambiguous. Apertium resolves this ambiguity by applying the left-to-right longest match (LRLM) rule, which is not adequate for all the words that match that pattern.
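As a rough illustration of the ambiguity described above, the following Python sketch scans a sequence of word categories from left to right and, at each position, keeps the longest matching pattern. The rule names, the tag sequence and the representation of a rule as a plain tuple of categories are all hypothetical and are not taken from Apertium's code; real transfer rules match richer patterns. When two rules share the same longest pattern, the sketch returns both, which is the ambiguous case.

```python
from typing import List, Sequence, Tuple

# A rule is (hypothetical rule name, pattern of word categories).
Rule = Tuple[str, Tuple[str, ...]]


def lrlm_segments(
    tags: Sequence[str], rules: Sequence[Rule]
) -> List[Tuple[int, int, List[str]]]:
    """Scan left to right; at each position keep the longest matching pattern.

    Returns (start, end, rule_names) triples. More than one name in
    rule_names means the matched pattern is ambiguous.
    """
    segments = []
    i = 0
    while i < len(tags):
        best_len = 0
        best_rules: List[str] = []
        for name, pattern in rules:
            n = len(pattern)
            if n == 0 or tuple(tags[i:i + n]) != pattern:
                continue
            if n > best_len:
                best_len, best_rules = n, [name]
            elif n == best_len:
                best_rules.append(name)
        if best_len == 0:
            # No rule matches here: pass the word through and move on.
            segments.append((i, i + 1, []))
            i += 1
        else:
            segments.append((i, i + best_len, best_rules))
            i += best_len
    return segments


# Hypothetical rules: two of them share the pattern ("adj", "n").
rules = [
    ("adj_noun_a", ("adj", "n")),
    ("adj_noun_b", ("adj", "n")),
    ("noun", ("n",)),
    ("det_adj_noun", ("det", "adj", "n")),
]
segments = lrlm_segments(["det", "adj", "n", "vblex", "adj", "n"], rules)
```

In this example the first three words are covered by the single longest rule, while the last two words match two rules of equal length, so a plain LRLM scan has no principled way to prefer one over the other.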
To improve this resolution, a new module was introduced that weights the ambiguous rules according to the words that match the ambiguous pattern. This is done by training on a corpus to generate maximum entropy models, which are then used to choose the best (highest-weight) ambiguous rule to apply.
The module works as follows:
1- First we train an n-gram source language model (we use n=5).
2- We split the corpus into sentences. For a given sentence, we apply all the ambiguous transfer rules for each ambiguous pattern separately, while applying the LRLM rules to the other ambiguous patterns, and then get a score from the n-gram model for each of the resulting ambiguous sentences for that pattern.
3- These scores are then written to files, one file per ambiguous pattern. These files serve as the datasets for the yasmet tool, which trains the target language maximum entropy models.
4- Once the models are trained, the module is ready for use. Using a beam search algorithm, we choose the best possible ambiguous rules to apply, and hence get the best translation.
For a more detailed explanation, you can refer to this documentation[5].
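The final step can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the module's actual code: each ambiguous pattern in a sentence is assumed to come with a dictionary mapping hypothetical rule names to weights already produced by the trained models, and a combination of rules is scored by the sum of the log-weights. The function name, the rule names and the weights are made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def beam_search(
    candidates: Sequence[Dict[str, float]], beam_size: int = 3
) -> List[Tuple[List[str], float]]:
    """Pick one rule per ambiguous pattern, keeping the beam_size best partial choices.

    candidates[i] maps each rule applicable to the i-th ambiguous pattern
    to its (assumed, model-provided) weight. Returns the surviving
    (chosen_rules, log_score) pairs, best first.
    """
    beam: List[Tuple[List[str], float]] = [([], 0.0)]
    for options in candidates:
        # Extend every partial choice in the beam with every rule of this pattern.
        expanded = [
            (chosen + [rule], score + math.log(weight))
            for chosen, score in beam
            for rule, weight in options.items()
            if weight > 0.0
        ]
        # Keep only the best beam_size partial choices.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:beam_size]
    return beam


# Hypothetical weights for a sentence with two ambiguous patterns.
candidates = [
    {"r1": 0.7, "r2": 0.3},
    {"r3": 0.4, "r4": 0.6},
]
best = beam_search(candidates, beam_size=2)
best_rules, best_log_score = best[0]
```

With these made-up weights the best combination is r1 followed by r4. A larger beam keeps more alternatives alive at the cost of more work per pattern.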

Weighted transfer rules extension

The weighted transfer module was built to apply only chunker transfer rules. This idea is to extend the module so that it applies to interchunk and postchunk transfer rules too.
Both are similar to the chunker, but with some differences. For example, interchunk def-cats refer to the tags of the chunk itself and not to the lexical forms it contains, as in the chunker, while postchunk def-cats refer to the name of the chunk and have nothing to do with tags. The chunk element also has a different use, because it deals with chunks, not words. There are also some differences in the clip element attributes between the three transfer files.
All these differences can be considered minor with respect to the whole module that handles the chunker transfer rules, and I think adding these modifications will not take long.
So, in addition to this extension, I think it could be necessary to introduce new ideas or modifications that enhance the accuracy and efficiency of the whole module. I may also work on other ideas, related to this one or not, to make full use of the 3-month period.

Latest updates on WTR module

The module is now finished and in the testing phase. It does well with the Kazakh-Turkish pair, and we hope it does as well with other pairs such as Spanish-English, which has more transfer rules than any other pair in Apertium.
The latest code is uploaded in this repo[6]. The module is separate from the Apertium core; that is, installing Apertium alone is not enough, as one must download and install our module separately to use it along with Apertium.

Coding Challenge

The coding challenge was to set up a pair and train the existing weighted transfer rule code, which I had already done several times while testing and debugging the code.
Since that left me without a coding challenge to do, and since the module was separate from the Apertium core as mentioned before, Francis Tyers (spectei) told me to integrate the module, without the training part, with apertium-transfer, and I did that in this pull request[7].
He then told me to make the module depend on libraries already used in Apertium rather than external ones, as I had used 2 libraries that are not used in Apertium: pugixml to handle XML files and ICU to handle upper and lower case. Kevin Unhammer (unhammer) also gave me a helpful review of the code, and these issues were resolved.

Additional thoughts

There are additional thoughts and modifications to the weighted transfer rules proposed in the aforementioned documentation[8].
If some of them prove valid, they could be applied along with the extension too. I am also now looking for newer machine learning or deep learning methods to apply as alternatives to yasmet and the maximum entropy method. The search will focus mainly on related papers that tackle similar problems or this exact problem, and on how they applied machine learning or deep learning to solve them.
Some of the papers I will begin with are:
1) Rule Based Machine Translation Combined with Statistical Post Editor for Japanese to English Patent Translation[9] (2007)
2) Machine translation model using inductive logic programming[10] (2009)
3) Machine Learning for Hybrid Machine Translation[11] (2012)
4) Study and Comparison of Rule-Based and Statistical Catalan-Spanish Machine Translation Systems[12] (2012)
5) Latest trends in hybrid machine translation and its applications[13] (2015)
6) Multi-Source Neural Translation[14] (2016)
7) Neural Machine Translation with Extended Context[15] (2017)
8) Handling Homographs in Neural Machine Translation[16] (2017)
9) Machine Translation: Phrase-Based, Rule-Based and Neural Approaches with Linguistic Evaluation[17] (2017)
10) A Multitask-Based Neural Machine Translation Model with Part-of-Speech Tags Integration for Arabic Dialects[18] (2018)

Why Google and Apertium should sponsor it?

- The project enhances Apertium's translation for all pairs, bringing it closer to human translation.
- I have previous experience and the qualifications required to complete the project successfully. Since I participated in building the module, I will be able to extend it without much difficulty.
- Being accepted into GSoC and completing it successfully would have a huge impact on my CV and hence on my career.
- The stipend and the opportunity to have a job interview with Google are huge benefits to a fresh graduate like me.

How and who will it benefit in society?

As the project will hopefully enhance Apertium's translation and bring it closer to human translation, Apertium will become more reliable and efficient to use in daily life and for document translation. In the long term, this will enrich the data available for languages with scarce data, and hence help the speakers of such languages to enrich their languages and preserve them from extinction.

Work plan

Exams and community bonding

My final exams run from May 27 to June 20, which is almost exactly the first phase of GSoC this year. Since I will not be able to work during my exams, and I also want at least one free week before the first exam, I will start earlier, even before the accepted students are announced. I can do that because I will continue contributing to the module anyway, whether I am accepted or not.
So I will work on the first phase from April 19 to May 16. From May 17 to June 20 I will be preparing for and taking my exams. I will still be able to make minor changes if necessary, and I will also be open to discussions and chats about the first phase and the next one, so that I am ready to design and implement the code when I come back.

Schedule

Pre-GSoC

Week 1 (From April 5 - To April 11)	Continue code reformatting as proposed by mentors. Discuss with mentors what's next to add or modify in the code. Train the module with spa-eng pair and evaluate results.
Week 2 (From April 12 - To April 18)	Discuss with mentors some of the thoughts and ideas proposed in the documentation. Modify the documentation to reflect the new refactored code. Train the module with a third pair, other than kaz-tur and spa-eng, and evaluate results.
Deliverable	Weighted transfer rules module integrated with apertium-transfer. Evaluation of the weighted transfer rules module with 3 pairs, which will give us motivation and insight for further extension, enhancement and modification.

First milestone

Week 1 (From April 19 - To April 25)	Continue and finish any incomplete tasks from the previous two weeks: further refactoring, bug and issue fixing, documentation, evaluation, testing, etc. Search for and discuss newer and more efficient methods than training maximum entropy models. The search will include looking at related papers, starting with the ones mentioned above.
Week 2 (From April 26 - To May 2)	Continue searching for and discussing alternative methods to maximum entropy models. Discuss other previously proposed ideas and thoughts, such as generating all possible ambiguous combinations for some splits of a sentence, instead of the two methods we have tried so far (generating some samples, or all combinations, of a sentence). Other thoughts concern the default rule, alternative translations for one word in the lexical form, and others described in detail in the documentation.
Week 3 (From May 3 - To May 9)	Design and implement some of the valid thoughts and ideas.
Week 4 (From May 10 - To May 16)	Continue coding, testing and debugging. Write documentation. Train one pair and evaluate its accuracy.
Deliverable	Hopefully, a more accurate, cleaner and more robust weighted transfer rules module.
After exams (From June 21 - To June 28)	After exams, I will familiarize myself with the code again because my memory is not good enough :) . I will also write the mentor evaluation, complete any unfinished documentation, tests or evaluations, and fix any reported issues or bugs.

Second milestone

Training one pair takes considerable time, typically 2-3 days, during which I would otherwise be idle. So we can follow a strategy of implementing or modifying one idea and then training a pair to measure the gain from that idea; while it trains, I can start implementing another idea, and so on.

Week 5 (From June 28 - To July 4)	Compare the new evaluation with the past evaluation. After discussing the results, start modifying, adding or removing some parts of the module, or even start searching for or designing other ideas.
Week 6 (From July 5 - To July 11)	Search for, modify, design and implement some other ideas. Evaluate these ideas by training a pair and comparing with past results.
Week 7 (From July 12 - To July 18)	Fix any reported bugs or issues. Finish the enhancement phase by documenting the design and implementation done, and by preparing a report with all the evaluation results and the gains so far from the implemented ideas.
Week 8 (From July 19 - To July 25)	Read the apertium2 document again, review deprecated or out-of-date parts from different sources, and collect all the up-to-date transfer file specifications in a new document.
Deliverable	More accurate weighted transfer rules module, with an evaluation report and documentation.

Third milestone

Week 9 (From July 26 - To August 1)	Fix any errors found in the module after collecting the up-to-date specifications. Start designing and implementing the extension of the module to both interchunk and postchunk transfer files.
Week 10 (From August 2 - To August 8)	Continue coding and start testing and debugging.
Week 11 (From August 9 - To August 15)	Fix any reported bugs or issues. Finish coding, testing and debugging. Train one chosen pair and evaluate its accuracy.
Week 12 (From August 16 - To August 19)	Finish any incomplete tasks. Update the documentation and evaluation report.
Deliverable	Extended weighted transfer rules module, with documentation and evaluation report.

Other summer plans

- For the part-time job, as Francis told me that it is not compatible with GSoC, I decided to leave the job by April 15, before the first phase.
- For the first phase of GSoC I will still be in college, but I will be able to allocate at least 30 hours per week to GSoC.
- For the second and third phases, college will have finished, and I will be able to allocate at least 40 hours per week to GSoC.