Bachelor student, soon-to-be Master's student, problem solver, language lover, coding lover, always willing to learn and always willing to help.

'''GSoC 2019: Extend weighted transfer rules'''[http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code#Extend_weighted_transfer_rules]

GSoC 2019 proposal => [[User:Aboelhamd/proposal]]

GSoC 2019 progress => [[User:Aboelhamd/progress]]

== Personal Details ==

=== General Summary ===

I am Aboelhamd Aly, a 24-year-old Egyptian computer engineer. My mother tongue is Arabic, not hieroglyphics :) . I currently live in Alexandria, Egypt, and I intend to study for a master's degree abroad after finishing my undergraduate studies. I love languages, AI, and hence NLP. I have some research and industry experience in NLP, machine learning, parallel programming, and optimization. I have been working alongside Sevilay Bayatli (piraye) on introducing a new module (weighted transfer rules) to Apertium, and that encouraged me to choose the idea "Extend weighted transfer rules", so as to continue our work and to extend, integrate, and deploy the full module.

=== Contacts ===

Email : aboelhamd.abotreka@gmail.com <br />
Facebook : https://www.facebook.com/aboelhamd <br />
LinkedIn : https://www.linkedin.com/in/aboelhamd-aly-76384a102/ <br />
IRC : aboelhamd <br />
Github : https://github.com/aboelhamd

=== Education ===

I am a senior bachelor student at Alexandria University in Egypt. Recently I was granted a scholarship to study for a master's in data science at Innopolis University in Russia.

My undergraduate major is computer engineering, which covers everything in computers, from the lowest level of zeros and ones to the highest level of HCI (human-computer interaction, which mainly deals with user interfaces). <br />

The subjects I loved most were artificial intelligence, machine learning, data mining, and deep learning, because I see great potential in the AI field to solve many of the problems humans face today.

=== Love of Languages ===

I love languages very much, especially Arabic, because it is a very beautiful language and, of course, because it is the language of our holy scripture (the Quran), of which I have memorized more than half. I also love Arabic literature, and I have written several Arabic poems and short stories. All of that gave me very good knowledge of classical and modern Arabic morphology, syntax, and derivation. After Arabic comes English, which I also love very much, though I am surely not as proficient in it as in Arabic.<br />

And so my love of languages and AI led me to work in the natural language processing field, where I can combine my passion and knowledge of both.

=== Last Year's GSoC ===

Last year I tried to contribute to Apertium by introducing a new pair (Arabic-Syriac), but I failed miserably: I was not familiar at all with Syriac or with Apertium, and I also started late, which made me hasty; I needed a less overwhelming project.

I then applied to the Classical Language Toolkit project to enhance some of its Classical Arabic functionality, and that was my proposal[https://docs.google.com/document/d/1Rw-jEaeOwbjYNPKOhgiCGH5aG3p0qNxMOoT1QD3rxg0/edit?usp=sharing]. Unfortunately I was not accepted into the program, though my mentor told me afterwards that Google gave them fewer slots than they had asked for :( , and that the other 3 applicants were postgraduate students with more experience in the field and in open-source projects :( . <br />

After that I decided to contribute to an open-source project, both to gain experience and to try again the next year, and here I am now :) .

=== Experience ===

==== Apertium ====

Sevilay and I have been working on introducing weighted transfer rules for months now. We re-implemented a new module to handle ambiguous transfer rules: it parses, matches, and applies transfer rules to the source and target sentences, then trains maximum entropy models so as to be able to choose the best ambiguous rule for any given pattern.<br />

==== Online courses ====

I have taken many online courses spanning a wide spectrum of the computer engineering field. One that I am very proud of is Udacity's machine learning nanodegree[2], a six-month program consisting of many courses and practical projects on machine learning.

==== Industry ====

Last summer I was hired as a software engineering intern at Brightskies, a tech company, and after the internship I was hired as a part-time software engineer.<br /> Our team works on parallel programming, optimization, and machine learning projects. The two biggest companies we work with are Intel and Aramco.<br />

My role is to understand, implement, and optimize some seismic algorithms and kernels, besides doing research on some machine learning algorithms and topics.

=== Why am I interested in Apertium? ===

- I am very interested in NLP in general.<br />

- Apertium has a very noble goal: bringing languages with scarce data to life by linking them, through machine translation, with other languages.<br />

- I have previous contributions in Apertium and am willing to build on them.

== Project Idea ==

=== Weighted transfer rules ===

When more than one transfer rule can be applied to a given pattern, we call this an ambiguous situation. Apertium resolves it by choosing the left-to-right longest match (LRLM) rule(s) to apply, and that of course is not adequate for every word sequence matching those patterns. To solve this problem we introduced a way to weight these ambiguous rules with respect to the specific words matching the ambiguous patterns. This is done by training on a very large corpus, in order to capture expressive weights. The pipeline is as follows:<br />

1- First we train an n-gram language model (we use n = 5) on the target language.<br />

2- For each ambiguous pattern in a given sentence, we apply all of its ambiguous transfer rules separately (applying the LRLM rules to all the other ambiguous patterns) and get a score from the n-gram model for each of the ambiguous sentences generated for that pattern.<br />

3- These scores are then written to files, where each file contains the scores of one ambiguous pattern. These files are the datasets for the tool (we use yasmet) that trains the maximum entropy models.<br />

4- Once we have the models, the module is ready for use: using a beam search algorithm, we choose the best possible target. A rough sketch of this search follows.<br />

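To make step 4 concrete, here is a minimal sketch of how such a beam search could pick the best combination of rules, written in C++ since that is the module's language. The names, the data layout, and the assumption that per-rule weights are additive (log-space) are mine for illustration; this is not the module's actual code.

<pre>
// Illustrative sketch only: names and weight representation are assumptions.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// One entry per ambiguous pattern: a weight per candidate rule
// (as produced by that pattern's maximum entropy model).
using PatternWeights = std::vector<double>;

// A partial combination: the rule chosen for each pattern so far, plus score.
struct Hypothesis {
    std::vector<int> ruleChoices;
    double score = 0.0;
};

// Keep only the beamSize best partial combinations after each pattern,
// instead of enumerating every full combination of rules.
std::vector<int> beamSearch(const std::vector<PatternWeights>& patterns,
                            std::size_t beamSize) {
    std::vector<Hypothesis> beam{Hypothesis{}};
    for (const PatternWeights& weights : patterns) {
        std::vector<Hypothesis> expanded;
        for (const Hypothesis& hyp : beam) {
            for (std::size_t r = 0; r < weights.size(); ++r) {
                Hypothesis next = hyp;
                next.ruleChoices.push_back(static_cast<int>(r));
                next.score += weights[r];  // treat weights as additive (log-space)
                expanded.push_back(std::move(next));
            }
        }
        std::sort(expanded.begin(), expanded.end(),
                  [](const Hypothesis& a, const Hypothesis& b) {
                      return a.score > b.score;
                  });
        if (expanded.size() > beamSize) expanded.resize(beamSize);
        beam = std::move(expanded);
    }
    return beam.front().ruleChoices;
}

int main() {
    // Two ambiguous patterns: one with 3 candidate rules, one with 2.
    std::vector<PatternWeights> patterns = {{0.1, 0.7, 0.2}, {0.6, 0.4}};
    for (int rule : beamSearch(patterns, 4))
        std::cout << "chose rule " << rule << '\n';
}
</pre>

The point of the beam is that we never materialize every combination of rules: after each pattern, only the beamSize best partial combinations survive.
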
For a more detailed explanation, you can refer to this documentation[https://docs.google.com/document/d/1t0VnUhw_LwN0oNL7Sk1fqSJyPnWdxYElesuIV_htn7o/edit?usp=sharing].

=== Weighted transfer rules extension ===

The weighted transfer module we have worked on so far was built to apply only chunker transfer rules, and the idea is to extend it to interchunk and postchunk transfer rules too. Both are similar to the chunker stage, but with some differences. For example, interchunk def-cats refer to the tags of the chunk itself, not to the lexical forms it contains as in the chunker, while in postchunk they refer to the name of the chunk and have nothing to do with tags. The chunk element is also used differently, since it now deals with chunks rather than words, and there are some differences in the clip element's attributes between the three transfer files; a small illustration follows below. All these differences may be considered minor with respect to the whole module that handles the chunker transfer rules, and I think adding these modifications will not take long.<br />

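As a rough illustration of the def-cat and clip differences, here is a simplified interchunk example, adapted from the patterns documented on the Apertium wiki (the category names are just placeholders): a def-cat now matches the tags of a whole chunk, and clip has no side attribute but gains part="chcontent" to access a chunk's contents.

<pre>
<def-cat n="det">
  <cat-item tags="DET.*"/>     <!-- matches a chunk's own tags, not its words -->
</def-cat>

<rule>
  <pattern>
    <pattern-item n="det"/>
    <pattern-item n="nom"/>
  </pattern>
  <action>
    <out>
      <chunk>                  <!-- chunks, not lexical units, are re-emitted -->
        <clip pos="2" part="lem"/>
        <clip pos="2" part="tags"/>
        <clip pos="2" part="chcontent"/>
      </chunk>
      <b pos="1"/>
      <chunk>
        <clip pos="1" part="lem"/>
        <clip pos="1" part="tags"/>
        <clip pos="1" part="chcontent"/>
      </chunk>
    </out>
  </action>
</rule>
</pre>
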
So in addition to this extension, I think the best thing to do would be to introduce new ideas or modifications that could enhance the accuracy and efficiency of the whole module. I may also work on other ideas, related to this one or not, to make full use of the 3-month period.

=== Latest updates on WTR module ===

The module is now finished and in the testing phase. It does well with the Kazakh-Turkish pair, and we hope it does as well with other pairs such as Spanish-English, which has more transfer rules than any other pair in Apertium.<br />

The module is separate from the Apertium core; that is, installing Apertium alone is not enough, and one has to download and install our module separately to use it along with Apertium. So Francis Tyers (spectei) told me to integrate the module (without the training part) into apertium-transfer, which I did in this pull request[https://github.com/apertium/apertium/pull/41]. He then told me to make the module depend only on libraries already used in Apertium rather than external ones, as I had used two libraries: pugixml to handle XML files and ICU to handle upper and lower case. Kevin Unhammer (unhammer) also gave me some helpful review comments on the code, and I am currently resolving all these issues.

=== Additional thoughts ===

There are additional thoughts on, and modifications to, the weighted transfer rules proposed in the aforementioned documentation[https://docs.google.com/document/d/1t0VnUhw_LwN0oNL7Sk1fqSJyPnWdxYElesuIV_htn7o/edit?usp=sharing].<br />

Here are some of them, which can be applied along with the extension:<br />

1- At first we generated all the possible combinations of ambiguous target sentences for a given source sentence, and then scored them using the n-gram model. But some long sentences gave us millions of combinations, and hence all their scores would be so similar that we could not tell which was better. So instead we now use a sampling method.<br />

But I think that if we split such long sentences and generate all possible combinations for the new sub-sentences, we will get far fewer combinations than for the long sentence, and hence better scores; a back-of-the-envelope example follows.<br />

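As a rough illustration (with made-up but plausible numbers): a sentence containing 10 ambiguous patterns with 3 candidate rules each yields 3^10 = 59,049 combinations to score, while splitting it into two sub-sentences of 5 patterns each yields only 2 × 3^5 = 486 scorings, at the cost of ignoring interactions across the split point.
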
2- As explained before, we score the whole sentence in order to actually score one of the ambiguous rules applied to one ambiguous pattern. That is good because it captures the relation of the other words to that pattern.<br />

But what about scoring only the pattern we actually want to score? If the pattern has only one word, we score it with a 1-gram model; if it has 3 words, we score it with a 3-gram model; and so on. I think this may give a more accurate score than scoring the whole sentence with the same n-gram model.<br />

This would also let us skip the training of the maximum entropy models, which takes many hours (typically 1-3 days), because the score of one ambiguous pattern would then be the same every time. We would just have to score all the ambiguous patterns, taking the highest-scoring rule for each one, to get the best sentence; a small sketch of this follows.<br />

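As a thought experiment for idea 2, here is a minimal sketch of what decoding could look like if each ambiguous pattern were scored in isolation. scorePattern below is a dummy stand-in for a real k-gram language model query (with k equal to the pattern output's length), not an existing API.

<pre>
// Illustrative sketch: scorePattern stands in for a real k-gram LM query.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Target words produced by applying one candidate rule to one pattern.
struct Candidate {
    int ruleId;
    std::vector<std::string> words;
};

// Placeholder scorer: a real implementation would query an n-gram model
// whose order k equals the number of words in the pattern's output.
double scorePattern(const std::vector<std::string>& words) {
    return -static_cast<double>(words.size());  // dummy: prefer shorter output
}

// Pick the best rule for one ambiguous pattern by scoring only its own
// output; no max-entropy training is needed, since the score never changes.
int bestRule(const std::vector<Candidate>& candidates) {
    int best = candidates.front().ruleId;
    double bestScore = scorePattern(candidates.front().words);
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        double s = scorePattern(candidates[i].words);
        if (s > bestScore) {
            bestScore = s;
            best = candidates[i].ruleId;
        }
    }
    return best;
}

int main() {
    std::vector<Candidate> c = {{0, {"the", "red", "car"}},
                                {1, {"the", "car", "the", "red"}}};
    std::cout << "best rule: " << bestRule(c) << '\n';  // prints 0
}
</pre>
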
3- With my modest experience in parallel programming and optimization, I could try refactoring the code to get the best performance and efficiency I can out of it.<br />

=== Why should Google and Apertium sponsor it? ===

- The project enhances Apertium's translation for all pairs, bringing it closer to human translation.<br />

- I have the right experience and qualifications to complete it successfully, and since I participated in building the module, I will easily be able to extend it.<br />

=== How and whom will it benefit in society? ===

As the project will enhance Apertium's translation and bring it closer to human translation, Apertium will become more reliable and efficient to use in daily life, especially for document translation. In the long term this will enrich the data of languages that suffer from data scarcity, and hence help the speakers of such languages enrich their languages and preserve them from extinction.

=== Other ideas? ===

== Work plan ==

=== Exams and community bonding ===

=== Schedule ===

==== First milestone ====

{| class="wikitable" border="1"
|-
| Week 1
(From - To)
|
|-
| Week 2
(From - To)
|
|-
| Week 3
(From - To)
|
|-
| Week 4
(From - To)
|
|-
| Deliverable
|
|}

==== Second milestone ====

{| class="wikitable" border="1"
|-
| Week 5
(From - To)
|
|-
| Week 6
(From - To)
|
|-
| Week 7
(From - To)
|
|-
| Week 8
(From - To)
|
|-
| Deliverable
|
|}

==== Third milestone ====

{| class="wikitable" border="1"
|-
| Week 9
(From - To)
|
|-
| Week 10
(From - To)
|
|-
| Week 11
(From - To)
|
|-
| Week 12
(From - To)
|
|-
| Deliverable
|
|}

=== Other summer plans ===

[[Category:GSoC 2019 student proposals|Aboelhamd]]