Difference between revisions of "User:OmarKassem"
OmarKassem (talk | contribs) |
OmarKassem (talk | contribs) |
||
Line 4: | Line 4: | ||
=== General Summary === |
=== General Summary === |
||
I am Omar Kassem, a senior Computer Engineering Student |
I am Omar Kassem, a senior Computer Engineering Student. I have some research in machine learning and deep learning, and I am currently working on my Graduation Project which is a Deep Learning problem called Visual Question Answering. |
||
Last year I started learning more about NLP and I found it interesting. I am currently working on solving a challenge offered by GoogleAI about Gendered Pronoun Resolution which is part of coreference resolution, the task of pairing an expression to its referring entity. This is an important task for natural language. |
Last year I started learning more about NLP and I found it very interesting. I am currently working on solving a challenge offered by GoogleAI about Gendered Pronoun Resolution which is part of coreference resolution, the task of pairing an expression to its referring entity. This is an important task for natural language. |
||
Line 19: | Line 18: | ||
=== Education === |
=== Education === |
||
I am a student at Alexandria University in Egypt my Major is Computer Engineering. The undergraduate curriculum in Computer and System Engineering at Alexandria University introduces me to a wide variety of engineering subjects. Various courses like Artificial Intelligence, Data Mining, Deep Learning, Networks, Compilers, Data Structures & Algorithms, Software Engineering, Operating Systems provided me with a strong footing in the theoretical concept of Computer Science and Engineering. |
|||
I am a senior bachelor student at Alexandria University in Egypt. Recently I have been granted a scholarship to study masters in data science at Innopolis University in Russia.<br /> |
|||
While offering both depth and breadth across this field, these courses put into perspective the importance and relevance of Computer Science and Engineering and the application of its fundamentals to the problems faced by the real world. So, I can realize that learning and developing my knowledge of Computer Science. |
|||
My undergraduate major is computer engineering, which exposed me to almost everything in computers from the lowest level of zeros and ones to the highest level of HCI (human and computer interaction, mainly deals with user interface). <br /> |
|||
The subjects I loved the most were artificial intelligence, machine learning, data mining and deep learning, and that's because of the great potential in the AI field that already solved and could solve many of the problems humans face today. |
|||
=== Last Year GSoC === |
=== Last Year GSoC === |
||
I then applied to classical language tool-kit project (cltk)[http://cltk.org/] to enhance Arabic support and adding new functionalities (e.g. Word segmenter, Lemmatization, Part-of-speech tagging, etc.) and that was my proposal[https://docs.google.com/document/d/1HHyKsx1oM0I1Y5kQOAMr7fzdTFovg1Olf1ztoxKCh3s/edit?usp=sharing] but Unfortunately I wasn't accepted in the program. |
I then applied to classical language tool-kit project (cltk)[http://cltk.org/] to enhance Arabic support and adding new functionalities (e.g. Word segmenter, Lemmatization, Part-of-speech tagging, etc.) and that was my proposal[https://docs.google.com/document/d/1HHyKsx1oM0I1Y5kQOAMr7fzdTFovg1Olf1ztoxKCh3s/edit?usp=sharing] but Unfortunately I wasn't accepted in the program. |
||
This year I will apply in this task only. |
|||
Line 37: | Line 37: | ||
==== Online courses ==== |
==== Online courses ==== |
||
I finished udacity's machine-learning nano-degree which is a six-months program and this is the Certificate[https://confirm.udacity.com/LMDXL4AD]. In this program I mastered Supervised, Unsupervised, Reinforcement, and Deep Learning fundamentals. |
|||
Learning fundamentals. |
|||
Line 46: | Line 45: | ||
== Project Idea == |
== Project Idea == |
||
The MT strategy used in the system is a classical shallow-transfer or transformer system consisting of an 8-module assembly line. To ease diagnosis and independent testing, modules communicate between them using text streams. This allows for some of the modules to be used in isolation, independently from the rest of the MT system, for other natural language processing tasks. |
|||
The main idea is developing compilers to convert the linguistic data into the corresponding efficient form used by each of the modules of the engine. Four compilers are used in this project: |
|||
Here is a brief description of the proposed Compilers of all modules:- |
|||
1- A source-to-source compiler which takes in MorphTrans-style format (with keywords in English) (described [http://www.internostrum.com/docum/morphtrans.ps here]) and generate the current XML(i.e., .t1x, .t2x and .t3x). |
|||
2- An XSLT stylesheet which, executed on a standard XSLT processor, reads in the XML file with structural transfer rules and generate MorphTrans-style code. |
|||
3- A source-to-source compiler which takes in InterNostrum formatted file (described [http://www.sepln.org/revistaSEPLN/revista/25/25-Pag93.pdf here]) and output a (.dix) file which is used in the four lexical processing modules (morphological analyser, lexical transfer, morphological generator, post-generator). These modules are currently reads binary files containing a compact and efficient representation of a class of finite-state transducers. These binaries are generated from (.dix) XML dictionaries. |
|||
4- An multi-pass XSLT stylesheet to convert the XML Dictionary file (.dix) to InterNostrum formatted file (Morphological text-format dictionary) |
|||
Round trip check tool will be used to convert from text to xml and then xml back to text or vice-versa to validate there is no error during the conversion from one form to another. |
|||
Also the Makefiles will be edited so the compilers will run to compile the new files once they are updated to convert them to the other formate (e.g XML) |
|||
Line 57: | Line 72: | ||
=== Why google and apertium should sponsor it ? === |
=== Why google and apertium should sponsor it ? === |
||
An adequate documentation of the code and auxiliary files is crucial for the success of open source software. In the case of a MT system, this implies carefully defining a systematic format for each source of linguistic data used by the system. The format used now is XML which is very overt and clear, but clumsy and hard to write. It also may be difficult for some developers and make the development difficult and time consuming for them. So, there is need of removing these complexity and providing simple way which will be in text format to write the dictionary and the transfer rules. |
|||
=== How and who will it benefit in society ? === |
=== How and who will it benefit in society ? === |
||
Changing the format of the dictionaries and the structure transfer rules will allow the linguist to focus on describing the lexicon and morphology of the language in question in a simple format and frees him or her of having to think as a programmer. |
|||
Line 74: | Line 92: | ||
|- |
|- |
||
| Week 1 |
| Week 1 |
||
(From April |
(From April 9 - To April 11) |
||
| |
| |
||
Continue working on the coding challenge. |
|||
|- |
|||
| Week 2 |
|||
(From April 12 - To May 18) |
|||
| |
|||
|- |
|||
| Deliverable |
|||
| |
|||
|} |
|} |
||
==== First milestone ==== |
==== First milestone ==== |
||
I will start working on the task early (typically I will start coding on May 6 once google announce the results to save time as my final exams will start on May 27 to June 20 ). |
|||
{| class="wikitable" border="1" |
{| class="wikitable" border="1" |
||
|- |
|- |
||
| Week 1 |
| Week 1 |
||
(From |
(From May 6 - To May 13) |
||
| |
| |
||
Investigating more about the transfer rule files and understanding the MorphTrans-style |
|||
|- |
|- |
||
| Week 2 |
| Week 2 |
||
(From |
(From May 13 - To May 20) |
||
| |
| |
||
Researching the best way of augmenting the current MorphTrans-style and expanding it to adapt it to the other .t2x and .t3x files |
|||
|- |
|- |
||
| Week 3 |
| Week 3 |
||
(From May |
(From May 20 - To May 27) |
||
| |
| |
||
Developing the first compiler which takes in MorphTrans-style format and generate the current XML(i.e., .t1x, .t2x and .t3x). |
|||
|- |
|||
| Week 4 |
|||
(From May 10 - To May 16) |
|||
| |
|||
|- |
|- |
||
| Deliverable |
| Deliverable |
||
| |
| |
||
The MorphTrans to XML compiler |
|||
|- |
|- |
||
| Week |
| Week 4 |
||
(From |
(From May 27 - To June 20) |
||
| |
| |
||
I will be taking my exams and I will be available for any changes in the delivered work or any discussion for the upcoming tasks. |
|||
|} |
|} |
||
Line 130: | Line 132: | ||
==== Second milestone ==== |
==== Second milestone ==== |
||
{| class="wikitable" border="1" |
{| class="wikitable" border="1" |
||
|- |
|- |
||
| Week 5 |
| Week 5 |
||
(From June |
(From June 21 - To June 28) |
||
| |
| |
||
Fixing any error in the MorphTrans to XML compiler, start working on the compiler that will do the contrary: take the XML and generate the Morphtrans' style input using XSLT. |
|||
|- |
|- |
||
| Week 6 |
| Week 6 |
||
(From |
(From June 28 - To July 4) |
||
| |
| |
||
Finish developing the compiler which takes XML and generate the Morphtrans' style input using XSLT. |
|||
|- |
|- |
||
| Week 7 |
| Week 7 |
||
(From July |
(From July 5 - To July 11) |
||
| |
| |
||
Creating Round trip checker and perform the validation for the both way of conversion.<br/> |
|||
Updating the Makefile so updating one formate will affect the other. |
|||
|- |
|||
| Week 8 |
|||
(From July 19 - To July 25) |
|||
| |
|||
|- |
|- |
||
| Deliverable |
| Deliverable |
||
| |
| |
||
The final Compiler for the both way conversion of the format of Transfer Rules |
|||
|} |
|} |
||
Line 167: | Line 159: | ||
{| class="wikitable" border="1" |
{| class="wikitable" border="1" |
||
|- |
|||
| Week 8 |
|||
(From July 12 - To July 18) |
|||
| |
|||
Investigating more about the different dictionary files, understanding the InterNostrum file and finding an optimal way to specify the alphabet for InterNostrum. |
|||
|- |
|- |
||
| Week 9 |
| Week 9 |
||
(From July 19 - To July 25) |
|||
| |
|||
Developing the compiler which takes in InterNostrum format and generate the XML(.dix) file. |
|||
|- |
|||
| Week 10 |
|||
(From July 26 - To August 1) |
(From July 26 - To August 1) |
||
| |
| |
||
Finish developing the compiler which takes in InterNostrum format and generate the XML(.dix) file, Starting investigating the multi-pass XSLT and start developing the contrary compiler. |
|||
|- |
|- |
||
| Week |
| Week 11 |
||
(From August 2 - To August 8) |
(From August 2 - To August 8) |
||
| |
| |
||
Finish developing the contrary compiler which takes in XML(.dix) file and generate the InterNostrum format. |
|||
|- |
|- |
||
| Week |
| Week 12 |
||
(From August 9 - To August 15) |
(From August 9 - To August 15) |
||
| |
| |
||
Creating Round trip checker and perform the validation for the both way of conversion.<br/> |
|||
Updating the Makefile so updating one formate will affect the other. |
|||
|- |
|- |
||
| Week |
| Week 13 |
||
(From August 16 - To August 19) |
(From August 16 - To August 19) |
||
| |
| |
||
Finishing the documentation and final testing. |
|||
|- |
|- |
||
| Deliverable |
| Deliverable |
||
| |
| |
||
The final Compiler for the both way conversion of the format of Dictionary (.dix)files. |
|||
|} |
|} |
||
Line 199: | Line 201: | ||
=== Other summer plans === |
=== Other summer plans === |
||
Google Summer of Code would be my main plan for the whole summer. For the first phase of GSoC I will be taking my exams so I will finish most of the work in the community bonding phase. For the rest of GSoC I'll be able to dedicate around 30 to 40 hours that week to the project. |
|||
Revision as of 22:28, 8 April 2019
GSOC 2019 : Light alternative format for all XML files in an Apertium language pair[1]
Personal Details
General Summary
I am Omar Kassem, a senior Computer Engineering student. I have done some research in machine learning and deep learning, and I am currently working on my graduation project, a deep learning problem called Visual Question Answering. Last year I started learning more about NLP and found it very interesting. I am currently working on a challenge offered by GoogleAI on Gendered Pronoun Resolution, which is part of coreference resolution, the task of pairing an expression with the entity it refers to. This is an important task for natural language processing.
Contacts
Email : omarahmed1473@outlook.com
LinkedIn : https://www.linkedin.com/in/omarQasim10/
IRC : OmarKassem
Github : https://github.com/omarahmed10
Time zone : GMT+2
Education
I am a student at Alexandria University in Egypt, majoring in Computer Engineering. The undergraduate curriculum in Computer and System Engineering at Alexandria University introduced me to a wide variety of engineering subjects. Courses such as Artificial Intelligence, Data Mining, Deep Learning, Networks, Compilers, Data Structures & Algorithms, Software Engineering, and Operating Systems gave me a strong footing in the theoretical concepts of Computer Science and Engineering. While offering both depth and breadth across the field, these courses put into perspective the importance and relevance of Computer Science and Engineering and the application of its fundamentals to real-world problems. As a result, I came to value learning and developing my knowledge of Computer Science.
Last Year GSoC
Last year I applied to the Classical Language Toolkit (CLTK) project[2] to enhance its Arabic support and add new functionality (e.g. word segmenter, lemmatization, part-of-speech tagging, etc.). That was my proposal[3], but unfortunately I wasn't accepted into the program. This year I will apply for this task only.
Experience
Industry
Last summer I was an iOS developer intern at InovaEg (a company here in Alexandria), where I worked on adding new features to the Ummahlink iOS app using the Swift programming language.
Online courses
I finished Udacity's Machine Learning Nanodegree, a six-month program, and this is the certificate[4]. In this program I mastered the fundamentals of supervised, unsupervised, reinforcement, and deep learning.
Why interested in Apertium?
Since I became interested in NLP, I have found that the best way to explore and learn in this field is through open source. In open source, people from different regions contribute, and that diversity is at the heart of a good machine translation system. Apertium is the best open source project I have found in the field: it is easy to understand and contribute to, and it has very active and supportive contributors.
Project Idea
The MT strategy used in the system is a classical shallow-transfer or transformer system consisting of an 8-module assembly line. To ease diagnosis and independent testing, the modules communicate with each other using text streams. This allows some of the modules to be used in isolation, independently of the rest of the MT system, for other natural language processing tasks.
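As a rough illustration of why text streams make it possible to use a module in isolation, the sketch below chains two toy stages that each take plain text and return plain text. The stage names (`analyse`, `strip_surface`), the `run_pipeline` helper, and the tiny lexicon are hypothetical and are not Apertium's actual modules or data; the `^surface/analysis$` notation only loosely imitates the kind of stream passed between modules.

```python
from typing import Callable, Dict, Iterable

# Hypothetical toy lexicon: surface form -> analysis (lemma plus tags).
TOY_LEXICON: Dict[str, str] = {
    "cats": "cat<n><pl>",
    "sleep": "sleep<vblex><pres>",
}


def analyse(text: str) -> str:
    """Toy 'analyser' stage: wrap each word as ^surface/analysis$."""
    units = []
    for word in text.split():
        # Unknown words are marked with a leading '*'.
        analysis = TOY_LEXICON.get(word, "*" + word)
        units.append(f"^{word}/{analysis}$")
    return " ".join(units)


def strip_surface(stream: str) -> str:
    """Toy downstream stage: keep only the analysis of each unit."""
    out = []
    for unit in stream.split():
        _surface, _, analysis = unit.strip("^$").partition("/")
        out.append(f"^{analysis}$")
    return " ".join(out)


def run_pipeline(text: str, stages: Iterable[Callable[[str], str]]) -> str:
    """Feed the text output of each stage into the next one."""
    for stage in stages:
        text = stage(text)
    return text
```

Because every stage only sees text, `analyse` can be called on its own for testing, or combined with other stages through `run_pipeline("cats sleep", [analyse, strip_surface])`.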
The main idea is to develop compilers that convert the linguistic data into the efficient form used by each module of the engine. Four compilers are proposed in this project; each is briefly described below:
1- A source-to-source compiler which takes in the MorphTrans-style format (with keywords in English) (described here) and generates the current XML (i.e., .t1x, .t2x and .t3x).
2- An XSLT stylesheet which, executed on a standard XSLT processor, reads in the XML file with structural transfer rules and generates MorphTrans-style code.
3- A source-to-source compiler which takes in an InterNostrum formatted file (described here) and outputs a (.dix) file, which is used in the four lexical processing modules (morphological analyser, lexical transfer, morphological generator, post-generator). These modules currently read binary files containing a compact and efficient representation of a class of finite-state transducers. These binaries are generated from (.dix) XML dictionaries.
4- A multi-pass XSLT stylesheet to convert the XML dictionary file (.dix) to an InterNostrum formatted file (morphological text-format dictionary).
A round-trip check tool will be used to convert from text to XML and then from XML back to text, or vice versa, to validate that no errors are introduced during the conversion from one form to the other.
The Makefiles will also be edited so that, whenever a file in one format is updated, the compilers run to convert it to the other format (e.g. XML).
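The round-trip idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tab-separated text layout (`surface<TAB>lemma<TAB>tag.tag`) is a made-up stand-in, not the InterNostrum or MorphTrans format, and the XML produced is a simplified subset loosely modelled on dictionary entries (`e`, `p`, `l`, `r`, `s`), not a complete .dix file. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_to_xml(text: str) -> str:
    """Convert hypothetical 'surface<TAB>lemma<TAB>tag.tag' lines to XML."""
    section = ET.Element("section")
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        surface, lemma, tags = line.split("\t")
        pair = ET.SubElement(ET.SubElement(section, "e"), "p")
        ET.SubElement(pair, "l").text = surface
        right = ET.SubElement(pair, "r")
        right.text = lemma
        for tag in filter(None, tags.split(".")):
            ET.SubElement(right, "s", n=tag)
    return ET.tostring(section, encoding="unicode")


def xml_to_text(xml: str) -> str:
    """Convert the simplified XML back to the hypothetical text layout."""
    lines = []
    for entry in ET.fromstring(xml).iter("e"):
        left = entry.find("p/l")
        right = entry.find("p/r")
        tags = ".".join(s.get("n", "") for s in right.findall("s"))
        lines.append(f"{left.text or ''}\t{right.text or ''}\t{tags}")
    return "\n".join(lines)


def round_trips(text: str) -> bool:
    """True if text -> XML -> text gives back the non-blank input lines."""
    expected = "\n".join(l for l in text.splitlines() if l.strip())
    return xml_to_text(text_to_xml(text)) == expected
```

For example, `round_trips("cats\tcat\tn.pl\n")` converts the single entry to XML and back and compares the result with the input. A real checker would also need the opposite direction (XML to text to XML), which requires comparing XML trees rather than strings.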
Coding Challenge
Why should Google and Apertium sponsor it?
Adequate documentation of the code and auxiliary files is crucial for the success of open source software. In the case of an MT system, this implies carefully defining a systematic format for each source of linguistic data used by the system. The format used now is XML, which is very explicit and clear but clumsy and hard to write. It can also be difficult for some developers, making development slow and time-consuming for them. So there is a need to remove this complexity and provide a simple text format for writing the dictionaries and the transfer rules.
How and whom will it benefit in society?
Changing the format of the dictionaries and the structural transfer rules will allow linguists to focus on describing the lexicon and morphology of the language in question in a simple format, and will free them from having to think like a programmer.
Work plan
Schedule
Pre-GSoC
Week 1 (From April 9 - To April 11) | Continue working on the coding challenge.
First milestone
I will start working on the task early: I will start coding on May 6, once Google announces the results, to save time, since my final exams run from May 27 to June 20.
Week 1 (From May 6 - To May 13) | Investigating the transfer rule files further and understanding the MorphTrans-style format.
Week 2 (From May 13 - To May 20) | Researching the best way of augmenting the current MorphTrans-style format and expanding it to cover the other .t2x and .t3x files.
Week 3 (From May 20 - To May 27) | Developing the first compiler, which takes in the MorphTrans-style format and generates the current XML (i.e., .t1x, .t2x and .t3x).
Deliverable | The MorphTrans to XML compiler.
Week 4 (From May 27 - To June 20) | I will be taking my exams, but I will be available for any changes to the delivered work or any discussion of the upcoming tasks.
Second milestone
Week 5 (From June 21 - To June 28) | Fixing any errors in the MorphTrans to XML compiler and starting work on the compiler that does the reverse: taking the XML and generating the MorphTrans-style input using XSLT.
Week 6 (From June 28 - To July 4) | Finishing the compiler which takes XML and generates the MorphTrans-style input using XSLT.
Week 7 (From July 5 - To July 11) | Creating the round-trip checker and performing the validation for both directions of conversion.
Deliverable | The final compiler for converting the transfer rules format in both directions.
Third milestone
Week 8 (From July 12 - To July 18) | Investigating the different dictionary files further, understanding the InterNostrum file and finding an optimal way to specify the alphabet for InterNostrum.
Week 9 (From July 19 - To July 25) | Developing the compiler which takes in the InterNostrum format and generates the XML (.dix) file.
Week 10 (From July 26 - To August 1) | Finishing the compiler which takes in the InterNostrum format and generates the XML (.dix) file; starting to investigate the multi-pass XSLT and to develop the reverse compiler.
Week 11 (From August 2 - To August 8) | Finishing the reverse compiler, which takes in the XML (.dix) file and generates the InterNostrum format.
Week 12 (From August 9 - To August 15) | Creating the round-trip checker and performing the validation for both directions of conversion.
Week 13 (From August 16 - To August 19) | Finishing the documentation and final testing.
Deliverable | The final compiler for converting the dictionary (.dix) file format in both directions.
Other summer plans
Google Summer of Code would be my main plan for the whole summer. During the first phase of GSoC I will be taking my exams, so I will finish most of the work during the community bonding phase. For the rest of GSoC I'll be able to dedicate around 30 to 40 hours a week to the project.