From Apertium
Revision as of 12:00, 8 April 2019 by OmarKassem (talk | contribs)

GSOC 2019 : Light alternative format for all XML files in an Apertium language pair[1]

Personal Details

General Summary

I am Omar Kassem, a senior Computer Engineering student. I currently live in Alexandria, Egypt, and I intend to pursue a master's degree abroad after finishing my undergraduate studies. I have some research experience in machine learning and deep learning, and I am currently working on my graduation project, a deep learning problem called Visual Question Answering. Last year I started learning more about NLP and found it interesting. I am currently working on a challenge offered by Google AI on Gendered Pronoun Resolution, which is part of coreference resolution, the task of pairing an expression with the entity it refers to. This is an important task for natural language processing.


Email :
LinkedIn :
IRC : OmarKassem
Github :
Time zone : GMT+2


I am a senior bachelor's student at Alexandria University in Egypt. I was recently granted a scholarship to study for a master's degree in data science at Innopolis University in Russia.
My undergraduate major is computer engineering, which exposed me to almost everything in computing, from the lowest level of zeros and ones to the highest level of HCI (human-computer interaction, which mainly deals with user interfaces).
The subjects I loved most were artificial intelligence, machine learning, data mining and deep learning, because of the great potential of the AI field, which has already solved, and could still solve, many of the problems humans face today.

Last Year GSoC

Last year I applied to the Classical Language Toolkit (CLTK) project[2] to enhance Arabic support and add new functionality (e.g. word segmentation, lemmatization, part-of-speech tagging), and this was my proposal[3], but unfortunately I wasn't accepted into the program. This year I applied only for this one task.



Last summer I was an iOS developer intern at InovaEg (a company here in Alexandria). I worked on adding new features to the Ummahlink iOS app using the Swift programming language.

Online courses

I have taken many online courses across several computer engineering tracks. The most notable one I finished was Udacity's Machine Learning Nanodegree, a six-month program; this is the certificate[4]. In this program I mastered the fundamentals of supervised, unsupervised, reinforcement, and deep learning.

Why interested in Apertium?

As I became interested in NLP, I found that the best way to explore and learn in this field is through open source. In open source, people from different regions contribute, which is at the heart of a good machine translation system. Apertium is the best open-source project I have found in the field: it is easy to understand and contribute to, and it has very active and supportive contributors.

Project Idea

Coding Challenge

The coding challenge was to set up a pair and train the existing weighted transfer rules code, which I had done several times while testing and debugging the code.
Since I didn't have a separate coding challenge, and the module was separate from the Apertium core as mentioned before, Francis Tyers (spectei) asked me to integrate the module, without the training part, with apertium-transfer, which I did in this pull request[5].
He then asked me to make the module depend only on libraries already used in Apertium rather than external ones, since I had used two libraries that Apertium does not use: pugixml to handle XML files and ICU to handle upper and lower case. Kevin Unhammer (unhammer) also gave me helpful review comments on the code, and these issues were resolved.

Why Google and Apertium should sponsor it?

Improving existing language pairs and adding new ones is a continuous process in the Apertium project. Dictionaries and transfer rules are currently written in XML, which can be difficult for some developers and makes development slow and time-consuming for them. There is therefore a need to remove this complexity by providing a simple plain-text format for writing dictionaries and transfer rules.
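To illustrate the kind of simplification meant here, the following is a minimal sketch in Python of turning a one-line text entry into a bilingual dictionary entry in XML. The text syntax (`house<n> : casa<n><f>`) and the function names `side_to_xml` and `light_to_xml` are hypothetical, invented only for this illustration; they are not the format this proposal would deliver and not an existing Apertium tool. The sketch also ignores multiword lemmas and other entry types.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical light syntax (an assumption, not an Apertium format):
#   left-lemma<tag>... : right-lemma<tag>...
SIDE = re.compile(r"^([^<>]*)((?:<[^<>]+>)*)$")
TAG = re.compile(r"<([^<>]+)>")


def side_to_xml(parent, name, text):
    """Add an <l> or <r> child holding the lemma and one <s n="..."/> per tag."""
    match = SIDE.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse side: {text!r}")
    lemma, tags = match.groups()
    node = ET.SubElement(parent, name)
    node.text = lemma
    for tag in TAG.findall(tags):
        ET.SubElement(node, "s", {"n": tag})
    return node


def light_to_xml(line):
    """Convert one hypothetical light line into an <e><p>...</p></e> entry."""
    left, sep, right = line.partition(" : ")
    if not sep:
        raise ValueError(f"missing ' : ' separator in: {line!r}")
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    side_to_xml(pair, "l", left)
    side_to_xml(pair, "r", right)
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    print(light_to_xml("house<n> : casa<n><f>"))
```

The point of the sketch is only that a single readable line can carry the same information as a nested XML entry; the real format would need to be designed and agreed with the mentors.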

How and whom will it benefit in society?

As the project will hopefully improve Apertium's translations and bring them closer to human translation, Apertium will become more reliable and efficient for daily use and for document translation. In the long term, this will enrich the data available for low-resource languages, and hence help the speakers of such languages enrich their languages and preserve them from extinction.

Other ideas?

I would also love to work on the "Light alternative format for all XML files in an Apertium language pair"[6] idea alongside the weighted transfer rules idea, if there is enough time. The two ideas intersect in the XML transfer files, and since I am already familiar with the documentation of these files and have written a module to handle, match and apply the rules, I think I could design a format lighter than XML and write converter scripts between the two formats.
I hope that in the next few days I will be able to finish the coding challenge for this idea, so that I can be considered for it too if no one else applies.

Work plan

Exams and community bonding

My final exams run from May 27 to June 20, which is almost exactly the first phase of GSoC this year. Since I will not be able to work during my exams, and I want at least one free week before the first exam, I will start earlier, even before the announcement of accepted students. I can do this because I will continue contributing to the module anyway, whether or not I am accepted.
So I will work on the first phase from April 19 to May 16. From May 17 to June 20 I will be preparing for and taking my exams; I will still be able to make minor changes if necessary, and I will be available for discussions and chats about the first phase and the next one, so that I am ready to design and implement the code when I come back.



Week 1

(From April 5 - To April 11)

Continue code reformatting as proposed by mentors.

Week 2

(From April 12 - To April 18)

See what the mentors suggest modifying next in the code.
Discuss with them some of the thoughts on the proposed documentation.
Begin documenting the newly refactored module.


Weighted transfer rules module is integrated with apertium-transfer.

First milestone

Week 1

(From April 19 - To April 25)

If the code needs further refactoring, bug fixing, polishing, documentation, etc., start on it.

Week 2

(From April 26 - To May 2)

Start designing and implementing some of the valid ideas proposed by or discussed with mentors. For now I think sentence splitting is the most promising idea; another possibility is replacing yasmet with another tool or method.

Week 3

(From May 3 - To May 9)

Continue coding and start testing and debugging.

Week 4

(From May 10 - To May 16)

Finish coding, testing and debugging. Write documentation. Train one chosen pair and evaluate its accuracy.


Hopefully, a more accurate, cleaner and more robust weighted transfer rules module.

Week 5

(From June 21 - To June 28)

After exams, I will refamiliarize myself with the code because my memory is not good enough :). I will also write the mentor evaluation, complete any unfinished documentation, tests or evaluations, and fix any reported issues or bugs.

Second milestone

Week 5

(From June 28 - To July 4)

Read the apertium2 document again, review deprecated or out-of-date parts from different sources, and collect all the up-to-date transfer file specifications in a new document.

Week 6

(From July 5 - To July 11)

Fix any errors found in the module after collecting the up-to-date specifications.
Update and modify the ambiguous transfer file code to handle both interchunk and postchunk transfer files.

Week 7

(From July 12 - To July 18)

Continue coding and start testing and debugging.

Week 8

(From July 19 - To July 25)

Fix any reported bugs or issues.
Finish coding, testing and debugging. Write documentation. Train one chosen pair and evaluate its accuracy.
Write mentor evaluation.


Extended weighted transfer rules module.

Third milestone

Week 9

(From July 26 - To August 1)

Fix any reported bugs or issues on the previous deliverable.
Start on a new proposed idea regarding weighted transfer rules or the lightweight alternative to XML.
If the latter is chosen, I will start familiarizing myself with the interNOSTRUM style.
Start designing and documenting an interNOSTRUM-style format for at least the transfer rules XML files.

Week 10

(From August 2 - To August 8)

Start writing converters to XML and from XML.

Week 11

(From August 9 - To August 15)

Continue coding. Fix any reported bugs or issues.
Finish coding, debugging and testing, and compare results with XML.
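One way the comparison with XML could be done is a round-trip check: convert an XML entry to the light format, convert it back, and verify that the two XML documents are equivalent. The sketch below, in Python, assumes a hypothetical one-line syntax (`house<n> : casa<n><f>`) and hypothetical helper names (`xml_to_light`, `light_to_xml`, `same_xml`); none of these are part of Apertium, and the eventual interNOSTRUM-style format may look quite different.

```python
import re
import xml.etree.ElementTree as ET

TAG = re.compile(r"<([^<>]+)>")


def xml_to_light(entry_xml):
    """Turn one <e><p><l>..</l><r>..</r></p></e> entry into a hypothetical light line."""
    entry = ET.fromstring(entry_xml)
    sides = []
    for name in ("l", "r"):
        node = entry.find(f"p/{name}")
        if node is None:
            raise ValueError(f"entry has no <{name}> element")
        tags = "".join(f"<{s.get('n')}>" for s in node.findall("s"))
        sides.append((node.text or "") + tags)
    return " : ".join(sides)


def light_to_xml(line):
    """Inverse of xml_to_light for the same hypothetical syntax."""
    left, right = line.split(" : ")
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for name, text in (("l", left), ("r", right)):
        node = ET.SubElement(pair, name)
        node.text = text.split("<", 1)[0]  # lemma is everything before the first tag
        for tag in TAG.findall(text):
            ET.SubElement(node, "s", {"n": tag})
    return ET.tostring(entry, encoding="unicode")


def same_xml(a, b):
    """Compare two XML strings after canonicalization, ignoring serialization details."""
    return ET.canonicalize(a) == ET.canonicalize(b)


original = '<e><p><l>house<s n="n"/></l><r>casa<s n="n"/><s n="f"/></r></p></e>'
light = xml_to_light(original)
print(light)                                    # house<n> : casa<n><f>
print(same_xml(original, light_to_xml(light)))  # True
```

Comparing canonical forms rather than raw strings avoids false differences from formatting, such as `<s n="n"/>` versus `<s n="n" />`. `ET.canonicalize` requires Python 3.8 or later.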

Week 12

(From August 16 - To August 19)

Write documentation.
Write mentor evaluation.


A new light interNOSTRUM-style format as an alternative to the XML format, with converters in both directions.

Other summer plans

- Regarding my part-time job: as Francis told me it is not compatible with GSoC, I have decided to leave the job by April 15, before the first phase.
- During the first phase of GSoC I will still be in college, but I will be able to allocate at least 30 hours per week to GSoC.
- During the second and third phases, college will have finished, and I will be able to allocate at least 40 hours per week to GSoC.