User:OmarKassem/Proposal

GSoC 2019: Light alternative format for all XML files in an Apertium language pair[1]

Personal Details

General Summary

I am Omar Kassem, a senior Computer Engineering student. I have done some research in machine learning and deep learning, and I am currently working on my graduation project, a deep learning problem called Visual Question Answering. Last year I started learning more about NLP and found it very interesting. I am currently working on a challenge offered by Google AI on Gendered Pronoun Resolution, which is part of coreference resolution: the task of pairing an expression with the entity it refers to. This is an important task for natural language processing.


Contacts

Email: omarahmed1473@outlook.com
LinkedIn: https://www.linkedin.com/in/omarQasim10/
IRC: OmarKassem
GitHub: https://github.com/omarahmed10
Time zone: GMT+2


Education

I am a student at Alexandria University in Egypt, majoring in Computer Engineering. The undergraduate curriculum in Computer and System Engineering at Alexandria University has introduced me to a wide variety of engineering subjects. Courses such as Artificial Intelligence, Data Mining, Deep Learning, Networks, Compilers, Data Structures & Algorithms, Software Engineering, and Operating Systems have given me a strong footing in the theoretical concepts of Computer Science and Engineering. While offering both depth and breadth across the field, these courses put into perspective the importance and relevance of Computer Science and Engineering and the application of its fundamentals to real-world problems. As a result, I have come to value learning and continuing to develop my knowledge of Computer Science.


Last Year GSoC

Last year I applied to the Classical Language Toolkit project (CLTK)[2] to enhance its Arabic support and add new functionality (e.g. a word segmenter, lemmatization, part-of-speech tagging); this was my proposal[3]. Unfortunately, I was not accepted into the program. This year I am applying for this task only.

Why interested in Apertium?

Since I became interested in NLP, I have found that open source is the best way to explore and learn more in this field. In open source, people from different regions contribute, which is at the heart of a good machine translation system. Apertium is the best open source project I have found in the field: it is easy to understand and contribute to, and it has very active and supportive contributors.


Project Idea

Code/Plain-text formats for Apertium data

The MT strategy used in the system is a classical shallow-transfer or transformer system consisting of an 8-module assembly line. To ease diagnosis and independent testing, modules communicate with each other using text streams. This allows some of the modules to be used in isolation, independently of the rest of the MT system, for other natural language processing tasks.

The main idea is to develop compilers that convert the linguistic data into the corresponding efficient form used by each module of the engine. Four compilers are proposed in this project; a brief description of each follows:

1- A source-to-source compiler which takes in a MorphTrans-style format (with keywords in English) (described here) and generates the current XML (i.e., .t1x, .t2x and .t3x).

2- An XSLT stylesheet which, executed on a standard XSLT processor, reads in the XML file with structural transfer rules and generates MorphTrans-style code.

3- A source-to-source compiler which takes in an InterNostrum-formatted file (described here) and outputs a .dix file, which is used by the four lexical processing modules (morphological analyser, lexical transfer, morphological generator, post-generator). These modules currently read binary files containing a compact and efficient representation of a class of finite-state transducers. These binaries are generated from .dix XML dictionaries.

4- A multi-pass XSLT stylesheet to convert the XML dictionary file (.dix) to an InterNostrum-formatted file (morphological text-format dictionary).
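To make the dictionary direction (compiler 3) more concrete, here is a minimal sketch of a text-to-.dix conversion. The input format below (one `lemma; stem; paradigm` entry per line) is a hypothetical stand-in chosen for illustration and is not the actual InterNostrum syntax; the function name `compile_entries` and the sample entries are likewise assumptions. Only the output shape (`<e lm="...">`, `<i>`, `<par n="..."/>` inside a `<section>`) follows the existing .dix layout.

```python
import xml.etree.ElementTree as ET


def compile_entries(text: str) -> ET.Element:
    """Build a minimal .dix tree from 'lemma; stem; paradigm' lines.

    The line format is a hypothetical placeholder, not real InterNostrum.
    """
    dictionary = ET.Element("dictionary")
    section = ET.SubElement(dictionary, "section", id="main", type="standard")
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        lemma, stem, paradigm = (field.strip() for field in line.split(";"))
        entry = ET.SubElement(section, "e", lm=lemma)
        ET.SubElement(entry, "i").text = stem      # invariant part of the word
        ET.SubElement(entry, "par", n=paradigm)    # reference to a paradigm
    return dictionary


# Illustrative entries only; the paradigm names are assumptions.
SAMPLE = """
# lemma; stem; paradigm
house; house; house__n
wolf; wol; wol/f__n
"""

if __name__ == "__main__":
    print(ET.tostring(compile_entries(SAMPLE), encoding="unicode"))
```

A real compiler would also have to emit the alphabet, symbol definitions and paradigm definitions, which this sketch leaves out.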


- A round-trip check tool will convert from text to XML and then from XML back to text (or vice versa) to validate that no errors are introduced during conversion from one form to the other.
- The Makefiles will also be edited so that the compilers run whenever the new files are updated, converting them to the other format (e.g. XML).
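The round-trip check described above could look roughly like the following sketch. It assumes the two converters are available as plain Python functions from string to string; `toy_xml_to_text` and `toy_text_to_xml` are hypothetical toy converters included only so the example runs, and they stand in for the real compilers. The comparison uses `xml.etree.ElementTree.canonicalize` (Python 3.8 or later) so that whitespace and attribute order do not count as differences.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def canonical(xml_text: str) -> str:
    """Normalise XML so that pure formatting differences are ignored."""
    return ET.canonicalize(xml_text, strip_text=True)


def round_trip_ok(xml_text: str,
                  xml_to_text: Callable[[str], str],
                  text_to_xml: Callable[[str], str]) -> bool:
    """Return True if XML -> text -> XML reproduces the same document."""
    rebuilt = text_to_xml(xml_to_text(xml_text))
    return canonical(rebuilt) == canonical(xml_text)


# Hypothetical toy converters, standing in for the real compilers.
def toy_xml_to_text(xml_text: str) -> str:
    root = ET.fromstring(xml_text)
    return "\n".join(
        f'{e.get("lm")};{e.findtext("i")};{e.find("par").get("n")}'
        for e in root.iter("e")
    )


def toy_text_to_xml(text: str) -> str:
    root = ET.Element("section")
    for line in text.splitlines():
        lemma, stem, paradigm = line.split(";")
        entry = ET.SubElement(root, "e", lm=lemma)
        ET.SubElement(entry, "i").text = stem
        ET.SubElement(entry, "par", n=paradigm)
    return ET.tostring(root, encoding="unicode")


ORIGINAL = """
<section>
  <e lm="house"><i>house</i><par n="house__n"/></e>
</section>
"""

if __name__ == "__main__":
    print(round_trip_ok(ORIGINAL, toy_xml_to_text, toy_text_to_xml))
```

Comparing canonical forms rather than raw strings is a design choice: it lets the text format drop indentation and comments without the check reporting a false error.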


Coding Challenge

I have successfully completed the coding challenge: I wrote a parser that converts *.mode shell-script fragments into a modes.xml file.
The code is available on GitHub.[4]
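As an illustration of the kind of conversion involved (this is not the submitted code, which is linked above), the sketch below turns one shell pipeline into a `<mode>` element. It rests on simplifying assumptions: stages are separated by `|` with no pipes inside quotes, tokens starting with `-` or `$` are options kept in the program name, every other argument is a data file, and only the file's base name is kept. The function name `mode_to_xml` and the sample paths are hypothetical.

```python
import os
import shlex
import xml.etree.ElementTree as ET


def mode_to_xml(mode_name: str, pipeline_text: str) -> ET.Element:
    """Convert one shell pipeline into a <modes><mode><pipeline> tree."""
    modes = ET.Element("modes")
    mode = ET.SubElement(modes, "mode", name=mode_name)
    pipeline = ET.SubElement(mode, "pipeline")
    # Assumption: '|' never appears inside a quoted argument.
    for stage in pipeline_text.split("|"):
        tokens = shlex.split(stage)
        if not tokens:
            continue
        name_parts = [tokens[0]]
        files = []
        for token in tokens[1:]:
            # Assumption: options start with '-' or '$'; the rest are files.
            if token.startswith(("-", "$")):
                name_parts.append(token)
            else:
                files.append(token)
        program = ET.SubElement(pipeline, "program", name=" ".join(name_parts))
        for path in files:
            ET.SubElement(program, "file", name=os.path.basename(path))
    return modes


# Illustrative pipeline; the install paths are made up for the example.
SAMPLE = (
    "lt-proc '/usr/share/apertium/apertium-en-es/en-es.automorf.bin' | "
    "apertium-tagger -g $2 '/usr/share/apertium/apertium-en-es/en-es.prob' | "
    "lt-proc -g '/usr/share/apertium/apertium-en-es/en-es.autogen.bin'"
)

if __name__ == "__main__":
    print(ET.tostring(mode_to_xml("en-es", SAMPLE), encoding="unicode"))
```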

Why should Google and Apertium sponsor it?

Adequate documentation of the code and auxiliary files is crucial for the success of open source software. In the case of an MT system, this implies carefully defining a systematic format for each source of linguistic data used by the system. The format used now is XML, which is very overt and clear but clumsy and hard to write. It can also be difficult for some developers, making development hard and time-consuming for them. There is therefore a need to remove this complexity and provide a simple text format for writing the dictionaries and the transfer rules.


How and whom will it benefit in society?

Changing the format of the dictionaries and the structural transfer rules will allow linguists to focus on describing the lexicon and morphology of the language in question in a simple format, and will free them from having to think like a programmer.

Schedule

Pre-GSoC

1- Investigating Apertium further and studying how it processes the data stream across the modules.
2- Contacting the mentors to learn more about the MorphTrans-style format.
3- Learning to use XSLT stylesheets.

First milestone

I will start working on the task early: I plan to start coding on May 6, once Google announces the results, to save time, as my final exams run from May 27 to June 20.

Week 1

(From May 6 - To May 13)

Investigating the transfer rule files further and understanding the MorphTrans-style format.

Week 2

(From May 13 - To May 20)

Researching the best way to augment the current MorphTrans-style format and extend it to cover the .t2x and .t3x files.

Week 3

(From May 20 - To May 27)

Developing the first compiler, which takes in the MorphTrans-style format and generates the current XML (i.e., .t1x, .t2x and .t3x).

Deliverable

The MorphTrans to XML compiler

Week 4

(From May 27 - To June 20)

I will be taking my exams, but I will be available for any changes to the delivered work or any discussion of the upcoming tasks.


Second milestone

Week 5

(From June 21 - To June 28)

Fixing any errors in the MorphTrans-to-XML compiler and starting work on the compiler that does the reverse: taking the XML and generating MorphTrans-style input using XSLT.

Week 6

(From June 28 - To July 4)

Finishing the compiler which takes XML and generates MorphTrans-style input using XSLT.

Week 7

(From July 5 - To July 11)

Creating the round-trip checker and performing validation for both directions of conversion.
Updating the Makefile so that updating one format updates the other.

Deliverable

The final compiler for two-way conversion of the transfer rule format.


Third milestone

Week 8

(From July 12 - To July 18)

Investigating the different dictionary files further, understanding the InterNostrum format and finding an optimal way to specify the alphabet for InterNostrum.

Week 9

(From July 19 - To July 25)

Developing the compiler which takes in the InterNostrum format and generates the XML (.dix) file.

Week 10

(From July 26 - To August 1)

Finishing the compiler which takes in the InterNostrum format and generates the XML (.dix) file; starting to investigate the Metadix format and to develop the reverse compiler.

Week 11

(From August 2 - To August 8)

Finishing the reverse compiler, which takes in the XML (.dix) file and generates the InterNostrum format.

Week 12

(From August 9 - To August 15)

Creating the round-trip checker and performing validation for both directions of conversion.
Updating the Makefile so that updating one format updates the other.

Week 13

(From August 16 - To August 19)

Finishing the documentation and final testing.

Deliverable

The final compiler for two-way conversion of the dictionary (.dix) file format.

Other summer plans

Google Summer of Code will be my main plan for the whole summer.
During the first phase of GSoC I will be taking my exams, so I will finish most of the work during the community bonding phase.
For the rest of GSoC I will be able to dedicate around 30 to 40 hours per week to the project.
