User:AMR-KELEG/GSoC19 Proposal

Revision as of 20:49, 28 March 2019

Personal Information

  • Name: Amr Keleg
  • E-mail address: amr.keleg@eng.asu.edu.eg / amr_mohamed@live.com
  • IRC: AMR-KELEG
  • Location: Cairo, Egypt
  • Timezone: UTC+02
  • Current job: An MSc student and teaching assistant at the Computer and Systems Department, Faculty of Engineering, Ain Shams University, Cairo, Egypt.

Qualifications

  • I graduated first in my class of 138 students (Computer and Systems Department, Faculty of Engineering, Ain Shams University).
  • I have successfully participated as a student in GSoC 2016 as part of the GNU Octave organisation.
  • I have worked for one year as a full-time machine learning engineer, developing a sentiment analysis model for Arabic.
  • As a student, I participated in online (Google Code Jam) and on-site (ACM International Collegiate Programming Contest) competitive programming contests, solving more than 700 problems on different online judges along the way.

  • I am interested in open-source communities and have made several contributions to open-source projects (CLTK, Gensim, asciinema, Octave, and Apertium).
  • I have completed Udacity's Data Analyst Nanodegree, using Python throughout its courses to analyse different datasets.

Skills

  • Experience in coding with C++ and Python.
  • Good command of git and the GitHub process of contribution.
  • More than three years of using Ubuntu as my main OS.
  • Basic knowledge of shell scripting.

Project Information

Why is it that you are interested in Apertium?

I am interested in NLP, and especially in the idea of enabling machines to understand and reason about human languages. This field has made it possible to perform tasks that couldn't have been done before. One of the most interesting applications of NLP is machine translation: programs such as Apertium let people automatically translate text from other languages, which has improved the way people share knowledge and experience.

One of the main points that attracted me to Apertium is that most of its maintainers are researchers. The program is not only developed by experienced, skilled developers, but also maintained by academic researchers who have a good understanding of the field and of the limitations and difficulties of automatic machine translation.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in working on "Unsupervised weighting of automata". The task's main aim is to reduce the ambiguity of the analyses generated by non-deterministic finite-state transducers. The task should in turn improve the way Apertium ranks its analyses for most, if not all, of the developed language pairs.
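To make the ambiguity concrete, here is a hypothetical analyser session (the binary's name and the exact analyses are placeholders; only the Apertium stream format is real):

    # A single surface form often has several competing analyses.
    $ echo "book" | lt-proc eng.automorf.bin
    ^book/book<n><sg>/book<vblex><inf>$

Weighting the transducer lets downstream tools prefer the more frequent of the two readings instead of treating them as equally likely.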

Reasons why Google and Apertium should sponsor it

How and whom will the project benefit in society?

Coding challenge

Code repository: https://github.com/AMR-KELEG/apertium-unsupervised-weighting-of-automata

Steps completed:

  • Used Apertium's analyser to generate the analyses for each tagged token.
  • Created a unigram counter to estimate the probability of each analysis given a token.
  • Used the unigram counts to rank the generated analyses for each token (a sketch of these steps follows this list).
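A minimal sketch of the counting and ranking steps, assuming the tagged corpus has been flattened to one tab-separated "surface<TAB>analysis" line per token; this input format, the file names, and the "input:output::weight" pair-string convention are assumptions for illustration:

    # Count (surface, analysis) pairs, convert each count into a relative
    # frequency, and emit weighted pair strings where the weight is the
    # negative log probability (tropical semiring). Assumes tokens and
    # analyses contain no spaces.
    sort tagged_analyses.tsv | uniq -c |
    awk '{
           key = $2 "\t" $3          # surface <TAB> analysis
           count[key] = $1
           total[$2] += $1           # occurrences of the surface form
         }
         END {
           for (k in count) {
             split(k, f, "\t")
             printf "%s:%s::%f\n", f[1], f[2], -log(count[k] / total[f[1]])
           }
         }' > weighted_pairs.txt

For example, if "book" is tagged as a noun 30 times and as a verb 10 times, the noun reading gets weight -log(30/40) ≈ 0.29 and the verb reading -log(10/40) ≈ 1.39, so the lighter (more probable) path wins.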

Steps to be done:

  • Generate weighted string pairs from the corpus.
  • Use hfst-strings2fst and hfst-fst2txt to build the weighted FST and convert it into ATT format.
  • Use lt-comp to compile the ATT file into a binary FST using Apertium's tools.
  • Use lt-proc with the new weighted FST (a sketch of the whole pipeline follows this list).
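A hedged sketch of that conversion pipeline, under the same assumptions as above; compiling ATT input directly with lt-comp is precisely what this task plans to enable, so that step describes the goal rather than guaranteed current behaviour:

    # Build a weighted transducer over the tropical semiring from the
    # weighted pair strings produced in the previous step.
    hfst-strings2fst -j -f openfst-tropical weighted_pairs.txt -o weighted.hfst

    # Dump the transducer as ATT text.
    hfst-fst2txt weighted.hfst -o weighted.att

    # Compile the ATT file into lttoolbox's binary format
    # (reading ATT here is part of this project's plan).
    lt-comp lr weighted.att weighted.bin

    # Analyse text with the new weighted transducer.
    echo "a short test sentence" | lt-proc weighted.bin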

Work Plan

  • Community bonding: Communicate with the maintainers, get to know Apertium better, and solve some issues on GitHub.
  • Week 1 (27 May - 3 June): Implement a baseline model for weighting automata.
  • Week 2 (4 June - 10 June): Develop the first supervised model (unigram counts) and write a shell script for generating weights using a tagged corpus.
  • Week 3 (11 June - 17 June): Read, understand, and plan the implementation of the publication behind the first unsupervised model.
  • Week 4 (18 June - 24 June): Finalise the first unsupervised model and compare it to the supervised one.
  • Evaluation 1. Deliverables: two shell scripts for generating weights using both supervised and unsupervised techniques.
  • Week 5 (29 June - 5 July): Read, understand, and plan the implementation of the publication behind the second unsupervised model.
  • Week 6 (6 July - 12 July): Implement the second unsupervised model.
  • Week 7 (13 July - 22 July): Read, understand, and plan the implementation of the publication behind the third unsupervised model.
  • Week 8 (23 July - 26 July): Finalise the second unsupervised model.
  • Evaluation 2. Deliverables: a shell script for using the second unsupervised model and a plan for implementing the third one.
  • Week 9 (27 July - 2 August): Implement the third unsupervised model.
  • Week 10 (3 August - 9 August): Solve issues related to the developed models.
  • Weeks 11-12 (10 August - 26 August): Write the required documentation and merge the code into Apertium's repositories.
  • Final evaluation.