= Personal Information =
* Name: Amr Keleg
* E-mail address: amr.keleg@eng.asu.edu.eg / amr_mohamed@live.com
* IRC: AMR-KELEG
* Location: Cairo, Egypt
* Timezone: UTC+02 (Cairo, Egypt)
* GitHub: https://github.com/AMR-KELEG
* LinkedIn: https://www.linkedin.com/in/amr-mohamed-hosny/
* Twitter: https://twitter.com/amrkeleg
* Experimental blog: https://ak-blog.herokuapp.com
* Current job: MSc student and teaching assistant at the Computer and Systems Department, Faculty of Engineering, Ain Shams University, Cairo, Egypt.


=== Qualifications ===
* I graduated first in my class of 138 students (Computer and Systems Department, Faculty of Engineering, Ain Shams University).
* I successfully participated as a student in GSoC 2016 with the GNU Octave organisation (https://summerofcode.withgoogle.com/archive/2016/projects/5461783343005696/).
* I worked for one year as a full-time machine learning engineer, developing sentiment analysis models for Arabic.
* As a student, I participated in online (Google Code Jam) and on-site (ACM collegiate programming contest) competitive programming contests, solving more than 700 problems on different online judges.
* I am interested in open-source communities and have made several contributions to open-source projects (cltk, gensim, asciinema, Octave and Apertium).
* I completed Udacity's Data Analysis Nanodegree, using Python, R and Tableau to analyse different data-sets.


=== Skills ===
* Experience in coding with C++ and Python.
* Good command of Git and the GitHub contribution workflow.
* Use of Ubuntu as my main OS for more than three years.
* Basic knowledge of shell scripting.
* Basic knowledge of using gdb to debug large C++ projects.

= Project Information =
== Why is it that you are interested in Apertium? ==
I am interested in NLP, and especially in how to enable machines to understand and reason about human languages. This field has made it possible to perform tasks that could not be done before. One of the most interesting applications of NLP is machine translation: programs such as Apertium allow people to automatically translate text from other languages, improving the way people share knowledge and experience.

One of the main points that attracted me to Apertium is that most of its maintainers are researchers. The program is not only developed by experienced and skilled developers but also maintained by academics with a good understanding of the field and of the limitations and difficulties of automatic machine translation.

== Which of the published tasks are you interested in? What do you plan to do? ==
I am interested in working on "Unsupervised weighting of automata".
The task's main aim is to reduce the ambiguity of the analyses generated by non-deterministic finite-state transducers. In turn, this should improve the way Apertium ranks its analyses for most, if not all, of the developed language pairs.

== Reasons why Google and Apertium should sponsor it ==
I am currently pursuing a master's degree in computer science. Participating in GSoC will be a great step towards becoming a better researcher: the project will give me the chance to read, understand and implement the ideas presented in different publications, which will be a valuable experience and will help me acquire new and necessary skills.

== How and who will the project benefit in society? ==
The project should help developers and users generate weights for FSTs in an unsupervised way. This has the advantage of using a large corpus to generate the weights without requiring manual annotation.

Developers and users who want to use Apertium to translate documents whose vocabulary is too specific to be covered by the available annotated datasets will be able to adjust the FST weights using a large unannotated corpus of relevant documents.

== Coding challenge ==

Code repository: https://github.com/AMR-KELEG/apertium-unsupervised-weighting-of-automata

Steps completed:
* Used Apertium's analyser to generate the analyses for each tagged token.
* Created a unigram counter to estimate the probability of each analysis given a token.
* Used the unigram counts to rank the generated analyses for each token.
* Generated weighted string pairs from the corpus.
* Used hfst-strings2fst and hfst-fst2txt to convert the weighted FST into ATT format.
* Used lt-comp to generate a binary file for the FST using Apertium's lttoolbox.
* Ran lt-proc with the new weighted FST.
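The unigram-counting and ranking steps above can be sketched as follows. This is a minimal illustration rather than the repository code, and the token/analysis pairs are invented for the example:

```python
import math
from collections import Counter, defaultdict

# Hypothetical (surface token, analysis) pairs, as they might be
# produced by running an Apertium analyser over a tagged corpus.
pairs = [
    ("walks", "walk<vblex><pri><p3><sg>"),
    ("walks", "walk<n><pl>"),
    ("walks", "walk<vblex><pri><p3><sg>"),
    ("cat", "cat<n><sg>"),
]

# Count how often each analysis occurs for each token.
counts = defaultdict(Counter)
for token, analysis in pairs:
    counts[token][analysis] += 1

def weight(token, analysis):
    """Tropical-semiring weight -log P(analysis | token):
    the more frequent an analysis, the smaller its weight."""
    total = sum(counts[token].values())
    return -math.log(counts[token][analysis] / total)

def ranked(token):
    """Analyses for a token, best (lowest weight) first."""
    return sorted(counts[token], key=lambda a: weight(token, a))
```

Sorting by ascending weight puts the most frequent analysis first, which is exactly the ordering the weighted FST should reproduce.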

Note: You will need to build the master branch of lttoolbox so that the analysis weights are computed correctly (https://github.com/apertium/lttoolbox/commit/473766aba1704e0fa2b5c1c5672a728a0a20d390).

== Relevant publications ==
=== Weighting of automata ===
* Parameter Estimation for Probabilistic Finite-State Transducers (https://aclweb.org/anthology/P02-1001)

=== Background papers ===
* Weighted Finite-State Transducers in Speech Recognition (2002) (https://cs.nyu.edu/~mohri/postscript/csl01.pdf)
** Main points:
*** Basic concepts: semirings, types of semirings, and basic operations on transducers (composition, determinization, minimization).

* An Efficient Algorithm for the n-Best-Strings Problem (2002) (https://pdfs.semanticscholar.org/aa78/148fd79b10962a15c5aa7ec95c573250c3f6.pdf)
** Main points:
*** Basic concepts: determinization of WFSTs (an extension of the subset construction method).
*** An efficient algorithm for finding the n-best paths without brute force: the main idea is to expand each node of the transducer at most n times, since if a node has already been visited n times, the n best paths must be among those previous visits.
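To make the visit-count idea concrete, here is a toy sketch of that n-best search over a small weighted graph. The graph, its weights and the function name are invented for illustration; a real implementation would operate on the transducer's states:

```python
import heapq

def n_best_paths(graph, source, target, n):
    """Return up to n cheapest source-to-target paths.
    Instead of enumerating every path, each node may be expanded
    at most n times: once a node has been popped n times, any
    later path through it cannot belong to the n best.
    graph: {node: [(neighbour, weight), ...]}"""
    visits = {node: 0 for node in graph}
    heap = [(0.0, [source])]  # (cost so far, path)
    best = []
    while heap and len(best) < n:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if visits[node] >= n:
            continue  # pruned by the visit-count bound
        visits[node] += 1
        if node == target:
            best.append((cost, path))
            continue
        for neighbour, w in graph.get(node, []):
            heapq.heappush(heap, (cost + w, path + [neighbour]))
    return best

# Two ways from 'a' to 'd': via 'b' (cost 2.0) and via 'c' (cost 3.0).
toy = {"a": [("b", 1.0), ("c", 2.0)], "b": [("d", 1.0)],
      "c": [("d", 1.0)], "d": []}
```

The priority queue always expands the cheapest partial path first, so the visit-count bound is what keeps the search from exploring exponentially many paths.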


== Work Plan ==

{| class="wikitable" border="1"
|-
| Community Bonding
| Communicate with the maintainers and get to know Apertium better.

Solve some issues on GitHub.

Prepare a better list of publications that are going to be implemented.

Implement a baseline model for weighting automata.

|-
| Week 1
(27 May - 3 June) *
| Develop the first supervised model (Unigram counts).

Write a shell script for generating weights using a tagged corpus.
|-
| Week 2
(4 June - 10 June) *
| Read, understand and plan for implementing the publication behind the first unsupervised model.

Document the findings in a blog post.
|-
| Week 3-4
(11 June - 28 June)
| Finalise the first unsupervised model and compare it to the supervised one.
|-
| '''Evaluation 1'''
'''Deliverables: Two shell scripts for generating weights using both supervised and unsupervised techniques.'''
|-
| Week 5
(29 June - 5 July)
| Read, understand and plan for implementing the publication behind the second unsupervised model.

Document the findings in a blog post.
|-
| Week 6
(6 July - 12 July)
| Implement the second unsupervised model.
|-
| Week 7
(13 July - 22 July)
| Read, understand and plan for implementing the publication behind the third unsupervised model.

Document the findings in a blog post.
|-
| Week 8
(23 July - 26 July)
| Implement the third unsupervised model.
|-
| '''Evaluation 2'''
'''Deliverables: A shell script for using the second unsupervised model and a plan for implementing the third one.'''
|-
| Week 9-10
(27 July - 9 August)
| Solve issues related to the developed models.

Finalise and package the scripts.
|-
| Week 11-12
(10 August - 26 August)
| Write the required documentation and merge the code into Apertium's repositories.
|-
| '''Final evaluation'''
|
|}

* Note: I will have an examination as part of my master's degree in June. I will inform my mentor as soon as the schedule is announced. (I will need to study for three days before the exam, so my working hours will decrease during those days.)

== Contributions to Apertium ==
I have managed to fix multiple issues in different repositories of Apertium's code-base.

Merged pull requests:
* apertium-mlt-ara: Update the README file https://github.com/apertium/apertium-mlt-ara/pull/3
* phenny: Fix the generated url for the logs file https://github.com/apertium/phenny/pull/475
* lttoolbox: Fix the analysis weight computation bug https://github.com/apertium/lttoolbox/pull/49
[[Category:GSoC 2019 student proposals]]
