User:AMR-KELEG/GSoC19 Proposal


Personal Information

  • Name: Amr Keleg
  • E-mail address: amr.keleg@eng.asu.edu.eg / amr_mohamed@live.com
  • IRC: AMR-KELEG
  • Location: Cairo, Egypt
  • Timezone: UTC+02
  • Current job: MSc student and teaching assistant at the Computer and Systems Department, Faculty of Engineering, Ain Shams University, Cairo, Egypt.

Qualifications

  • I graduated first in my class of 138 students (Computer and Systems Department, Faculty of Engineering, Ain Shams University).
  • I have worked for one year as a full-time machine learning engineer. My role was developing a sentiment analysis model for Arabic.
  • As a student, I participated in online (Google Code Jam) and on-site (ACM Collegiate Programming Contest) competitive programming contests. Through these contests, I solved more than 700 problems on different online judges.
  • I am interested in open source communities and have contributed to several open source projects (cltk, gensim, asciinema, Octave and Apertium).
  • I have completed Udacity's data analysis nanodegree. Throughout its courses, I used Python, R and Tableau to analyse different data sets.

Skills

  • Experience in coding with C++ and Python.
  • Good command of git and the GitHub process of contribution.
  • Usage of Ubuntu as the main OS for more than 3 years.
  • Basic knowledge of shell scripting.
  • Basic knowledge of using gdb to debug large C++ projects.

Project Information

Why is it that you are interested in Apertium?

I am interested in NLP, and especially in how to enable machines to understand and reason about human languages. This field has made it possible to perform tasks that could not be done before. One interesting application of NLP is machine translation. Machine translation programs such as Apertium allow people to automatically translate text from other languages, which has improved the way people share knowledge and experience.

One of the main points that attracted me to Apertium is that most of the maintainers are researchers. The program is not only developed by experienced and skilled developers but also maintained by academic researchers who have a good understanding of the field and of the limitations and difficulties of machine translation.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in working on "Unsupervised weighting of automata". The task's main aim is to reduce the ambiguity of the analyses generated by non-deterministic finite-state transducers. This should in turn improve the way Apertium ranks its analyses for most, if not all, of the developed language pairs.

Reasons why Google and Apertium should sponsor it

I am currently pursuing my master's degree in computer science. Participating in GSoC would be a great step towards becoming a better researcher. The project will give me the chance to read, understand and implement the ideas presented in different publications. This will be a great experience and will help me acquire new and necessary skills.

How and whom will the project benefit in society?

The project should help developers and users generate weights for FSTs in an unsupervised way. This has the advantage of using a large corpus to generate these weights without requiring manual annotation.

And how does this in turn benefit society? —Firespeaker (talk) 19:57, 30 March 2019 (CET)

Coding challenge

Code repository: https://github.com/AMR-KELEG/apertium-unsupervised-weighting-of-automata

Steps completed:

  • Used Apertium's analyser to generate the analyses for each tagged token.
  • Created a unigram counter to estimate the probability of each analysis given a token.
  • Used the unigram counts to rank the generated analyses for each token.
  • Generated weighted string pairs from the corpus.
  • Used hfst-strings2fst and hfst-fst2txt to convert the weighted FST into ATT format.
  • Used lt-comp from Apertium's lttoolbox to compile the FST into a binary file.
  • Ran lt-proc with the new weighted FST.
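
The counting, ranking and string-pair steps above can be sketched in Python as follows. This is a minimal illustration rather than the code in the repository: the function names are hypothetical, the weight is assumed to be the negative log of the relative frequency (a cost, so lower is better), and the output line format (`form:analysis`, a tab, then the weight) is an assumption about what hfst-strings2fst expects and should be checked against its documentation. Escaping of special characters such as `:` inside forms or analyses is not handled.

```python
import math
from collections import Counter, defaultdict


def unigram_weights(tagged_pairs):
    """Map each (form, analysis) pair to the cost -log P(analysis | form).

    tagged_pairs is an iterable of (surface form, analysis) tuples taken
    from a tagged corpus. Probabilities are plain relative frequencies.
    """
    counts = defaultdict(Counter)
    for form, analysis in tagged_pairs:
        counts[form][analysis] += 1

    weights = {}
    for form, analysis_counts in counts.items():
        total = sum(analysis_counts.values())
        for analysis, count in analysis_counts.items():
            weights[(form, analysis)] = -math.log(count / total)
    return weights


def rank_analyses(weights, form):
    """Return the analyses seen for a form, cheapest (most probable) first."""
    candidates = [(a, w) for (f, a), w in weights.items() if f == form]
    return sorted(candidates, key=lambda pair: pair[1])


def to_weighted_string_pairs(weights):
    """Format the weights as 'form:analysis<TAB>weight' lines (assumed format)."""
    return [
        f"{form}:{analysis}\t{weight:.4f}"
        for (form, analysis), weight in sorted(weights.items())
    ]
```

For example, a toy corpus in which "saw" is tagged three times as a verb and once as a noun gives the verb reading the lower cost, so it is ranked first.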

Note: You will need to build the master branch of lttoolbox so that the analysis weights are computed correctly (https://github.com/apertium/lttoolbox/commit/473766aba1704e0fa2b5c1c5672a728a0a20d390).

Relevant publications

Weighting of automata

Background papers

  • Weighted Finite-State Transducers in Speech Recognition (2002) (https://cs.nyu.edu/~mohri/postscript/csl01.pdf)
    • Main points:
      • Basic concepts (Semi-ring/ Types of semi-rings/ Basic operations on transducers (Composition - Determinization - Minimization))
  • An Efficient Algorithm for the n-Best-Strings Problem (2002) (https://pdfs.semanticscholar.org/aa78/148fd79b10962a15c5aa7ec95c573250c3f6.pdf)
    • Main points:
      • Basic concepts (Determinization of WFST (Extension of subset construction method))
      • An efficient algorithm for finding the n best paths without brute force. The main idea is to allow at most n paths to visit any node of the transducer: if a node has already been visited n times, any later visit cannot be part of the n best paths, because the n best paths through that node were among the earlier visits.
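
The visit-limit idea in the last bullet can be illustrated with a short Python sketch. It is a simplified n-best-paths search over a plain weighted graph with non-negative arc weights, not the full algorithm of the paper (which also handles determinization so that the n best distinct strings are returned). The function name and the arc representation are hypothetical.

```python
import heapq


def n_best_paths(arcs, start, finals, n):
    """Return up to n cheapest start-to-final paths as (cost, labels) tuples.

    arcs maps a state to a list of (next_state, label, weight) triples.
    Weights must be non-negative. A state is expanded at most n times,
    since later visits cannot belong to any of the n best paths.
    """
    visits = {}
    results = []
    counter = 0  # unique tie-breaker so the heap never compares label lists
    heap = [(0.0, counter, start, [])]
    while heap and len(results) < n:
        cost, _, state, labels = heapq.heappop(heap)
        visits[state] = visits.get(state, 0) + 1
        if visits[state] > n:
            continue  # this state already produced its n cheapest prefixes
        if state in finals:
            results.append((cost, labels))
        for next_state, label, weight in arcs.get(state, []):
            counter += 1
            heapq.heappush(
                heap, (cost + weight, counter, next_state, labels + [label])
            )
    return results
```

Because partial paths are popped in order of increasing cost, the k-th time a state is popped corresponds to the k-th cheapest path reaching it, which is why the counter can safely prune everything after the n-th visit.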


Work Plan

Community Bonding

Communicate with the maintainers and get to know Apertium better.

Solve some issues on GitHub. Prepare a better list of publications that are going to be implemented. Implement a baseline model for weighting automata.

Week 1

(27 May - 3 June)

Develop the first supervised model (Unigram counts).

Write a shell script for generating weights using a tagged corpus.

Week 2

(4 June - 10 June)

Read, understand and plan the implementation of the publication for the first unsupervised model.

Week 3-4

(11 June - 24 June)

Finalise the first unsupervised model and compare it to the supervised one.

Evaluation 1

Deliverables: Two shell scripts for generating weights using both supervised and unsupervised techniques.

Week 5

(29 June - 5 July)

Read, understand and plan the implementation of the publication for the second unsupervised model.

Week 6

(6 July - 12 July)

Implement the second unsupervised model.

Week 7

(13 July - 22 July)

Read, understand and plan the implementation of the publication for the second unsupervised model.

Week 8

(23 July - 26 July)

Implement the second unsupervised model.

Evaluation 2

Deliverables: A shell script for using the second unsupervised model and a plan for implementing the third one.

Week 9

(27 July - 2 August)

Implement the third unsupervised model.

Week 10

(3 August - 9 August)

Solve issues related to the developed models.

Week 11-12

(10 August - 26 August)

Write the required documentation and merge the code into Apertium's repositories.

Final evaluation

Contributions to Apertium

I have fixed multiple issues in different repositories.

Merged pull requests: