User:Wongalvis/proposal

Name: Alvis Wong

Email: wongalvis61498@gmail.com

IRC: Alvis

Linkedin: https://www.linkedin.com/in/wongalvis

Github: wongalvis

Interest in Machine Translation:

Having grown up in a bilingual society, learning and using two languages in parallel, and thus translating between them, has been an essential part of our daily communication, spoken and written. Less than a decade ago, when web translation services became popular, it was fascinating to me that a machine could appear to understand and translate between languages. However, we all knew that they produced inaccurate, often funny and inappropriate, translations. Translation accuracy has improved over time, and that made me more curious about the system behind it, the one that powers a machine “brain” that interprets and speaks languages. After an interview with a data science researcher, I was introduced to the basic ideas of mathematics and statistics applied to make machines “learn” languages. It was still a mind-boggling, and rather overwhelming, idea to me. Since then, I have acquired more skills and knowledge in Computer Science and Mathematics, especially in the underlying theory. I wish to explore machine translation further by working through an implementation of it.

Interest in Apertium:

I first encountered Apertium during Google Code-in. Finding an organisation that aligned with my curiosity about machine translation sparked my interest. At that time I took on a rather non-technical task, entering English-Chinese lexicon entries, and I hoped that one day I could work on the code that powers machine translation. This year, being eligible for GSoC, I came to study Apertium in more depth. The more I learn about it, the more motivated I am to work on software that performs one of the most challenging tasks in daily communication, one that I have faced for many years.

Task:

I am interested in working on “Add weights to lttoolbox”. lttoolbox is one of the core tools in Apertium and is used to build finite-state transducers for analysing lexicons. Unlike HFST, it lacks the ability to weight lexemes and analyses, which could be used to significantly increase the accuracy of translations. The weight attached to a morphological analysis ranks the possible linguistic interpretations of an input word, while the character string of the analysis itself carries the lemma of the input. With HFST, a weighted finite-state lexicon has been used to build a unigram tagger with 96-98% precision for Finnish words and word segments.
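
To illustrate how such weights could be used, here is a minimal sketch in C++ (hypothetical types and names, not actual lttoolbox code) of choosing between analyses when each candidate carries a cumulative weight treated as a penalty, for instance a negative log probability, so the lowest total weight wins:

 #include <cstddef>
 #include <iostream>
 #include <string>
 #include <vector>
 
 // Hypothetical result of a weighted morphological analysis: the analysis
 // string (lemma plus tags) and the sum of the weights on the path that
 // produced it. A lower weight means a more likely interpretation.
 struct WeightedAnalysis {
   std::string analysis;  // e.g. "wound<n><sg>" or "wind<vblex><past>"
   double weight;         // penalty, e.g. -log(probability)
 };
 
 // Return the candidate with the lowest accumulated weight.
 const WeightedAnalysis& best(const std::vector<WeightedAnalysis>& candidates) {
   std::size_t best_idx = 0;
   for (std::size_t i = 1; i < candidates.size(); ++i) {
     if (candidates[i].weight < candidates[best_idx].weight) {
       best_idx = i;
     }
   }
   return candidates[best_idx];
 }
 
 int main() {
   // Two hypothetical analyses of the surface form "wound".
   std::vector<WeightedAnalysis> candidates = {
       {"wound<n><sg>", 1.2},
       {"wind<vblex><past>", 0.7},
   };
   std::cout << best(candidates).analysis << '\n';  // prints wind<vblex><past>
 }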

I am planning to

- Implement arc weights in lttoolbox, with capabilities equivalent to HFST's

- Implement section weights in lttoolbox, with behaviour similar to arc weights; this enables higher accuracy and faster translation when an analysis is found in a section with a lower weight (penalty)

- Build tools to acquire/convert weights by training on data sets

- Test, analyse and improve the accuracy of weighted analysis

- Recursive paradigms (if time permits)

Arc and section weights would be supported in transducers and incorporated into both morphological analysis and generation.

Currently, a proof of concept of section weights is implemented: transducers are sorted by their weights, and analysis finishes as soon as a result is found in one of them, as sketched below.
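
A minimal sketch of that idea, with hypothetical types and names rather than the actual lttoolbox classes: sections are kept sorted by their weight, the lightest (least penalised) section is tried first, and analysis stops as soon as one section yields a result.

 #include <algorithm>
 #include <optional>
 #include <string>
 #include <vector>
 
 // Hypothetical stand-in for one section of a compiled dictionary:
 // a weight (penalty) plus an analysis function over that section's
 // transducer. A lower weight means the section is tried earlier.
 struct Section {
   double weight;
   std::vector<std::string> (*analyse)(const std::string& surface);
 };
 
 // Try sections in order of increasing weight and return the analyses
 // from the first section that recognises the surface form.
 std::optional<std::vector<std::string>>
 analyse_with_section_weights(std::vector<Section> sections,
                              const std::string& surface) {
   std::sort(sections.begin(), sections.end(),
             [](const Section& a, const Section& b) {
               return a.weight < b.weight;
             });
   for (const Section& s : sections) {
     std::vector<std::string> result = s.analyse(surface);
     if (!result.empty()) {
       return result;  // found in a lower-weight section; skip the rest
     }
   }
   return std::nullopt;  // no section recognised the word
 }

Arc weights would combine naturally with this: within a single section, each analysis would carry the sum of its arc weights, so results could also be ranked inside that section.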

Title: Adding Support for Weights in lttoolbox

Reason to Sponsor:

Google Summer of Code (GSoC) is the ideal program for me to learn more about software development and open source while contributing to the open source community. GSoC gives me access to resources, most importantly mentorship, which has been helping me challenge myself and make more substantial contributions to the open source community. One of the most difficult obstacles I have encountered recently during software development and research is the lack of high-performance computation capability in my hardware. The stipend would allow me to upgrade my hardware to support more intensive computation, and to learn and implement open source solutions and tools for machine translation (machine learning), graphics, and more.

Benefits:

Machine translation brings convenience to people and, more importantly, overcomes linguistic barriers encountered in daily communication. It enables written and spoken communication across different cultural backgrounds, not only within a society but around the whole world. In recent meetings with researchers working on text mining and Bible translation, I realized that the applications of machine translation could extend well beyond bringing convenience, to making a substantial impact on the lives of people across tribes, societies and cultures.

Transducers with weighted sections and lexicons can be used to reduce, or ideally eliminate, the impact of polluted data sets on analysis in machine translation. They can also be used to provide more accurate results based on the context of the text. lttoolbox is one of the most widely adopted sets of tools in Apertium, supporting the whole machine translation pipeline. Adding support for weights in lttoolbox will allow better performance in a free and open machine translation software. Weight acquiring and visualising tools further extend the support for weights in lttoolbox.

Moreover, my experience in GSoC will allow me to promote open source more effectively to my peers, especially to the community in Hong Kong. I have been one of the core team members of Open Source Hong Kong, which is dedicated to promoting open source culture in the city. I hope to share my experience and what I have gained through open source to encourage the younger generation to become part of this amazing community.

Work plan:

Pre-GSoC: Update and clean up the work done for the section-weight coding challenge, read documentation and get familiar with the Apertium translation pipeline; get ahead where possible

Community bonding will be used to study the Apertium documentation, read articles on NLP/translation/statistics, and start ahead on Week 1's work

Week 1: Proof-of-concept of section weight analysis & generation

Week 2: Implement section weight analysis & generation

Week 3: Testing of section weight + Buffer

Week 4: Proof-of-concept of arc weight + Implement arc weight analysis

Deliverable #1: Section weight

Week 5: Implement arc weight analysis & generation

Week 6: Testing of arc weight

Week 7: Buffer + Research, implementation and testing of weight acquiring tools

Week 8: Implementation and testing of weight acquiring tools

Deliverable #2: Arc weight

Week 9: Implementation and testing of weight acquiring tools

Week 10: Buffer + Research in possible methods to improve accuracy and speed of weighted analysis

Week 11: Proof-of-concept / Implementation (if any)

Week 12: Proof-of-concept / Implementation (if any) + Buffer

Project completed: Weight acquiring tools + Improvements in accuracy and speed

If time permits: Recursive paradigms

During the implementation and testing of the weight acquiring tools, a visualiser tool will likely be built alongside them for more convenient testing.

The above schedule is tentative and subject to change in unforeseen situations: for example, a proof of concept could be completed quickly and the implementation moved forward, or debugging may take longer, in which case the buffer days before each deliverable will be used.

I will be doing an internship during the summer. My time management currently allows me to complete 20 hours of classes, 38 hours of assignments and study, 16 hours of research and 6 hours of entrepreneurship development each week. My working hours are flexible, and the hours off from work will be dedicated to GSoC, exercise and socialising.

I will be in UTC-05:00 timezone throughout the summer.

Bio:

Greetings! I’m Alvis Wong from the University of Waterloo, a candidate for a Bachelor of Computer Science and a Bachelor of Business Administration, a member of Waterloo's leading entrepreneurship programme Velocity, and currently an Undergraduate Research Assistant studying the implementation of iterative solvers for sparse linear systems using CUDA. I have a wide range of interests, not only in academic disciplines such as Computer Science, Physics and Philosophy, but also in hobbies such as hiking, drawing and reading.

I am a self-motivated student who is always up for a challenge. Being in arguably the busiest double-degree programme at my school does not stop me from pursuing my interests in various areas. Working as an Undergraduate Research Assistant as a first-year student was intimidating at first, but under great supervision from my professor I completed the research project and achieved a 10x speedup for solving Poisson equations on a CUDA GPU. I have developed a solid foundation in Computer Science algorithms and mathematical theory, with a recent focus on parallel computing, graphics, statistics and data analytics. In statistics and data analytics, I have learnt statistical concepts and programmed analyses through Harvard’s edX course “Data Analysis for Life Sciences”. I look forward to applying these skills in real-life applications.

I have years of experience in open source, contributing code, design and documentation to large organisations such as FOSSASIA, OpenMRS, Drupal and KDE. I have acquired the skills to collaborate effectively on complex and large-scale systems. As a member of Open Source Hong Kong, I have been attending conferences, local and overseas, as well as organising local community events, including hackathons, workshops and two of the largest IT conferences in the city, Hong Kong Open Source Conference and PyCon HK.