User:Andrei zene/Applications


Accent and diacritics restoration by Andrei Zene, Apertium, 2011

Problem description
Machine translation is a delicate problem, especially when the input comes from chat or internet search, because users often leave out diacritics and accents. Apertium then cannot accurately translate words that should contain them. The aim of my project is to restore the diacritics and accents in an input text. Users will get the translation they need more quickly, since they will no longer have to type diacritics and accents, and Apertium will be easier to integrate with instant messaging applications, search engines and other settings in which diacritics and accents are often omitted or mistyped.

Implementation plan
I have always thought it is good to know how the “wheel” was built, and it is even fun to try to build one yourself, but that is something for your free time. For a project that could be used all over the world, perhaps by millions of users, I think you should not build the “wheel” again. That is why I would like to use the charlifter project as a starting point (charlifter has been trained and is able to restore accents and diacritics for more than 100 languages). The implementation plan consists of two stages:
• Porting charlifter from Perl to C++
• Optimizing the smoothing of the statistical models from the charlifter project on a language-by-language basis
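To make the restoration task concrete, here is a minimal sketch of a word-level approach: strip the diacritics from a training corpus, remember which accented forms each stripped form came from, and restore by picking the most frequent one. This is an illustration only and is not charlifter's actual algorithm; the function names (`strip_diacritics`, `train`, `restore`) and the tiny Romanian corpus are made up for the example, and case and punctuation handling are ignored.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove combining marks, e.g. 'școală' becomes 'scoala'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(corpus_words):
    """Map each stripped form to a Counter of the accented forms seen for it."""
    model = defaultdict(Counter)
    for word in corpus_words:
        model[strip_diacritics(word)][word] += 1
    return model


def restore(text, model):
    """Replace each whitespace-separated word with its most frequent accented form.

    Words never seen in training are returned unchanged.
    """
    restored = []
    for word in text.split():
        candidates = model.get(word)
        restored.append(candidates.most_common(1)[0][0] if candidates else word)
    return " ".join(restored)


# Hypothetical toy corpus: 'fată' (girl) appears twice, 'față' (face) once.
corpus = ["în", "școală", "și", "în", "țară", "fată", "fată", "față"]
model = train(corpus)

print(restore("in scoala si in tara", model))
print(restore("fata", model))
```

The second call shows the weakness of a pure frequency lookup: "fata" always becomes "fată", even where "față" or the unaccented "fata" was meant, and unseen words pass through untouched. Handling such cases is presumably where context and the smoothing of the statistical models come in.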



Skills
I am a first-year student in the Computer Science department of the Technical University of Cluj-Napoca. Over the years I have developed the following skills:
• C – I learned it in my first semester at university (I was already familiar with C++).
• C++ – I learned it in high school.
• Algorithmic thinking – developed in high school.
• OOP – I taught myself some OOP last year while building a spell-checker for the Romanian language (Dr. Text) as my final school project (2010). I entered the spell-checker in a national open-source software applications contest (Infoeducatie), where it ranked 8th out of 35. The spell-checker was written in C++.
• Knowledge of linguistic issues – I ran into these while building the spell-checker.

Native skills:
• I can learn things quickly. When I started working on the spell-checker (February 2010), all I knew was how to write a console application in C++ and the Levenshtein distance algorithm. I did not have the slightest idea how I was going to build it, but I started to read and to work, and in May my project was able to compete in the regional phase of the contest mentioned above and qualified for the national phase.
• Good at spelling. I don’t know how or why, maybe I inherited it from my parents, but so far I have learned three languages (Romanian, English, German) and have not had problems with spelling in any of them. I admit that I have problems with grammar in English and German, but spelling is my area. I think this is an important skill to have in a project like “Restoration of missing diacritics and accents”.


Skills I would like to develop in the near future:
• Perl – I would like to learn Perl because I find it interesting and because I need it to port the charlifter project to C++.
• FST – This semester we started learning about finite-state automata in our Digital System Design course. I think this will help me understand finite-state transducers better.
• Statistical restoration algorithms

With all these skills, and with the help of my mentor, I think I will be able to finish my project successfully.

Deliverables
• The charlifter project, optimized and integrated into Apertium.
• Tests.

Timeline proposal
• Before April 25
  o Get used to the Apertium and charlifter source code.
• April 25 – May 30
  o Familiarize myself with Perl, write some Perl code, and port parts of charlifter from Perl to C++.
  o Understand the algorithms behind charlifter.
• May 30 – June 19
  o I will take a break to study for and take my exams. During this period I will read the mailing list, keep in touch with my mentor, and code as much as my schedule allows.
• June 20 – July 5
  o Hopefully I will have passed all my exams and be free to code.
  o Finish porting charlifter from Perl to C++.
  o Integrate charlifter into Apertium.
  o From now on until the end of summer: code more hours per day to make up for the time “lost” to the exams.
• July 5 – July 15
  o Test and evaluate.
• July 16 – August 15
  o Improve the smoothing of the statistical models for as many languages as I can.
  o This period also includes testing, because I cannot know whether I have improved the models without testing.
• August 15 – August 22
  o A backup week reserved for anything unforeseen.
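One way the "test and evaluate" steps above could be measured, sketched here under assumptions of my own rather than as an agreed evaluation plan: take a correctly accented text, strip its diacritics, run the restorer on the result, and count how many words come back exactly right. The names `word_accuracy` and `identity` and the sample sentence are hypothetical; `restorer` stands for any function that maps unaccented text to accented text.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'țară' becomes 'tara'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def word_accuracy(gold_text, restorer):
    """Strip the gold text, restore it, and return the fraction of words recovered exactly."""
    gold_words = gold_text.split()
    restored_words = restorer(strip_diacritics(gold_text)).split()
    if len(gold_words) != len(restored_words):
        raise ValueError("restorer changed the number of words")
    if not gold_words:
        return 0.0
    correct = sum(g == r for g, r in zip(gold_words, restored_words))
    return correct / len(gold_words)


def identity(text):
    """Baseline restorer that leaves the text unchanged."""
    return text


# Hypothetical gold sentence: 3 of its 6 words contain no diacritics.
gold = "fata merge la școală în țară"
baseline = word_accuracy(gold, identity)
print(baseline)
```

The do-nothing baseline matters because many words have no diacritics at all, so a restorer is only useful to the extent that it beats this number. Comparing the score before and after a smoothing change, language by language, would show whether the change actually helped.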

Non-Summer-of-Code plans
My only non-Summer-of-Code plan is to pass all my exams, which I mentioned in the timeline proposal.

Why should Google and Apertium sponsor this project?
As a native Romanian speaker I have to admit that most Romanians do not type diacritics, and I think the same is true of speakers of other languages. One reason why Google and Apertium should sponsor this project is that it would work around the diacritics and accents problem and would be of great use to users. Secondly, I think this project will greatly increase the applicability of Apertium by making it possible to integrate it with instant messaging, IRC, search engines, website translation, etc.