User:Blanda.alex


Google Summer of Code 2012 Application - adopting a new language pair fr-ro

Contact information

Name: Alexandru Blanda

Email: blanda.alexandru@gmail.com

Alternative Email: ioan.blanda@cti.pub.ro

IRC: blanda (on #apertium)

Phone: +40 745628343

Why is it you are interested in machine translation?

As a Computer Science student, I believe that Artificial Intelligence is one of the most intriguing paths one can take, and I consider machine translation to be one of its great applications. My interest also comes from the fact that my final thesis next year will be a project related to natural language processing and pattern recognition. Helping to develop a machine translation system would therefore be a valuable and very useful experience.

Why is it that you are interested in the Apertium project?

At first I was drawn to the topic itself: programming and linguistics are a great combination. As stated earlier, I have a special interest in natural language processing, and before entering university I took part in a few extracurricular activities related to linguistics. I found that the Apertium project has comprehensive, well-maintained documentation that is of great help, and that the community is very responsive to whatever problems I run into. These aspects made me consider Apertium a great choice for applying to GSoC 2012.

Which of the published tasks are you interested in?

Adopting an orphaned language pair: French-Romanian (fr-ro)

Why should Google and Apertium sponsor it?

The language pair I would like to develop would be, in my opinion, a great addition to the Apertium set of language pairs. The fact that some work has already been done on the pair is a major advantage and increases the project's chances of success. Also, given that the selected language pair is not very well represented in other free machine translation systems (see resources [1] and [2] for a list of the machine translation systems considered), I believe that a fr-ro pair would be a very useful release.

How and whom will it benefit?

Since I intend to write very thorough documentation, I believe that the language pair could serve as a tutorial for developing other language pairs inside the Apertium project, such as Italian-Romanian or Aromanian-Romanian. I also believe that this project will offer a valuable free educational and cultural tool, which is in the spirit of open source and of the Apertium organization.

What do you plan to do?

I have several aspects in mind that I plan to implement during the project:

- build the French monodictionary;

- clean and repair the existing code (tags, paradigm definitions);

- add entries to the bilingual dictionary, as well as to the Romanian monodictionary;

- create scripts for multiple purposes: adding data, testing;

- work on transfer rules;

- solve disambiguation problems;

- produce comprehensive documentation, so that the language pair can be easily maintained;

Work already done

- installed Apertium and prepared the necessary environment for developing an Apertium project;

- familiarized myself with the Apertium system by working on the coding challenge;

- got to know the community and got used to the means of communication;

- read part of the available documentation;

Proposed schedule

Before the coding period

- practice working with the Apertium system;

- stay connected with the community in order to find the best solutions to emerging questions and problems;

- search for online resources regarding the languages;

- think of and try to implement auxiliary tools and scripts that would be useful to the project;

During the coding period

Week 1-3: (21.05-10.06)

- investigate the .metadix fr dictionary, already present in the apertium-fr-ro module from the incubator

- investigate other language pairs that contain a fr monolingual dictionary

- build the fr monolingual dictionary (add paradigms, add entries)

- scripts for adding data

- scripts for testing entries in monolingual dictionaries

- documentation: comments on data files and scripts produced

Deliverable 1: Functional fr monolingual dictionary
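
As an illustration of what a "script for adding data" could look like, here is a minimal sketch in Python, not a finished tool. It builds a single `<e>` entry for a monolingual .dix file from a lemma and a paradigm name. The function name `make_entry` and the paradigm name used in the usage line are my own assumptions for illustration; the sketch also assumes single-word lemmas and the usual Apertium convention that the part of the paradigm name after "/" is the ending supplied by the paradigm.

```python
import xml.etree.ElementTree as ET


def make_entry(lemma: str, paradigm: str) -> str:
    """Return one <e> element (as a string) for a monolingual .dix file.

    Assumption: paradigm names look like "abeill/e__n", where the text
    after "/" is the ending the paradigm adds to the stem.
    """
    model = paradigm.split("__", 1)[0]
    ending = model.split("/", 1)[1] if "/" in model else ""
    if ending and not lemma.endswith(ending):
        raise ValueError(f"{lemma!r} does not fit paradigm {paradigm!r}")

    # The stem is the lemma without the ending supplied by the paradigm.
    stem = lemma[: len(lemma) - len(ending)]

    entry = ET.Element("e", {"lm": lemma})
    ET.SubElement(entry, "i").text = stem
    ET.SubElement(entry, "par", {"n": paradigm})
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical paradigm name, used only for illustration.
    print(make_entry("table", "abeill/e__n"))
```

Rejecting lemmas that do not end in the paradigm's ending catches the most common mistake when adding entries in bulk, namely assigning a word to the wrong paradigm.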

Week 4-7: (11.06-08.07)

- check for problems in the bilingual dictionary and in the monolingual ro dictionary

- clean and repair problems in the bilingual dictionary and in the monolingual ro dictionary, regarding both the entries and the structure of the dictionaries (tags, paradigm definitions)

- add new entries to the monolingual ro dictionary

- add new entries to the bilingual dictionary

- scripts for testing translations between fr-ro and ro-fr

- documentation: comments on data files and scripts produced

Deliverable 2: Functional ro monolingual dictionary and bilingual dictionary (in time for midterm evaluation)
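
The translation-testing scripts mentioned above could take roughly the following shape. This is only a sketch under stated assumptions: the helper names (`apertium_translate`, `run_regression`) are hypothetical, the default translator assumes the `apertium` command is installed and that the pair has been compiled in the given directory, and the test cases would come from a hand-checked list of sentence pairs.

```python
import subprocess
from typing import Callable, Iterable, List, Tuple


def apertium_translate(text: str, pair: str = "fr-ro", directory: str = ".") -> str:
    """Translate text by calling the apertium command-line tool.

    Assumption: the language pair is compiled in `directory`.
    """
    result = subprocess.run(
        ["apertium", "-d", directory, pair],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def run_regression(
    cases: Iterable[Tuple[str, str]],
    translate: Callable[[str], str] = apertium_translate,
) -> List[Tuple[str, str, str]]:
    """Return (source, expected, actual) for every case that fails."""
    failures = []
    for source, expected in cases:
        actual = translate(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures
```

Passing the translator in as a parameter keeps the comparison logic testable without a working installation, and lets the same script cover both the fr-ro and ro-fr directions by changing only the `pair` argument.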

Week 8-10: (09.07-29.07)

- investigate existing transfer rules files

- identify and write needed rules

- identify types of disambiguation problems in both languages

- solve disambiguation problems

- work on the .tsx file

Deliverable 3: Updated transfer rule files, .tsx file

Week 11-12: (30.07-12.08)

- manual and automated testing of the system

- solve problems that may arise after testing

- write documentation: readme files and logs

Deliverable 4: Release-quality fr-ro language pair and documentation (in time for suggested "pencils down" date)

Week 13: (13.08-19.08)

- improve documentation, make last-minute modifications to the code

Final submission

Bio

I am a third-year undergraduate student in Computer Science at the Faculty of Automatic Control and Computers, Polytechnic University of Bucharest.

I have worked mainly with C/C++, Java and Python, but I also have basic knowledge of PHP, JavaScript, XML, HTML and CSS. I have also worked in Octave (Matlab) and with some functional programming languages (Haskell, Clips, Scheme). On topics more closely related to machine translation systems, I have taken courses that dealt with formal languages, finite state machines and automata theory (as an example application, I would mention implementing a parser using flex).

I cannot say that I have worked on a real open-source project before. However, I am more than familiar with the open-source philosophy, particularly because I have worked mostly with open-source tools and technologies, but also because our university has made it a goal to encourage any activity related to open source. In addition, I can mention an open-source project that I started for a contest, which is still at a very early stage (a brief description here: http://ceata.org/proiecte/get-involved/wiki).

Non-GSOC activities

GSoC 2012 would be my primary concern this summer. The only non-GSoC activity during the coding period would be my final exams, which end on 30 May (that means a little over a week of conflicting activities: 21-30 May). However, I am confident that I will be able to dedicate a minimum of 35 hours per week during the summer to working on the Apertium project.

Resources

[1] http://en.wikipedia.org/wiki/Machine_translation#Applications

[2] http://en.wikipedia.org/wiki/Comparison_of_machine_translation_applications

Coding challenge

[1]

- only two sentences translated

- need to solve problem with definite article from fr to ro

- need to solve ambiguity problem with verbs that have the same form for multiple persons
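
To make the verb ambiguity concrete, here is a small Python sketch that reads one lexical unit in the Apertium stream format (`^surface/analysis/analysis$`) and checks whether its analyses differ only in the person tag. The function names are my own, the example unit is a simplified, hand-written analysis rather than real analyser output, and escaped characters in the stream are not handled.

```python
import re

# Person tags as used in Apertium morphological analyses.
PERSON_TAGS = {"p1", "p2", "p3"}


def parse_lexical_unit(unit):
    """Split '^surface/lemma<tag>.../lemma<tag>...$' into its parts.

    Simplification: escaped '/', '^' and '$' characters are not handled.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("not a lexical unit: " + unit)
    surface, *analyses = unit[1:-1].split("/")
    parsed = []
    for analysis in analyses:
        lemma = analysis.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", analysis)
        parsed.append((lemma, tags))
    return surface, parsed


def differs_only_in_person(analyses):
    """True if there are several analyses that match once person is ignored."""
    if len(analyses) < 2:
        return False
    stripped = {
        (lemma, tuple(t for t in tags if t not in PERSON_TAGS))
        for lemma, tags in analyses
    }
    return len(stripped) == 1


if __name__ == "__main__":
    # Hand-written example: French "parle" as 1st or 3rd person singular.
    unit = "^parle/parler<vblex><pri><p1><sg>/parler<vblex><pri><p3><sg>$"
    surface, analyses = parse_lexical_unit(unit)
    print(surface, differs_only_in_person(analyses))
```

A check like this could help list the verb forms in a test corpus for which the person has to be recovered from context, for example from the subject pronoun.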