User:Commial/GSoCApplication2011


Email: camille.mougey@ensimag.fr

MOUGEY Camille

First Year ENSIMAG (Grenoble)

Address : mougeyc@ensimag.fr

IRC : commial/ajax

Website : [1]

Blog : [2]




Application for : APERTIUM : Improvements to postedition interface




Why is it you are interested in machine translation?

Currently I'm at a school with many exchange programs, so I continually see a mix of societies, manners and cultures, but the most “visible” mix is that of languages. Everyone needs to be understood, and of course we don't have the time, or the inclination, to learn a language just for one job or one e-mail reply. Machine translation becomes necessary, and it has to work as well as possible and be as simple as possible to use.

In addition, machine translation reaches an aspect that most applications do not: a strong link with humans. Indeed, we ask the computer to mimic humans in their own domain, language.

But this is just my point of view; I think the people participating in the project all have different and varied reasons :) .


Why is it that you are interested in the Apertium project?

I'm really enthusiastic about helping the community advance, because although I use open source software, I have never had the opportunity to take part in the adventure.

I chose this project for two main reasons:


-> I think that if we want a project to develop, it has to “touch” many people. Most people aren't accustomed to downloading, installing, configuring and using a shell to obtain a result. Currently, with the development of the cloud, most people want online services, accessible with just a click. That's why, for me, the Web Interface is very important and has to use all the power of the tool behind it, namely Apertium.

-> My skills concern web development, particularly PHP and Javascript (and the mix: AJAX). I really like developing with these languages and, for this reason, I have accumulated experience in this kind of development.


Which of the published tasks are you interested in? What do you plan to do?

I want to apply for the Google Summer of Code project named “Improvements to the Advanced Web Interface”. Below is my planning for the summer.

Because of an end-of-year school project, I would start on June 10.


Date Plan to do
Week 1 More work hours are planned to make up for lost weeks.


Port the code to all recent PHP versions; along the way, finish familiarizing myself with the code

For example:


               if($_FILES["in_doc"] AND !($_FILES["in_doc"]["error"] > 0))
               file_put_contents

becomes:

               if(isset($_FILES["in_doc"]) and !empty($_FILES["in_doc"]) ...
               fopen, fwrite, fclose


Rewrite the Javascript as separate modules so that it is easy to decide which tools to enable or disable in the interface

-> Make a dependency tree of the functions in the current libraries

-> Write more generic functions and procedures to provide a foundation for the modules

-> Make a user interface to enable/disable modules (list of modules, recommended modules with descriptions, usage examples)
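
As a rough illustration of the module idea above, here is a minimal sketch of a registry that records each module's dependencies and enables or disables modules accordingly. The module names (core, spellcheck, grammar) and all function names are hypothetical and are not taken from the existing interface code.

```javascript
// Hypothetical registry: names and fields are illustrative only.
const registry = new Map();

function defineModule(name, { requires = [], description = "" } = {}) {
  registry.set(name, { name, requires, description, enabled: false });
}

// Enable a module together with everything it depends on
// (depth-first walk of the dependency tree).
function enableModule(name, seen = new Set()) {
  const mod = registry.get(name);
  if (!mod) throw new Error(`Unknown module: ${name}`);
  if (seen.has(name)) return; // guards against circular dependencies
  seen.add(name);
  mod.requires.forEach((dep) => enableModule(dep, seen));
  mod.enabled = true;
}

// Disable a module and every enabled module that depends on it.
function disableModule(name) {
  const mod = registry.get(name);
  if (!mod) throw new Error(`Unknown module: ${name}`);
  mod.enabled = false;
  for (const other of registry.values()) {
    if (other.enabled && other.requires.includes(name)) disableModule(other.name);
  }
}

function enabledModules() {
  return [...registry.values()].filter((m) => m.enabled).map((m) => m.name);
}

defineModule("core");
defineModule("spellcheck", { requires: ["core"], description: "Underline unknown words" });
defineModule("grammar", { requires: ["core", "spellcheck"], description: "Grammar hints" });

enableModule("grammar");      // also enables core and spellcheck
console.log(enabledModules());
disableModule("spellcheck");  // also disables grammar, which depends on it
console.log(enabledModules());
```

The description field is where the text shown in the enable/disable user interface could live; how the interface renders it is left open here.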


Week 2 More work hours are planned to make up for lost weeks.


Rewrite language.php file as an abstract script, and interface modules for Apertium, Aspell and LanguageTool.

-> Separate the translation system and the environment management system

-> Make the translation system a PHP object, which is initialised with language pairs

-> Extend the environment management system to allow interface modules to be written for Aspell and LanguageTool

-> Write these modules
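
The separation described above could look roughly like the following. It is only a sketch of the structure: the plan is to write this part in PHP, and it is shown in Javascript here purely for illustration. The class names, the run() convention and the stub tools are all assumptions; no real Apertium, Aspell or LanguageTool call is made.

```javascript
// Translation system: an object initialised with the language pairs it supports.
class Translator {
  constructor(pairs, backend) {
    this.pairs = new Set(pairs); // e.g. ["en-es", "es-ca"]
    this.backend = backend;      // function (pair, text) -> translated text
  }
  supports(pair) {
    return this.pairs.has(pair);
  }
  translate(pair, text) {
    if (!this.supports(pair)) throw new Error(`Unsupported pair: ${pair}`);
    return this.backend(pair, text);
  }
}

// Environment manager: keeps every external tool behind the same small interface.
class Environment {
  constructor() {
    this.tools = new Map();
  }
  register(name, tool) {
    if (typeof tool.run !== "function") throw new Error(`${name} must expose run()`);
    this.tools.set(name, tool);
  }
  run(name, ...args) {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`Tool not registered: ${name}`);
    return tool.run(...args);
  }
}

// Stub tools standing in for the real programs (hypothetical behaviour).
const env = new Environment();
env.register("spell", { run: (lang, text) => text.split(" ").filter((w) => w === "teh") });
env.register("grammar", { run: (lang, text) => [] });

const translator = new Translator(["en-es"], (pair, text) => `[${pair}] ${text}`);
console.log(translator.translate("en-es", "Test"));
console.log(env.run("spell", "en", "fix teh bug"));
```

With this split, adding a new checker means registering one more object with a run() method, without touching the translation object.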


Provide more formatting modules; currently only ODF, OOXML, html and text are supported. Mediawiki (using apertium-mediawiki) and others are wanted.

-> Add a module for Rich Text Format formatting (using Apertium's existing modules)

Week 3 More work hours are planned to make up for lost weeks.

(continued) -> Add a module for Mediawiki formatting (using Apertium's existing modules)

-> Test PDF formatting with pdf2html on a set of PDF files, and test the reconstruction step

-> If the tests are inconclusive, write the PDF module

-> Provide a module using Tesseract (which is able to recognize multiple languages and maintain a basic layout) for pictures. The HTML export functionality can be used (jointly with the html module).

Localisation, make it possible to translate the interface into different languages.

-> The locale is given by the browser or the IP address, or set by the user

-> The choice is saved (cookies, ...)

-> The interface texts are loaded from files, which contain the text for every button, checkbox, ... in a specific order (to make parsing faster than with an XML format)

As was said, people who write a language file should share it with the Apertium community.
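
To make the localisation plan above more concrete, here is a minimal sketch of the two steps: choosing a locale (saved user choice first, then the browser's preferences, then a default) and reading a flat language file where each line is the text of one widget in a fixed order. The widget keys, the file contents and the function names are assumptions made for the example.

```javascript
// Assumed widget order shared by every language file (hypothetical keys).
const WIDGET_ORDER = ["translateButton", "spellcheckBox", "uploadLabel"];

// Turn a flat language file (one text per line, fixed order) into a lookup table.
// A missing or empty line falls back to the widget key itself.
function parseLanguageFile(content) {
  const lines = content.split(/\r?\n/);
  const texts = {};
  WIDGET_ORDER.forEach((key, i) => {
    texts[key] = lines[i] !== undefined && lines[i] !== "" ? lines[i] : key;
  });
  return texts;
}

// Pick a locale: saved user choice first, then browser preferences, then a default.
function chooseLocale(saved, browserLanguages, available, fallback = "en") {
  const candidates = [saved, ...browserLanguages].filter(Boolean);
  for (const candidate of candidates) {
    const base = candidate.toLowerCase().split("-")[0]; // "fr-FR" -> "fr"
    if (available.includes(base)) return base;
  }
  return fallback;
}

// Made-up language files for the example.
const files = {
  en: "Translate\nCheck spelling\nUpload a document",
  fr: "Traduire\nVérifier l'orthographe\nEnvoyer un document",
};

const locale = chooseLocale(null, ["fr-FR", "en"], Object.keys(files));
const texts = parseLanguageFile(files[locale]);
console.log(locale, texts.translateButton);
```

In a browser, the second argument could come from navigator.languages and the saved choice from a cookie; the IP-based guess mentioned above is not covered by this sketch.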


Improve overall design

-> The current design is very basic ( or non-existent )

-> The Apertium website will move to WordPress, so its overall design can be reused

-> If the website isn't on WordPress yet, make a design similar to the current Apertium website (in colour, simplicity, ...)

Fix possibly remaining bugs

-> It's a long-term task, but at this stage, I want to do a complete test of the system

-> This includes:

- Fix remaining bugs

Week 4 More work hours are planned to make up for lost weeks.

(continued)

- Make some test sets

- Optimize code

- Fix security issues


Make it possible to input a TMX to help for a translation (either with Apertium's TMX input system, or an external tool like OmegaT)

-> Ask Sergio Ortiz about the integration of TMX to identify and translate segments from a translation memory

-> Implement it

Use existing server-side TMX database, so that the memory generated after a translation is stored and reused automatically for the next translations in this language pair.

(It might be wise to add some kind of validation too, to make sure that people don't mess with the whole system by submitting wrong translations...)


The main idea is to let the user supply or alter current translations and export them to TMX format. At the same time, these modifications are saved in a server-side TMX database, which contains two kinds of dictionaries per language:

- Translations submitted and awaiting approval

- Approved translations

-> Add it to the user interface with simply a checkbox "Reuse old translations to improve translation" and another one "Share translation results" (this choice is important because of confidentiality issues with content submitted to the engine)

-> Perhaps there is a way here to use the logging system, possibly editing it so that it can detect what the user changes in the translation (as Igor Chtivelband said).

It seems necessary to save the context too.
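
The two-dictionary idea above can be sketched as follows: submitted units wait in a pending list, only approved units are reused, and approved units can be exported as TMX. This is an in-memory stand-in, not the existing server-side database; the class, its methods and the "en-es" style pair keys are assumptions, and how a unit gets approved is deliberately left open.

```javascript
// Hypothetical in-memory stand-in for the server-side TMX database.
class TranslationMemory {
  constructor() {
    this.pending = new Map();  // language pair -> units awaiting approval
    this.approved = new Map(); // language pair -> approved units
  }

  submit(pair, source, target) {
    if (!this.pending.has(pair)) this.pending.set(pair, []);
    this.pending.get(pair).push({ source, target });
  }

  // Move one pending unit to the approved list.
  approve(pair, source) {
    const queue = this.pending.get(pair) || [];
    const index = queue.findIndex((u) => u.source === source);
    if (index === -1) return false;
    const [unit] = queue.splice(index, 1);
    if (!this.approved.has(pair)) this.approved.set(pair, []);
    this.approved.get(pair).push(unit);
    return true;
  }

  // Only approved units are reused for later translations.
  lookup(pair, source) {
    const unit = (this.approved.get(pair) || []).find((u) => u.source === source);
    return unit ? unit.target : null;
  }
}

function escapeXml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Export units for one pair as a minimal TMX 1.4 document.
function toTmx(pair, units) {
  const [src, tgt] = pair.split("-");
  const body = units.map((u) =>
    `    <tu>\n` +
    `      <tuv xml:lang="${src}"><seg>${escapeXml(u.source)}</seg></tuv>\n` +
    `      <tuv xml:lang="${tgt}"><seg>${escapeXml(u.target)}</seg></tuv>\n` +
    `    </tu>`
  ).join("\n");
  return `<?xml version="1.0" encoding="UTF-8"?>\n<tmx version="1.4">\n` +
    `  <header creationtool="sketch" creationtoolversion="0" segtype="sentence" ` +
    `o-tmf="none" adminlang="en" srclang="${src}" datatype="plaintext"/>\n` +
    `  <body>\n${body}\n  </body>\n</tmx>`;
}

const tm = new TranslationMemory();
tm.submit("en-es", "Hello & welcome", "Hola y bienvenido");
console.log(tm.lookup("en-es", "Hello & welcome")); // null: still awaiting approval
tm.approve("en-es", "Hello & welcome");
console.log(tm.lookup("en-es", "Hello & welcome")); // now reusable
console.log(toTmx("en-es", tm.approved.get("en-es")));
```

The "Share translation results" checkbox would decide whether submit() is called at all, and saving the context would mean storing more than the bare source and target strings in each unit.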

Week 5 (continue and finish)

According to Jimmy O'Regan, it seems that it's a difficult task. Time is needed.

Again, fix any remaining bugs to have a good foundation for what follows.

July 15 Mid Evaluation
Week 6

Provide a module to use the apertium.org web service instead of a local Apertium installation

-> Make a bridge between the apertium.org website and the local Apertium installation:

- perhaps by parsing the translation result page with regular expressions (simulating the user's input on the website and analysing the result), but this is expensive

- another option is to use an API for the apertium.org web service, which would avoid the response analysis phase

To make my explanation easier to understand, here is an example :

To translate "Test" on the website, you have to build a web POST request

The server returns an HTML page, so you have to search for the text between > and </textarea><label for="mark"> (perhaps with a more accurate search).

But if an API is written, it allows the script to simply take the server response, because it will contain only the translation, without HTML formatting.

-> Add the option for the user to use this service in the interface
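
The page-parsing option described above can be sketched like this: a regular expression pulls the text out of the result textarea, and basic HTML entities are decoded. The sample page is made up for the example, and the real apertium.org markup may differ, which is exactly why this approach is fragile compared to an API.

```javascript
// Decode the few entities that can appear inside a textarea.
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
}

// Return the content of the textarea that is directly followed by
// <label for="mark">, or null when the page does not match.
function extractTranslation(html) {
  const match = html.match(/<textarea[^>]*>([^<]*)<\/textarea>\s*<label for="mark">/);
  return match ? decodeEntities(match[1]) : null;
}

// Hypothetical result page: field names and layout are assumptions.
const samplePage =
  '<textarea name="text">Test</textarea>\n' +
  '<textarea name="result">Prueba &amp; error</textarea>\n' +
  '<label for="mark">Mark unknown words</label>';

console.log(extractTranslation(samplePage));
console.log(extractTranslation("<p>Service unavailable</p>"));
```

With an API, extractTranslation() would disappear entirely: the script would use the response body as the translation, and a change in the page layout could no longer break it.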


Make it faster and cross-browser compatible

-> Add libraries to the existing JS modules to adapt the system to recent browsers. We can go down to IE 6.

-> Make it faster by analysing the time-critical path

-> Make the different libraries download faster by reducing their size (as Google does for jQuery)

Week 7

(continue and finish)

This is a long task that needs time.

Week 8

Improve integration with Wikipedia: it should be able to fetch pages, translate them, allow them to be revised and then publish them.

-> Integrate a module to recognize the Wikipedia format

-> Use WikiBhasha to fetch pages and post them


Provide modules to integrate alternative tools for spell and grammar checking (AfterTheDeadline, etc.)

-> Produce outputs during the translation process, which can be redirected to other tools

-> Define the possible interaction with the software

-> Implement these interactions

Week 9

(continue and finish)

Fix new or remaining bugs

-> Stand back

Week 10

(continue and finish)

August 16

Suggested 'pencils down' date. Take time to scrub code, write tests, improve documentation, etc.

August 26

End



List your skills and give evidence of your qualifications

Currently, I'm in the first year of the ENSIMAG ( [3] ), a school in computer engineering and applied mathematics.

I have already developed some website projects (like [4], [5], [6], [7]) and done some web security audits (like the CMS kwsphp: [8]).

I have development skills in PHP, HTML/CSS, Javascript/AJAX, C, Python and Ada, and notions of algorithmics (such as cost evaluation, data structures, ...).


List any non-Summer-of-Code plans you have for the Summer

Apart from the end-of-year school project (mentioned above) and a 4-day trip with friends in August, nothing is planned for this summer.