User:Littleowl/Littleowl pet

Apertium: Post-editing tool

GSoC on-line application

Abstract

My proposal is to create a web interface (a post-editing tool) built on the existing web viewer and web services of the Apertium project, reusing an existing open-source WYSIWYG web editor, either by adapting its functionality for the interface or by using its existing API.

The tool contains a list of features and linguistic resources that facilitate the post-editing of translated texts, such as spell checkers, on-line dictionaries, etc. The list is subject to feedback from Apertium users.

Content

Name: Carles Sanz Casañas

E-mail address: carles.sanz@pangea.org

Other information that may be useful to contact you:

Why is it you are interested in machine translation?

I live in Catalonia, where there are two official languages, Catalan and Spanish. Documentation is therefore always in Catalan or Spanish, or in English for international purposes. I believe that machine translation systems are key tools for improving communication within and between organizations in Catalonia and around the world.

Why is it that you are interested in the Apertium project?

I am really interested in Apertium because it is an open-source platform for the purpose mentioned above. I also like the democratic spirit of open-source projects, and I would be very excited to take the opportunity to collaborate on this kind of project.

Which of the published tasks are you interested in? What do you plan to do?

Title

Post-editing tool

Why Google and Apertium should sponsor it

It is a tool to speed up the revision of Apertium translations, so it will reduce the cost and time of many translations.

Building on the existing web viewer[1] and web services[2] of the Apertium project, and reusing an existing open-source WYSIWYG web editor[3] adapted for the interface, the post-editing tool will provide the user with a series of features that facilitate the post-editing of translated texts. Furthermore, the Apertium community has a couple of web-based interfaces in PHP up and running, so that code could also be reused, and the Google Web Development Kit could speed up development as well.

The post-editing tool makes it possible to translate texts and edit them before Apertium's reformatter is applied, using a graphical web interface and the Apertium stream format[4]. This is possible thanks to the existing deformatter and reformatter tools within the Apertium project[5].
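
As an illustration of editing before the reformatter, the sketch below separates deformatted text into editable text and format blocks. It assumes that the deformatter wraps format information in square brackets and escapes reserved characters with a backslash; the function name `split_superblanks` and the sample string are hypothetical and not part of any Apertium tool.

```python
import re

# One token is either a run of editable text or a bracketed format block.
# A backslash escapes the character that follows it in both cases.
TOKEN = re.compile(
    r"(?P<text>(?:\\.|[^\[\\])+)"          # editable text
    r"|(?P<format>\[(?:\\.|[^\]\\])*\])"   # format block kept untouched
)

def split_superblanks(deformatted):
    """Return a list of (kind, chunk) pairs, where kind is 'text' or 'format'."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(deformatted)]

if __name__ == "__main__":
    # Illustrative input only; real deformatter output may differ.
    for kind, chunk in split_superblanks(r"[<p>]Hello *mundo[<\/p>]"):
        print(kind, chunk)
```

Only the chunks labelled `text` would be offered to the user for editing; joining all chunks in order gives back the input, so the format blocks reach the reformatter unchanged.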

The proposed structure of the post-editing tool also allows the installation and configuration of new features and linguistic resources[6] in different languages, such as on-line dictionaries, spelling and grammar checkers (LanguageTool), synonyms, etc.

Once a text is translated, and before Apertium's reformatter is applied, some issues in the translation still need manual editing with the post-editing tool. This is the case, for example, for untranslated words (absent from the Apertium dictionaries) and for ambiguity. Apertium optionally marks words not found in its dictionaries (unknown words) with a '*' so they can be easily spotted in the resulting text. When there is no corresponding word in the target language, the word is left unchanged in the translation. Regarding ambiguity, Apertium can output the list of alternative translations so that the user can select the most appropriate one.
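
A minimal sketch of how the tool could locate the words marked with '*' is shown below. The function names are hypothetical, and the pattern assumes that the mark is placed immediately before the unknown word.

```python
import re

# A '*' that starts a word, followed by the word itself.
UNKNOWN_WORD = re.compile(r"(?<!\w)\*(\w+)")

def find_unknown_words(translation):
    """Return (offset, word) pairs for every word marked as unknown."""
    return [(m.start(), m.group(1)) for m in UNKNOWN_WORD.finditer(translation)]

def strip_unknown_marks(translation):
    """Remove the '*' marks once the user has reviewed the words."""
    return UNKNOWN_WORD.sub(r"\1", translation)

if __name__ == "__main__":
    text = "The *gat is on the *teulada."
    print(find_unknown_words(text))
    print(strip_unknown_marks(text))
```

The offsets let the interface highlight each unknown word in place, and the marks can be removed before the text is passed on.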

The list of linguistic resources can be quite vast. I therefore propose to focus the task on three language pairs (en<>es, en<>ca and es<>ca) and to use feedback from Apertium users and mentors to define and prioritise the list of features to be added. The initial list of features and linguistic resources in my proposal is the following:

  1. Spell-checker: With this feature, typing mistakes in the source language can be detected and easily corrected. For a misspelling in the source language, the tool provides suggestions not only in the source language but also in the target language, using the Apertium bilingual dictionaries.
  2. Word translation: If the spell-checker does not resolve an untranslated word using the Apertium bilingual dictionaries, the tool offers the possibility of using external resources for unknown words, and even for translated words.
  3. Disambiguation: This is the case when the Apertium engine gives more than one alternative for a translated word. The tool then provides definitions of these words on the fly from external resources and lets the user select the best option, suggesting a default and offering further alternatives in a hidden menu. Furthermore, two different sources from Apertium can be used: information on alternative translations generated through dixes[7], or information on homographs coded in the dictionaries but not available in the general Apertium translation flow after the POS tagger[8], which can mark ambiguous words with an '='.
  4. Tracking/logging system: It saves log information about the operations each user performs with the post-editing tool. It keeps track of disambiguation, word translation/replacement, deletion and editing. This feature will help improve the Apertium translation system and its dictionaries by extracting and analysing the logged information. That information could also be used for user dictionaries and, in the future, for integration with tools such as the Tradubi project[9].
  5. Translation memory: The post-editing tool will generate a translation memory in TMX[10] format using the already available Apertium tools for that purpose [11][12].
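
To make the translation memory feature more concrete, the sketch below writes post-edited segment pairs as a small TMX 1.4 document using only the Python standard library. It is not the existing Apertium TMX tooling referenced above; the function name `build_tmx` and the `creationtool` value are placeholders chosen for this example.

```python
import xml.etree.ElementTree as ET

def build_tmx(segment_pairs, source_lang, target_lang):
    """Serialise (source, target) segment pairs as a TMX 1.4 string."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "post-editing-sketch",  # placeholder name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source_text, target_text in segment_pairs:
        # One translation unit per pair, with one variant per language.
        unit = ET.SubElement(body, "tu")
        for lang, text in ((source_lang, source_text), (target_lang, target_text)):
            variant = ET.SubElement(unit, "tuv", {"xml:lang": lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

if __name__ == "__main__":
    pairs = [("The cat is on the roof.", "El gat és a la teulada.")]
    print(build_tmx(pairs, "en", "ca"))
```

Each pair would hold a source sentence and the translation as accepted by the post-editor, so the memory reflects the corrected output rather than the raw machine translation.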

How and who it will benefit in society

After translating with Apertium, revision work has to be done before a translation can be considered "adequate". An intelligent post-editing environment will help with this task. In this environment, typical translation mistakes that can be detected automatically (for example unknown words and homographs) could be highlighted so that they are taken into consideration during post-editing. Other typical mistakes could also be defined so that the post-editor is advised to check them.

Work plan

  • Week 1: Study of current post-editing solutions, including feedback from users of the Apertium platform
  • Week 2: Specification of the post-editing tool and its features (linguistic resources)
  • Week 3: Implementation of a basic post-editing tool
  • Week 4: Integration of the post-editing tool with Apertium

Deliverable #1 Goal: Integration of a basic post-editing tool with Apertium

  • Week 5: Implementation of new features + feedback from Apertium users
  • Week 6: Implementation of new features + feedback
  • Week 7: Implementation of new features + feedback
  • Week 8: Implementation of new features + feedback

Deliverable #2 Goal: Integration of a post-editing tool with Apertium, with 100% of the previously specified features

  • Week 9: Testing
  • Week 10: Testing
  • Week 11: Documentation: User guide
  • Week 12: General documentation of the project

Project completed

The work plan above comprises 60 hours of study, analysis and specification of the project; 60 hours to develop the core of the application; 120 hours to develop features and extend core functionality; 60 hours of testing; and 60 hours of documentation. This gives a total of 360 hours, or about 30 hours per week over 12 weeks. Every block includes periodic feedback from Apertium users and the project mentor in order to keep the project on track.

List your skills and give evidence of your qualifications

I am a Computer Scientist and Engineer from the Barcelona School of Informatics (www.fib.upc.edu). I also have a postgraduate degree in Open Source from the Technical University of Catalonia Foundation (www.fundacio.upc.edu).

I am currently a student in the Master in Business Administration at IESE Business School in Barcelona. Previously I worked at an open-source company for two years. My aim in doing the Master is to further develop my business administration skills in order to contribute successfully to the open-source community from the private sector in the future.

I have a strong background in open-source projects. On the one hand, my final degree project at the Barcelona School of Informatics was an open-source project that received awards from the Government of Catalonia and the Computer Science and Engineering Association of Catalonia. On the other hand, I worked at an open-source company for two years where, among many other projects, we carried out the first migration of a council to open source in Catalonia. During these periods I gained excellent skills in scripting languages such as Perl and PHP and in web development.

List any non-Summer-of-Code plans you have for the Summer

I am currently unemployed and looking forward to collaborating with the Summer of Code project before the Master resumes this September. I am therefore fully available to participate in this GSoC task.

References

[1] Apertium project. Apertium-view. http://wiki.apertium.org/wiki/Apertium-view

[2] Apertium project. Apertium web services. http://wiki.apertium.org/wiki/Apertium_web_service

[3] Wikipedia. WYSIWYG. http://en.wikipedia.org/wiki/WYSIWYG

[4] Apertium project. Apertium stream format. http://wiki.apertium.org/wiki/Apertium_stream_format

[5] Apertium project. Format handling. http://wiki.apertium.org/wiki/Format_handling

[6] Apertium project. Using linguistic resources. http://wiki.apertium.org/wiki/Using_linguistic_resources

[7] Apertium project. Apertium-dixtools. http://wiki.apertium.org/wiki/Apertium-dixtools

[8] Apertium project. Tagger. http://wiki.apertium.org/wiki/Tagger

[9] Tradubi project. Tradubi. http://www.tradubi.com

[10] Wikipedia. Translation Memory eXchange. http://en.wikipedia.org/wiki/Translation_Memory_eXchange

[11] Apertium project. Tools for TMX. http://wiki.apertium.org/wiki/Tools_for_TMX

[12] SourceForge. bitext2tmx. http://bitext2tmx.sourceforge.net