Difference between revisions of "User:Littleowl/Littleowl pet"

From Apertium
Revision as of 18:23, 8 April 2010

Apertium: Post-edition tool

GSoC on-line application

Abstract

My proposal is to create a web interface (the post-edition tool) built on the existing web viewer and web services of the Apertium project. It would also reuse an existing open-source WYSIWYG web editor, either adapting its functionality for the interface or using its existing API.

The tool includes features and linguistic resources that facilitate the post-edition of translated texts, such as spell checkers and on-line dictionaries. The list is subject to feedback from Apertium users.


Content

Name: Carles Sanz Casañas

E-mail address: carles.sanz@pangea.org

Other information that may be useful to contact you:


Why is it you are interested in machine translation?

I live in Catalonia, where there are two official languages, Catalan and Spanish. Documentation is therefore always written in Catalan or Spanish, or in English for international purposes. I believe that machine translation systems are key tools for improving communication within and between organizations in Catalonia and around the world.


Why is it that you are interested in the Apertium project?

I am really interested in Apertium because it is an open-source platform for the purpose mentioned above. I also like the democratic spirit of open-source projects, and I would be very excited to take the opportunity to collaborate on this kind of project.


Which of the published tasks are you interested in? What do you plan to do?

Title

Post-edition tool

Why Google and Apertium should sponsor it,

It is a tool to speed up the revision of Apertium translations, so it will reduce the cost and time of many translations.

Building on the existing web viewer[1] and web services[2] of the Apertium project, and reusing an existing open-source WYSIWYG web editor[3] whose functionality would be adapted for the interface, the post-edition tool adds a list of features to facilitate the post-edition of translated texts. Furthermore, the Apertium community has a couple of PHP-based web interfaces up and running, so their code could also be reused, and the Google Web Development Kit could speed up development further.

The post-edition tool allows users to translate texts and edit them before Apertium's reformatter runs, using a graphical web interface and the Apertium stream format[4]. This is possible using the existing deformatter and reformatter tools within the Apertium project[5].

The proposed structure of the post-edition tool also allows the installation and configuration of new features and linguistic resources[6] in different languages, such as on-line dictionaries, spelling and semantic checkers (LanguageTool), synonyms, etc.

Once a text is translated and before Apertium's reformatter runs, there are still some mistakes in the translation that need manual editing with the post-edition tool. This is the case, for example, with untranslated words, words unknown to the Apertium dictionaries, and ambiguity. Apertium optionally marks words not found in its dictionaries (unknown words) with a '*' so that they can be easily spotted in the resulting text. When there is no corresponding word in the target language, the word is kept unchanged in the translation. Regarding ambiguity, Apertium can include the list of alternative translations in the output so that the user can select the most appropriate one.
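As an illustration of how the tool could spot unknown words, here is a minimal Python sketch. It assumes the translated text is plain text in which each unknown word is prefixed by a single '*', as described above. The function names find_unknown_words and strip_marks are hypothetical and are not part of Apertium.

```python
import re

# Assumption: an unknown word appears as '*' immediately followed by the word.
# \w+ does not cover hyphens or apostrophes, so this pattern is only a sketch.
UNKNOWN_WORD = re.compile(r"\*(\w+)")


def find_unknown_words(translated_text):
    """Return (word, offset) pairs, where offset is the position of the '*'."""
    return [(m.group(1), m.start()) for m in UNKNOWN_WORD.finditer(translated_text)]


def strip_marks(translated_text):
    """Remove the '*' marks once the post-editor has reviewed the words."""
    return UNKNOWN_WORD.sub(r"\1", translated_text)


if __name__ == "__main__":
    sample = "La *foobar casa es *quux grande"
    for word, offset in find_unknown_words(sample):
        print(f"unknown word {word!r} at offset {offset}")
    print(strip_marks(sample))
```

The offsets could be used by the web editor to highlight each unknown word. A real implementation would more likely work on the Apertium stream format[4] than on plain text.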

The list of linguistic resources can be quite vast. I therefore propose to focus the task on three languages only (en, es, ca) and to use the feedback from Apertium users and mentors to define and prioritise the list of features to be added. The initial list of features and linguistic resources in my proposal is the following:

  1. Spell-checker: With this feature in the post-edition tool, typing mistakes in the source language can be detected and easily corrected. When faced with a misspelling in the source language, the tool provides suggestions not only in the source language but also in the target language, using the Apertium bilingual dictionaries. External dictionaries can also be used for words unknown to the Apertium resources.
  2. Disambiguation: This applies when the Apertium engine gives more than one alternative for a translated word. In that case the tool provides definitions of these words on the fly from Apertium and external resources, and lets the user select the best option by suggesting a default and offering further alternatives in a hidden menu. Two different sources from Apertium can be used: information on alternative translations generated through dixes[7], or information on homographs that is coded in the dictionaries but not available in the general Apertium translation flow after the POS tagger[8], which can mark ambiguous words with an '='.
  3. Word translation: Once the spelling of an original word (source language) has been checked and amended, the user needs to translate it. When faced with a misspelling in the source language, the tool directly suggests the translation of the intended source-language word. This can also be done using Apertium resources or external dictionaries.
  4. Tracking/logging system: It saves log information about the operations each user performs with the post-edition tool. It keeps track of disambiguation, word translation/replacement, deletion and editing by user. Extracting and analyzing the logged information will make it possible to improve the Apertium translation system and its dictionaries. That information could also be used for user dictionaries and, in the future, for integration with tools such as the Tradubi project[9].
  5. Translation memory: From the original source text and the post-edited translation, the post-edition tool generates a TMX[10] output file using the Apertium tools for TMX[11] and bitext2tmx[12].
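To give an idea of what the translation memory output in item 5 could look like, the following Python sketch builds a small TMX document from already aligned sentence pairs. It uses only the standard library and does not call the Apertium tools for TMX[11] or bitext2tmx[12]. The function name build_tmx, the header values and the assumption that sentences are already aligned are all illustrative.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Build a TMX 1.4 document from (source, post-edited target) sentence pairs.

    Assumption: the pairs are already aligned sentence by sentence.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "post-edition-tool-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",                # hypothetical version
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per sentence pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx([("The house is big.", "La casa es grande.")], "en", "es"))
```

In the actual tool, the pairs would come from the original text and the post-edited translation saved by the editor.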

How and who it will benefit in society,

After translating with Apertium, revision work has to be done before a translation can be considered "adequate". An intelligent post-edition environment will help with this task. In this environment, typical translation mistakes that can be detected automatically (for example unknown words and homographs) could be highlighted so that they are taken into consideration during post-edition. Other typical mistakes could also be defined so that the post-editor is advised to check them.

Work plan

  • Week 1: Study of current post-edition solutions. It also includes the feedback from users of the Apertium platform.
  • Week 2: Specification of the Post-edition tool and its features (linguistic resources)
  • Week 3: Implementation of a basic Post-edition tool
  • Week 4: Integration of the Post-edition tool with Apertium
  • Deliverable #1 Goal: Integration of a basic Post-edition tool with Apertium
  • Week 5: Implementation of new features + Feedback from Apertium users
  • Week 6: Implementation of new features + Feedback
  • Week 7: Implementation of new features + Feedback
  • Week 8: Implementation of new features + Feedback
  • Deliverable #2 Goal: Integration of a Post-edition tool with Apertium, including all the previously specified features
  • Week 9: Testing
  • Week 10: Testing
  • Week 11: Documentation: User guide
  • Week 12: General documentation of the project
  • Project completed

The work plan above includes 60 hours of study, analysis and specification of the project; 60 hours to develop the core of the application; 120 hours to develop features and extend core functionality; 60 hours of testing; and finally 60 hours of documentation. This adds up to 360 hours over 12 weeks, or 30 hours per week. Every block includes periodic feedback from Apertium users and the project mentor in order to keep track of the project.


List your skills and give evidence of your qualifications,

I am a Computer Scientist and Engineer from the Barcelona School of Informatics (www.fib.upc.edu). I also have a Postgraduate degree in Open-Source from the Technical University of Catalonia Foundation (www.fundacio.upc.edu).

I am currently a Master in Business Administration student at the IESE Business School in Barcelona. Previously I worked in an Open-Source company for two years. My aim in doing the Master is to further develop my business administration skills so that I can contribute successfully to the Open-Source community from the private sector in the future.

I have a strong background in Open-Source projects. On the one hand, my final degree project at the Barcelona School of Informatics was an Open-Source project that received awards from the Catalonia Government and the Computer Science and Engineering Association of Catalonia. On the other hand, I worked in an Open-Source company for two years where, among many other projects, we carried out the first migration of a Council to Open-Source in Catalonia. During these periods I gained excellent skills in scripting languages such as Perl and PHP, and in Web development.


List any non-Summer-of-Code plans you have for the Summer,

I am currently unemployed and looking forward to collaborating with the Summer-of-Code project before the Master begins again this September. I am therefore fully available to work on this GSoC task.


References

[1] Apertium project. Apertium-view. http://wiki.apertium.org/wiki/Apertium-view

[2] Apertium project. Apertium web services. http://wiki.apertium.org/wiki/Apertium_web_service

[3] Wikipedia. WYSIWYG. http://en.wikipedia.org/wiki/WYSIWYG

[4] Apertium project. Apertium stream format. http://wiki.apertium.org/wiki/Apertium_stream_format

[5] Apertium project. Format handling. http://wiki.apertium.org/wiki/Format_handling

[6] Apertium project. Using linguistic resources. http://wiki.apertium.org/wiki/Using_linguistic_resources

[7] Apertium project. Apertium-dixtools. http://wiki.apertium.org/wiki/Apertium-dixtools

[8] Apertium project. Tagger. http://wiki.apertium.org/wiki/Tagger

[9] Tradubi project. Tradubi. http://www.tradubi.com

[10] Wikipedia. Translation Memory eXchange. http://en.wikipedia.org/wiki/Translation_Memory_eXchange

[11] Apertium project. Tools for TMX. http://wiki.apertium.org/wiki/Tools_for_TMX

[12] SourceForge. bitext2tmx. http://bitext2tmx.sourceforge.net