Difference between revisions of "User:Littleowl/Littleowl pet"
'''Apertium: Post-editing tool''' |
|||
[http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/littleowl/t127071866711 GSoC on-line application] |
|||
== Abstract == |
|||
My proposal is to build a web interface (the post-editing tool) on top of the existing web viewer and web services of the Apertium project, reusing an existing open-source WYSIWYG web editor either by adapting its functionality for the interface or through its API.
|||
The tool offers a list of features and linguistic resources, such as spell checkers and on-line dictionaries, that make it easier to post-edit translated texts. The list is subject to feedback from Apertium users.
|||
== Content == |
|||
'''Name''': Carles Sanz Casañas |
|||
'''E-mail address''': carles.sanz@pangea.org |
|||
'''Other information that may be useful to contact you''': |
|||
* Cell phone: +34 605064191 |
|||
* SourceForge: http://littleowl.users.sourceforge.net/ |
|||
* IRC user: littleowl |
|||
* Wiki user: [[User:Littleowl]] |
|||
=== Why is it you are interested in machine translation? === |
|||
I live in Catalonia, where there are two official languages, Catalan and Spanish. Documentation is therefore always written in Catalan or Spanish, or even in English for international purposes. I believe that machine translation systems are key tools for improving communication within and between organizations in Catalonia and around the world.
|||
=== Why is it that you are interested in the Apertium project? ===
|||
I am really interested in Apertium because it is an open-source platform for the purpose mentioned above. I also like the democratic spirit of open-source projects, and I would be very excited to take the opportunity to collaborate on this kind of project.
|||
=== Which of the published tasks are you interested in? What do you plan to do? === |
|||
==== Title ==== |
|||
Post-editing tool |
|||
==== Why Google and Apertium should sponsor it ==== |
|||
It is a tool that speeds up the revision of Apertium translations, so it will reduce the cost and time of many translations.
|||
Starting from the existing web viewer[1] and web services[2] of the Apertium project, or by reusing an existing open-source WYSIWYG web editor[3] and adapting its functionality for the interface, the post-editing tool will provide the user with a list of features that make it easier to edit translated texts. Furthermore, the Apertium community has a couple of web-based PHP interfaces up and running, so their code could be reused, and the Google Web Development Kit could also speed up development.
|||
The post-editing tool lets the user translate texts and edit them before they reach Apertium's reformatter, using a graphical web interface and the Apertium stream format[4]. This is possible with the existing deformatter and reformatter tools of the Apertium project[5].
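
As an illustration of editing text between the deformatter and the reformatter, the following minimal Python sketch separates format blocks from translatable text. It assumes that format information is wrapped in square brackets (superblanks) and that the text contains no escaped brackets; the function names <code>split_segments</code> and <code>rebuild</code> are hypothetical and not part of Apertium.

```python
import re

# Assumption: format information appears as [ ... ] superblanks and the text
# itself contains no escaped "\[" or "\]" (a real parser must handle escapes).
SUPERBLANK = re.compile(r"(\[[^\]]*\])")


def split_segments(deformatted):
    """Split deformatted text into (kind, text) pairs.

    kind is "format" for superblanks and "text" for editable content.
    """
    segments = []
    for piece in SUPERBLANK.split(deformatted):
        if not piece:
            continue  # re.split yields empty strings at the edges
        kind = "format" if SUPERBLANK.fullmatch(piece) else "text"
        segments.append((kind, piece))
    return segments


def rebuild(segments):
    """Join the segments back together, ready for the reformatter."""
    return "".join(text for _, text in segments)


sample = "[<p>]Una casa *blava[</p>]"
segments = split_segments(sample)
print(segments)
```

The editor would expose only the "text" segments to the user and pass the "format" segments through untouched, so that the reformatter can restore the original document layout.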
|||
The proposed structure of the post-editing tool also allows new features and linguistic resources[6] to be installed and configured for different languages, such as on-line dictionaries, spelling and grammar checking (LanguageTool), synonym lookup, etc.
|||
Once a text is translated, and before it reaches Apertium's reformatter, some issues in the translation still need manual editing with the post-editing tool. This is the case, for example, of untranslated words (absent from the Apertium dictionaries) and of ambiguity. Apertium optionally marks words not found in its dictionaries (unknown words) with a '*' so they can be easily spotted in the resulting text. When there is no corresponding word in the target language, the word is kept unchanged in the translation. Regarding ambiguity, Apertium can include the list of alternative translations in the output so the user can select the most appropriate one.
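
The unknown-word marking described above could be detected with a few lines of code. This is only a sketch in Python: it assumes the '*' mark immediately precedes the unknown word and that a literal asterisk followed by a word does not otherwise occur in the text. The names <code>unknown_words</code> and <code>strip_marks</code> are hypothetical.

```python
import re

# Assumption: an unknown word is written as "*word" in the Apertium output.
UNKNOWN_WORD = re.compile(r"\*(\w+)")


def unknown_words(translated):
    """Return the unknown words in order of appearance, without the mark."""
    return UNKNOWN_WORD.findall(translated)


def strip_marks(translated):
    """Remove the '*' marks, e.g. once the user has reviewed every word."""
    return UNKNOWN_WORD.sub(r"\1", translated)


sample = "The *gat sleeps on the *catifa."
print(unknown_words(sample))
print(strip_marks(sample))
```

The post-editing interface could use the list returned by <code>unknown_words</code> to highlight each word and offer dictionary lookups for it.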
|||
The list of linguistic resources can be quite vast. Therefore I propose to focus the task on three language pairs (en<>es, en<>ca and es<>ca) and to use the feedback from Apertium users and mentors to define and prioritize the list of features to be added. The initial list of features and linguistic resources in my proposal is the following:
|||
# '''Spell-checker''': With this feature, typing mistakes in the source language can be detected and easily corrected. When it finds a misspelling in the source language, the tool provides suggestions not only in the source language but also in the target language, using the Apertium bilingual dictionaries.
|||
# '''Word translation''': When the spell-checker cannot resolve an untranslated word using the Apertium bilingual dictionaries, the tool offers the possibility of using external resources for unknown words, and even for any translated word.
|||
# '''Disambiguation''': This applies when the Apertium engine gives more than one alternative for a translated word. Ambiguity is incorporated into the translation flow by the POS tagger[7], which marks ambiguous words with an '=' symbol. Although this ambiguity does not currently reach the output, it is possible to send this information to the output of the system. With this information in the system output, the tool would be able to provide definitions of these words on the fly from external resources and let the user select the best option, suggesting a default option and offering further alternatives in a hidden menu. Furthermore, two different sources within Apertium can be used: information on alternative translations generated through dixes[8], or information on homographs coded in the dictionaries but not available in the general Apertium translation flow.
|||
# '''Tracking/logging system''': It saves log information about the operations each user performs with the post-editing tool. It keeps track of disambiguation, word translation/replacement, deletion and editing by user. Extracting and analyzing the logged information will help improve the Apertium translation system and its dictionaries. That information could also be used for user dictionaries and for integration with tools such as the Tradubi project[9] in the future.
|||
# '''Translation memory generation''': The post-editing tool will generate a translation memory in TMX[10] format using the Apertium tools already available for that purpose[11][12]. The translation memory will be generated from the original document sent to the Apertium engine and the final post-edited text.
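
To give an idea of what feature 5 would produce, here is a minimal Python sketch that writes already aligned sentence pairs as a TMX document using only the standard library. It is not the existing Apertium TMX tooling: the function name <code>build_tmx</code>, the header values and the assumption that source and post-edited sentences are already aligned one to one are all illustrative.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the "xml:lang" attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Build a TMX string from (source, post_edited) sentence pairs.

    Assumption: the pairs are already aligned sentence by sentence.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "post-editing-tool",  # hypothetical tool name
        "creationtoolversion": "0.1",         # hypothetical version
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        unit = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


document = build_tmx([("La casa blava", "The blue house")], "ca", "en")
print(document)
```

Each <code>tu</code> element holds one source sentence and its post-edited translation, so the resulting file could later be reused as a translation memory.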
|||
==== How and who it will benefit in society ==== |
|||
After translating with Apertium, revision work is needed before a translation can be considered "adequate". An intelligent post-editing environment will help with this task. In this environment, typical mistakes in the translation process that can be detected automatically (for example unknown words and homographs) could be highlighted so that they are taken into consideration during post-editing. Other typical mistakes could also be defined so that the post-editor is advised to check them.
|||
==== Work plan ==== |
|||
* '''Bonding period''': (four weeks) Study of current post-editing solutions and specification of the Post-editing tool and its features (linguistic resources), including feedback from users of the Apertium platform.
|||
<blockquote> |
|||
Study of Apertium resources regarding existing web developments. Some Apertium users mentioned a couple of web-based PHP interfaces, or even the use of the Google Web Development Kit. My initial proposal to use an existing open-source development will also be evaluated. Feedback from the Apertium community will be useful on this matter.
|||
</blockquote> |
|||
<blockquote> |
|||
Study of existing Apertium resources (Apertium bilingual dictionaries, LanguageTool, TMX tools, etc.) and how to use them through web services, as well as of external resources (dictionaries) for the three language pairs selected for this task. This study will produce the plan for developing both internal and external resources within the Post-editing tool.
|||
</blockquote> |
|||
* '''Week 1''': (start coding) Implementation of a basic Post-editing tool |
|||
<blockquote> |
|||
Initial development of the Post-editing tool building its structure and using Apertium development resources (SVN). |
|||
</blockquote> |
|||
* '''Week 2''': Post-editing tool and its tracking system by user |
|||
<blockquote> |
|||
Development of the tracking/logging system (feature 4): basically user authentication and the structure of the log. The tracking tokens will be filled in for each feature.
|||
</blockquote> |
|||
* '''Week 3''': Integration of the Post-editing tool with Apertium |
|||
<blockquote> |
|||
Integration and testing of internal and external resources using web services. It does not include the development of full features. It will lay the basis for full development in the following weeks: basically, it produces the engine of the tool, to which features will be added.
|||
</blockquote> |
|||
* '''Week 4''': Refinement and bug fixing of the Post-editing tool engine
|||
<blockquote> |
|||
Testing and feedback from Apertium Community. Start coding feature 1 (Spell-checker). |
|||
</blockquote> |
|||
'''Deliverable #1 Goal''': Integration of a basic Post-editing tool with Apertium |
|||
* '''Week 5''': Spell-checker |
|||
<blockquote> |
|||
Once the structure of the Post-editing tool has been built, development will focus on the list of features, starting with the spell-checker (feature 1).
|||
</blockquote> |
|||
* '''Week 6''': External resources |
|||
<blockquote> |
|||
Integration of external resources, dictionaries, for word translation and also disambiguation (features 2 & 3). At the end of this week features 1 & 2 are completed (completion includes tracking/logging details for both features). |
|||
</blockquote> |
|||
* '''Week 7''': Disambiguation |
|||
<blockquote> |
|||
This week is for disambiguation (feature 3) and its completion which includes tracking/logging details. |
|||
</blockquote> |
|||
* '''Week 8''': Translation Memory Generation |
|||
<blockquote> |
|||
Integration of the Translation memory generation feature. (feature 5) |
|||
</blockquote> |
|||
'''Deliverable #2 Goal''': Integration of a Post-editing tool with Apertium, with all the features specified above
|||
* '''Week 9''': Testing |
|||
<blockquote> |
|||
Period to test the Post-editing tool and to fix bugs. The feedback from Apertium Community will be important. |
|||
</blockquote> |
|||
* '''Week 10''': Testing |
|||
<blockquote> |
|||
Period to test the Post-editing tool and to fix bugs. The feedback from Apertium Community will be important. |
|||
</blockquote> |
|||
* '''Week 11''': Documentation: User guide |
|||
<blockquote> |
|||
User guide for all features and the web interface. |
|||
</blockquote> |
|||
* '''Week 12''': General documentation of the project |
|||
<blockquote> |
|||
Documentation of the project. It will include previous study (weeks 1&2), development tips, the User Guide, the cost of the project and future of the application. |
|||
</blockquote> |
|||
'''Project completed''' |
|||
The work plan above includes 4 weeks during the bonding period for study, analysis and specification of the project; another 4 weeks to develop the core of the application (its engine); another 4 weeks to develop features and extend the core functionality; 2 weeks of testing; and finally 2 weeks of documentation. Every block includes periodic feedback from Apertium users and the project mentor in order to keep track of the project.
|||
=== List your skills and give evidence of your qualifications === |
|||
I hold a degree in Computer Science and Engineering from the Barcelona School of Informatics (www.fib.upc.edu). I also have a postgraduate degree in Open Source from the Technical University of Catalonia Foundation (www.fundacio.upc.edu).
|||
I am currently a student in the Master in Business Administration at the IESE Business School in Barcelona. Previously I worked at an open-source company for two years. My aim in doing the Master is to further develop my business administration skills in order to contribute successfully to the open-source community from the private sector in the future.
|||
I have a strong background in open-source projects. On the one hand, my final degree project at the Barcelona School of Informatics was an open-source project that received an award from the Catalan Government and the Computer Science and Engineering Association of Catalonia. On the other hand, I worked at an open-source company for two years where, among many other projects, we carried out the first migration of a council to open source in Catalonia. During these periods I gained excellent skills in scripting languages such as Perl and PHP and in web development.
|||
=== List any non-Summer-of-Code plans you have for the Summer === |
|||
I am currently unemployed and looking forward to collaborating with the Summer of Code project before the Master resumes this September. I am therefore fully available to work on this GSoC task.
|||
=== References === |
|||
[1] Apertium project. ''Apertium-view''. http://wiki.apertium.org/wiki/Apertium-view |
|||
[2] Apertium project. ''Apertium web services''. http://wiki.apertium.org/wiki/Apertium_web_service |
|||
[3] Wikipedia. ''WYSIWYG''. http://en.wikipedia.org/wiki/WYSIWYG |
|||
[4] Apertium project. ''Apertium stream format''. http://wiki.apertium.org/wiki/Apertium_stream_format |
|||
[5] Apertium project. ''Format handling''. http://wiki.apertium.org/wiki/Format_handling |
|||
[6] Apertium project. ''Using linguistic resources''. http://wiki.apertium.org/wiki/Using_linguistic_resources |
|||
[7] Apertium project. ''Tagger''. http://wiki.apertium.org/wiki/Tagger |
|||
[8] Apertium project. ''Apertium-dixtools''. http://wiki.apertium.org/wiki/Apertium-dixtools |
|||
[9] Tradubi project. ''Tradubi''. http://www.tradubi.com |
|||
[10] Wikipedia. ''Translation Memory eXchange''. http://en.wikipedia.org/wiki/Translation_Memory_eXchange |
|||
[11] Apertium project. ''Tools for TMX''. http://wiki.apertium.org/wiki/Tools_for_TMX |
|||
[12] SourceForge. ''bitext2tmx''. http://bitext2tmx.sourceforge.net