Google Summer of Code/Wrap-up Report 2009
Latest revision as of 15:54, 12 September 2009
The Apertium Project works on open-source machine translation and language technology. We try to focus our efforts on lesser-resourced and marginalised languages, but also work with larger languages. To date, we have released translators for 21 language pairs, covering languages spoken by 1.1bn people, from English (est. 500m speakers) to Aranese (est. 4,000 speakers). A similar number of other language pairs are in development. The Apertium software is licensed under the GPL and, more unusually in the machine translation field, so is the data for all these language pairs. This means that the data can be re-used by other language projects (e.g. in developing spelling or grammar checkers, thesauri, etc.).
This was our first year in Google Summer of Code and we were very fortunate to receive nine student slots. We filled them with some great students and are pleased to report that, out of the nine projects, eight were successful. Along with their end-of-project reports, students have also been invited to write papers with their mentors for review in an academic workshop on free and open-source rule-based machine translation that we are organising with the mentors' money. Seven of the nine members of the programme committee were GSoC mentors or administrators with Apertium, and three of the organisers were mentors and one was a student.
A translator for Norwegian Bokmål (nb) and Norwegian Nynorsk (nn)
This project was accepted as part of our "adopt a language pair" idea from our ideas page. Some work had already been done on the translator but it was a long way from finished. Kevin Unhammer from the University of Bergen was mentored by Trond Trosterud from the University of Tromsø. The final result, after an epic effort, is a working translator (indeed the first free software translator for nb-nn) that makes a mistake in only 11 words out of every 100 translated, which makes post-editing its output feasible.
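The report does not say how the "11 words out of every 100" figure was measured. One common way to arrive at such a number is word error rate: the word-level edit distance between the raw machine translation and its post-edited version, divided by the length of the post-edited text. The sketch below assumes that definition; the function name and the whitespace tokenisation are illustrative choices, not taken from the project.

```python
from typing import List


def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: 'hypothesis' is raw MT output and 'reference' is the
    post-edited text; both are tokenised on whitespace.
    """
    hyp: List[str] = hypothesis.split()
    ref: List[str] = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the hypothesis words seen
    # so far (excluding the current one) and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            substitution = prev[j - 1] + (0 if h == r else 1)
            deletion = prev[j] + 1      # drop a hypothesis word
            insertion = curr[j - 1] + 1  # add a missing reference word
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a rate of 0.11 would mean that roughly 11 word-level corrections are needed per 100 words of post-edited text.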
One of the key aspects of Kevin's work was the re-use and adaptation of existing open source resources. Much of the bilingual dictionary was statistically inferred from the existing translations in KDE, using ReTraTos and GIZA++ (created by Franz Och, now a research scientist at Google Translate). In addition to this, Kevin used the Oslo-Bergen Constraint Grammar, contributing fixes not only to that, but to the VISL CG3 software itself. Since the GSoC deadline, Kevin has continued his work, including incorporating changes based on feedback from the Nynorsk Wikipedia.
A translator for Swedish (sv) to Danish (da)
In another language pair adoption, Michael Kristensen, who had previously done some work on this translator, was mentored by Jacob Nordfalk, the author of our English to Esperanto translator. As there are very few free linguistic resources for Swedish and Danish, the work was pretty much started from scratch, although we took great advantage of the Swedish Wiktionary. The translator is unidirectional, from Swedish to Danish only, and it has an error rate of around 20%.
The completion of this translator is something of a triumph for Apertium. Begun back in 2005, the project had been neglected for many years. This was the first translator for the Apertium platform that focussed on non-Romance languages, although Michael was forced to redo and correct much of the work that had been done.
Multi-engine machine translation
Gabriel Synnaeve was mentored by Francis Tyers to work on a module to improve the quality of machine translation by taking translations from different systems, merging their strengths and discarding their weaknesses. The two systems focussed on in the initial prototype are Apertium (rule-based MT) and Moses (statistical MT), but it can easily be extended to more. The idea behind the system is that for some languages there is often not one MT system which is better than all others; rather, some are better at some phrases and some are better at others. Thus, if we can combine the output of two or more systems with different strengths and weaknesses, we can make better translations.
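The simplest form of this idea is segment-level selection: for each segment, keep whichever engine's output a scoring function prefers. The sketch below illustrates only that idea and is not the project's merging algorithm, which is not described here. The scorer is a toy add-one-smoothed unigram model, and all names (`make_unigram_scorer`, `select_best`, the engine labels) are hypothetical.

```python
import math
from typing import Callable, Dict, List


def make_unigram_scorer(counts: Dict[str, int]) -> Callable[[str], float]:
    """Build a toy target-language scorer from word counts (illustrative)."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def score(sentence: str) -> float:
        words = sentence.lower().split()
        if not words:
            return float("-inf")
        # Average add-one-smoothed log-probability per word.
        log_probs = (
            math.log((counts.get(w, 0) + 1) / (total + vocab)) for w in words
        )
        return sum(log_probs) / len(words)

    return score


def select_best(
    candidates: Dict[str, List[str]], score: Callable[[str], float]
) -> List[str]:
    """For each segment, keep the candidate translation with the best score.

    'candidates' maps an engine name to its list of translated segments;
    all engines must have translated the same number of segments.
    """
    engines = list(candidates)
    n_segments = len(candidates[engines[0]])
    if any(len(candidates[e]) != n_segments for e in engines):
        raise ValueError("all engines must return the same number of segments")

    merged = []
    for i in range(n_segments):
        best_engine = max(engines, key=lambda e: score(candidates[e][i]))
        merged.append(candidates[best_engine][i])
    return merged
```

A real combiner would work below the segment level and use a much stronger scoring model, but the structure (several candidate outputs, one selection criterion) is the same.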
Perhaps the most exciting aspect of the MEMT project is its potential for use as a research platform for future work on hybrid machine translation, by allowing the researcher to focus only on the algorithms they wish to implement. During the project, Gabriel was joined by Francis in person for a 'mini-hackathon', which, despite something of a farcical start involving requests made on IRC for phone calls across Europe on behalf of two people who were in the same city, led to a greater degree of functionality and modularisation in the code.
Highly scalable web service architecture for Apertium
Víctor Manuel Sánchez Cartagena worked with mentor Juan Antonio Perez-Ortiz on a highly-scalable web service architecture, or, Apertium for Cloud computing. Initially targeting Amazon's EC2, as well as standalone servers, the scalable web service allows the use of multiple translation services on multiple physical or virtual servers, scaling to meet the translation demands of users, from a single user-facing service, which implements the Google Language API.
The core of the system is the translation router, which controls the flow between user and translation server based on a variety of factors, including the availability of the language pair and the current load on the server. It also provides a framework that allows these factors to be given different priorities on a per-user basis. When used on Amazon EC2, it also takes into account the cost of each translation request. The project is a complete package: as well as the router, it includes a translation daemon and convenience scripts to ease the rollout of server instances.
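The routing decision described above can be sketched as a weighted choice among the servers that offer the requested language pair. This is a minimal illustration under stated assumptions, not the project's code: the `Server` and `Priorities` types, the linear penalty, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Server:
    name: str
    pairs: FrozenSet[str]  # language pairs installed, e.g. "es-ca"
    load: float            # 0.0 (idle) to 1.0 (saturated)
    cost: float            # relative cost per request (0.0 for owned hardware)


@dataclass(frozen=True)
class Priorities:
    """Per-user weights for the routing factors (hypothetical)."""
    load_weight: float = 1.0
    cost_weight: float = 0.0


def route(pair: str, servers: List[Server], prio: Priorities) -> Optional[Server]:
    """Return the eligible server with the lowest weighted penalty.

    Returns None when no server offers the requested language pair.
    """
    eligible = [s for s in servers if pair in s.pairs]
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda s: prio.load_weight * s.load + prio.cost_weight * s.cost,
    )
```

With the default priorities only load matters, so a busy local machine loses to an idle cloud instance; a user who also weights cost may be routed to the local machine instead.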
In addition to his work on his project, Víctor is also serving as an organiser for the FreeRBMT workshop.
Conversion of Anubadok
Abu Zaher was mentored by Kevin Donnelly and Francis Tyers to convert Anubadok, an open-source MT system for English to Bengali, to work with the Apertium engine. This was an ambitious project and not all of the goals were realised, but we were able to make the first wide-coverage morphological analyser / generator for Bengali and a substantial amount of lexical transfer, so the project was a great success.
Zaher is also looking at improving the Ankur spell checker with information from his analyser / generator, so the work done is already being reused; there is also interest in using the data to create a Bengali stemmer, for more efficient searching/indexing of Bengali texts, and a number of tools which were created to model the various aspects of Bengali inflection will certainly prove useful in other areas of NLP for Bengali.
Apertium going SOA
Pasquale Minervini worked with Jimmy O'Regan on the project Apertium going SOA. Pasquale's work was motivated by the needs of Informatici senza Frontiere to have a translation engine that would fit into a service-oriented architecture. To this end, Pasquale designed an XML-RPC-based server that efficiently contains the Apertium pipeline, and layered it with JSON, SOAP, and CORBA services, which, as well as making Apertium more buzzword compliant, gives a greater range of options to programmers wishing to integrate Apertium's translation services into a wider range of architectures. This is undoubtedly a popular project idea: Alexa's keywords for Apertium show 'apertium going soa' and 'deadbeef apertium' (deadbeef is Pasquale's IRC nick) in 2nd and 4th place for search keywords leading to Apertium.
Because of the potential overlap between their projects, in the first weeks of their GSoC work, Pasquale and Víctor agreed on the Google Language API as a standard for their projects to communicate; Pasquale took this agreement one step further by implementing the 'language detection' feature of the API - something previously unavailable in Apertium. In addition to that, Pasquale also contributed memory leak checks against the Apertium platform, as well as other fixes, and has helped another (non-GSoC) student in the goal of porting Apertium to Windows.
Trigram part-of-speech tagging
Zaid Md. Abdul Wahab Sheikh was mentored by Felipe Sánchez Martínez to improve our part-of-speech tagging module to use trigrams instead of bigrams, and to change the training tools so that they create data for it.
Apertium was originally designed for closely related languages, but is increasingly growing to meet the challenges of translating between more distant languages. One of the unique aspects of Dr. Sánchez's work on part-of-speech tagging is the use of target language information to train the tagger, which allows an accurate tagger to be trained using much less data than usual, provided that it is trained on bilingual text. Zaid's work builds on Dr. Sánchez's work with first-order Hidden Markov Models, extending it to second-order HMMs, similarly to TnT. This enables more accurate translation between more distant languages, using the same methods, so that the rest of the Apertium system can continue to grow.
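In a second-order HMM, the probability of a tag depends on the two preceding tags rather than one, so decoding tracks pairs of tags as states. The sketch below is a generic Viterbi decoder for such a model, not the Apertium tagger's code or data format. The function name, the flooring of unseen events at a small constant, and the probability tables used with it are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple

START = "<s>"  # padding tag for the two positions before the sentence


def viterbi_trigram(
    words: List[str],
    tags: List[str],
    trans: Dict[Tuple[str, str, str], float],
    emit: Dict[Tuple[str, str], float],
    floor: float = 1e-6,
) -> List[str]:
    """Most likely tag sequence under a second-order (trigram) HMM.

    trans[(t1, t2, t3)] is P(t3 | t1, t2); emit[(tag, word)] is P(word | tag).
    Unseen events get the probability 'floor' (a crude stand-in for smoothing).
    """
    def log_p(table: dict, key: tuple) -> float:
        return math.log(table.get(key, floor))

    # best[(u, v)] = (log-probability, tags so far) of the best path
    # whose last two tags are u and v.
    best = {(START, START): (0.0, [])}
    for word in words:
        new_best: Dict[Tuple[str, str], Tuple[float, List[str]]] = {}
        for (u, v), (score, path) in best.items():
            for t in tags:
                s = score + log_p(trans, (u, v, t)) + log_p(emit, (t, word))
                if (v, t) not in new_best or s > new_best[(v, t)][0]:
                    new_best[(v, t)] = (s, path + [t])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]
```

The state space grows from the number of tags to the number of tag pairs, which is the main cost of moving from bigrams to trigrams; the benefit is that a decision such as noun versus verb can take two tags of left context into account.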
Java port of lttoolbox
Raphaël Laurent worked with Sergio Ortiz Rojas to port lttoolbox to Java. lttoolbox is the core component of the Apertium system; as well as providing morphological analysis and generation, it also provides pattern matching and dictionary lookup to the rest of Apertium, so a Java port is the first step towards a version of Apertium for Java-based devices. Raphaël finished an earlier line-for-line port contributed by Nic Cotrell, first making it work, then making it binary compatible.
As it stands currently, lttoolbox-java can be integrated into other Java-based tools, facilitating the re-use of our software and our extensive repository of morphological analysers. Tools such as LanguageTool, the open source proofreading tool, also make extensive use of morphological analysis, and OmegaT, the open source CAT tool, could use it for dictionary look-up of inflected words; it could even be used with our own apertium-morph tool, a plugin for Lucene that allows linguistically rich document indexing.

