Difference between revisions of "Google Summer of Code/Wrap-up Report 2009"
Revision as of 10:36, 12 September 2009
This was our first year in Google Summer of Code and we were very fortunate to receive nine student slots. We filled them with some great students and are pleased to report that eight of the nine projects were successful. In addition to writing their end-of-project reports, students have been invited to write papers with their mentors for review at an academic workshop on free and open-source rule-based machine translation, which we are organising with the mentors' money.
A translator for Norwegian Bokmål (nb) and Norwegian Nynorsk (nn)
This project was accepted as part of the "adopt a language pair" idea from our ideas page. Some work had already been done on the translator, but it was a long way from finished. Kevin Unhammer from the University of Bergen was mentored by Trond Trosterud from the University of Tromsø. The final result, after an epic effort, is a working translator (indeed the first free software translator for nb-nn) that makes a mistake in only 11 words out of every 100 translated, which makes post-editing its output feasible.
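The report does not say how the 11-in-100 figure was measured. One common way to express such a figure is word error rate: the word-level edit distance between the system output and a post-edited reference, divided by the length of the reference. The sketch below assumes that measure; the function name and the sample sentences are illustrative and not taken from the project.

```python
def word_error_rate(hypothesis, reference):
    """Word-level edit distance between hypothesis and reference,
    divided by the number of words in the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the words of hyp processed
    # so far (excluding the current one) and the first j words of ref.
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            current.append(min(
                previous[j] + 1,             # drop a word from the hypothesis
                current[j - 1] + 1,          # add a word from the reference
                previous[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1] / len(ref)


# Toy example: one wrong word out of four.
rate = word_error_rate("a b c d", "a b x d")
```

Under this measure, 11 errors per 100 words corresponds to a rate of 0.11.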
One of the key aspects of Kevin's work was the re-use and adaptation of existing open source resources. Much of the bilingual dictionary was statistically inferred from the existing translations in KDE, using ReTraTos and GIZA++ (created by Franz Och, now a research scientist at Google Translate). In addition to this, Kevin used the Oslo-Bergen Constraint Grammar, contributing fixes not only to that, but to the VISL CG3 software itself.
A translator for Swedish (sv) to Danish (da)
This was another language pair adoption. Michael Kristensen, who had previously done some work on this translator, was mentored by Jacob Nordfalk, the author of our English to Esperanto translator. As there are very few free linguistic resources for Swedish and Danish, the work was pretty much started from scratch, although we took great advantage of the Swedish Wiktionary. The translator is unidirectional, from Swedish to Danish only, and has an error rate of around 20%.
Multi-engine machine translation
Gabriel Synnaeve was mentored by Francis Tyers to work on a module that improves the quality of machine translation by taking translations from different systems, merging their strengths and discarding their weaknesses. The two systems focussed on in the initial prototype are Apertium (rule-based MT) and Moses (statistical MT), but it can easily be extended to more. The idea behind the system is that for some languages there is often no single MT system that is better than all the others; rather, some are better at some phrases and some are better at others. Thus, if we can combine the output of two or more systems with different strengths and weaknesses, we can make better translations.
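The report does not describe how the module merges the outputs. As a minimal sketch of the general idea only, the code below keeps, for each segment, whichever system output scores highest under a scoring function. The scorer (the share of words found in a small vocabulary), the vocabulary, and the sample outputs are all assumptions made for illustration; they are not the project's method or data.

```python
# Hypothetical target-language vocabulary, used only by the toy scorer.
KNOWN_WORDS = {"the", "house", "is", "big", "he", "reads", "a", "book"}


def known_word_ratio(sentence):
    """Toy score: the share of tokens that appear in KNOWN_WORDS."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(token in KNOWN_WORDS for token in tokens) / len(tokens)


def combine(segment_outputs, score):
    """For each segment, keep the highest-scoring system output."""
    return [max(outputs, key=score) for outputs in segment_outputs]


# Each inner list holds the outputs of two imagined systems for one segment.
segment_outputs = [
    ["the house is *stor", "the house is big"],
    ["he reads a book", "he read a a book"],
]
merged = combine(segment_outputs, known_word_ratio)
```

Here the second system wins the first segment and the first system wins the second, which is the situation the paragraph above describes: neither system is better everywhere.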
Apertium webservice
Conversion of Anubadok
Abu Zaher was mentored by Kevin Donnelly and Francis Tyers to convert Anubadok, an open-source MT system for English to Bengali, to work with the Apertium engine. This was an ambitious project and not all of the goals were realised, but we were able to make the first wide-coverage morphological analyser / generator for Bengali and a substantial amount of lexical transfer, so the project was a great success.
Zaher is also looking at improving the Ankur spell checker with information from his analyser / generator, so the work done is being reused for other things.
Apertium scaleable architecture
Trigram part-of-speech tagging
Zaid Md. Abdul Wahab Sheikh was mentored by Felipe Sánchez Martínez to improve our part-of-speech tagging module to use trigrams instead of bigrams. This gives more context for disambiguation, which will hopefully result in more accurate tagging. The project was successful, with all the coding done, including adaptation for target-language mediated training.
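To illustrate why the extra context can help, the sketch below counts tag sequences in a tiny invented corpus and picks the candidate tag seen most often after the preceding one tag (bigram) or two tags (trigram). It uses raw counts only and is not the Apertium tagger's actual model; the corpus, the tag names, and the function names are assumptions made for the example.

```python
from collections import Counter


def tag_ngrams(tags, n):
    """Return every run of n consecutive tags, padded with start markers."""
    padded = ["<s>"] * (n - 1) + list(tags)
    return [tuple(padded[i:i + n]) for i in range(len(tags))]


def count_ngrams(corpus, n):
    """Count tag n-grams over a list of tag sequences."""
    counts = Counter()
    for tags in corpus:
        counts.update(tag_ngrams(tags, n))
    return counts


def pick_tag(counts, history, candidates, n):
    """Pick the candidate seen most often after the last n-1 tags."""
    context = tuple(history[-(n - 1):])
    return max(candidates, key=lambda tag: counts[context + (tag,)])


# Invented training data: tag sequences only, no words.
corpus = [
    ["det", "n", "vblex"],
    ["det", "n", "vblex"],
    ["adj", "n", "n"],
    ["adj", "n", "n"],
    ["adj", "n", "n"],
]
bigram_counts = count_ngrams(corpus, 2)
trigram_counts = count_ngrams(corpus, 3)

history = ["det", "n"]       # tags already assigned
candidates = ["n", "vblex"]  # possible tags for the next, ambiguous word
bigram_choice = pick_tag(bigram_counts, history, candidates, 2)
trigram_choice = pick_tag(trigram_counts, history, candidates, 3)
```

With only the previous tag `n` as context, the counts favour `n`; once the tag before that (`det`) is also taken into account, the counts favour `vblex`.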
Java port of lttoolbox
Raphaël Laurent worked with Sergio Ortiz Rojas to port lttoolbox to Java. lttoolbox is the core component of the Apertium system: as well as providing morphological analysis and generation, it provides pattern matching and dictionary lookup to the rest of Apertium, so a Java port is the first step towards a version of Apertium for Java-based devices. Raphaël finished an earlier line-for-line port contributed by Nic Cotrell, first making it work, then making it binary compatible.
As it currently stands, lttoolbox-java can be integrated into other Java-based tools, facilitating the re-use of our software and our extensive repository of morphological analysers. Tools such as LanguageTool, the open source proofreading tool, make extensive use of morphological analysis; OmegaT, the open source CAT tool, could use it for dictionary look-up of inflected words; and it could even be used with our own apertium-morph tool, a plugin for Lucene that allows linguistically-rich document indexing.