Google Summer of Code/Report 2009
Norwegian Nynorsk and Bokmål (Unhammer)
apertium-nn-nb is now in a fairly usable state for translating both from Nynorsk to Bokmål and from Bokmål to Nynorsk.
The bidix currently has a little over 50000 entries (46000 discounting restrictions), and the dictionaries are consistent, i.e. all entries translate without #, /, or @ marks... from what I can tell ;-) The bidix initially contained about 36000 entries that had exact translations and almost no restrictions. I added some entries semi-automatically by changing substrings (e.g. adjectives and adverbs ending in -lig in nb typically end in -leg in nn) and checking whether the result existed in the other monodix, some by running poterminology and then ReTraTos/Giza++ on the KDE4 corpus of .po files, and some by running ReTraTos/Giza++ on bitextor output. The rest were more or less manually added and checked.
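The substring approach can be sketched roughly as follows. This is only an illustration of the idea, not the script that was actually used: the function name `suffix_candidates` and the toy word lists are made up for the example.

```python
def suffix_candidates(nb_lemmas, nn_lemmas, nb_suffix="lig", nn_suffix="leg"):
    """Propose (nb, nn) pairs by swapping a suffix and keeping only
    candidates that already exist in the other language's word list."""
    nn_known = set(nn_lemmas)
    pairs = []
    for nb in nb_lemmas:
        if nb.endswith(nb_suffix):
            nn = nb[: -len(nb_suffix)] + nn_suffix
            if nn in nn_known:  # the check against the other monodix
                pairs.append((nb, nn))
    return pairs


# Toy word lists (illustrative only, not taken from the real dictionaries).
nb_words = ["vanlig", "farlig", "hus", "tydelig"]
nn_words = ["vanleg", "farleg", "hus"]

candidates = suffix_candidates(nb_words, nn_words)
print(candidates)
```

Candidates produced this way still need a manual look, since a word that exists in both monodixes is not guaranteed to be the right translation.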
I've pretty much stuck to my original GSoC week plan; converting the Oslo-Bergen Constraint Grammar disambiguator to Apertium tags went fairly easily, as did adding corrections where I found things to correct (these were of course reported "upstream"). I had to write the transfer rules from scratch, but this also went quite easily with some help from various Apertiumers and of course the Apertium documentation; nn-nb uses only one-stage transfer since the languages are quite closely related. The 33 nb=>nn rules correctly transfer both genitive noun phrase differences (with adjectives) and passive verb differences, in addition to adjective and determiner congruence. Of course, there is a lot more that can be done here... as well as with CG.
A side-effect of my Apertium work is dix-mode.el ( http://wiki.apertium.org/wiki/Emacs ), a minor mode for editing .dix files in Emacs. In case Apertium ever gets more than one contributor using Emacs.
Results of an initial WER test are at http://wiki.apertium.org/wiki/Norsk#WER-test_28.2F8_2009 (comparing MT output to the post-edited version); 11% with 64% free rides mostly due to names and terminology/loan words. With the mediawiki formatters under way, I have hopes that Apertium could help nn.wikipedia.org catch up with the Bokmål version...
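For readers unfamiliar with the metric, WER is the word-level edit distance between the MT output and the post-edited text, divided by a text length. A minimal sketch, assuming normalisation by the length of the post-edited text (the evaluation script actually used may differ in details); `word_error_rate` and the example sentences are made up for illustration:

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by the reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    # prev[j] = edit distance between the hyp words seen so far and ref[:j]
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hyp word
                            curr[j - 1] + 1,      # insert a ref word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0


mt_output = "eg har ein bil"
post_edited = "eg har ein raud bil"
wer = word_error_rate(mt_output, post_edited)
print(f"WER: {wer:.0%}")
```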
Swedish to Danish (mkrist)
Conversion of Anubadok (darthxaher) [draft]
Using Anubadok as a reference, my primary project goal was to create a functional English to Bengali translation system. The project consisted of three main stages: (a) create a morphological generator and analyzer, (b) create a Bdix, and (c) create a transfer system.
Creating a morphological analyzer/generator took most of my time. Anubadok uses the Penn tagset, which is fairly well aligned with the tagset I'm using right now. However, Apertium requires more information per word (number, gender, animacy), so I had to tag a lot of words manually. Anubadok's dix is not frequency-based, so many of the tagged words may be archaic or not frequently used. I therefore tried tagging the most frequently used missing words. Right now we have roughly 68% coverage of the 20K most frequently used words. At first I focused strictly on creating a generator, but later worked on the analyzer too (given the highly inflectional nature of Bengali, creating an analyzer is a bit harder because of all the alternate forms of a word).
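A coverage figure of this kind can be computed along the following lines. This is a sketch that assumes coverage is counted over a list of distinct frequent words; the actual measurement may have been done differently, and the function name and placeholder words are invented for the example.

```python
def coverage(frequent_words, known_words):
    """Fraction of the frequent-word list that the dictionary recognises."""
    known = set(known_words)
    if not frequent_words:
        return 0.0
    hits = sum(1 for word in frequent_words if word in known)
    return hits / len(frequent_words)


# Placeholder tokens standing in for a frequency list and the monodix entries.
top_words = ["w1", "w2", "w3", "w4"]
monodix_words = {"w1", "w3", "w4", "w9"}

ratio = coverage(top_words, monodix_words)
print(f"coverage: {ratio:.0%}")
```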
The Bdix has most of the entries inherited from the Bengali monodix, but some new ones are still pending.
The Transfer System is at a very simple stage right now and will need a lot of tweaking.
Statistics
Apertium going SOA (deadbeef)
The aim of this project was to design and implement an "Apertium Service" that can be easily integrated into IT systems built on a Service-Oriented Architecture (SOA). Two fundamental requirements for this service were:
- It should be easy to integrate into new and existing applications and services;
- It should scale efficiently to a large number of concurrent requests.
Multi-engine machine translation (snippyhollow)
This work is based on the article by S. Jayaraman & A. Lavie (2005), "Multi-Engine Machine Translation Guided by Explicit Word Matching", ACL 2005. The goal was to provide the free software community with a Multi-Engine Machine Translation (MEMT) system licensed under the GPL. The MEMT system is fully operational, even though the matching and stemming are quite basic and seem to be a bottleneck for high-quality hypothesis generation.
The system works sentence by sentence and takes n outputs from n different MT engines as input. It matches the words in the translations using a stemmer and a case-insensitive match. This matching is then used to construct an alignment of the words, or of n-grams. The alignment is used to generate hypotheses, which are then ranked using their construction score together with the score of a statistical ranker (IRSTLM in the current implementation), so that the system outputs the best hypothesis.
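The matching step can be illustrated with a small sketch. It is not the code in apertium-combine: `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer, and `match_words` is a made-up greedy matcher that pairs each word at most once.

```python
def crude_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Very rough stand-in for a stemmer: lower-case, strip one known suffix."""
    w = word.lower()
    for suffix in suffixes:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def match_words(hyp_a, hyp_b):
    """Return (i, j) index pairs of matching words in two engine outputs.

    Words match if they are equal ignoring case or share a crude stem.
    Each word of hyp_b is used at most once, scanning left to right.
    """
    a, b = hyp_a.split(), hyp_b.split()
    used = set()
    matches = []
    for i, word_a in enumerate(a):
        for j, word_b in enumerate(b):
            if j in used:
                continue
            if word_a.lower() == word_b.lower() or crude_stem(word_a) == crude_stem(word_b):
                matches.append((i, j))
                used.add(j)
                break
    return matches


pairs = match_words("The cats walked home", "the cat walks home")
print(pairs)
```

A matcher this simple shows why basic matching and stemming limit hypothesis quality: synonyms and irregular forms are never paired.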
IRSTLM and MOSES are needed for compilation, and a language model of the target language is needed for ranking with IRSTLM. For more details about the setup, see "apertium-combine/README". The system is designed as a sequence of strategy patterns, as seen in "apertium-combine/memt/apertium_combiner.cc". To change any or all of the matcher/aligner/generator/ranker, which can be loaded dynamically, one only has to concentrate on the logic part of the code.
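The strategy-pattern idea can be sketched in a few lines. The real implementation is C++ and has four stages; this illustration shows only two of them (generator and ranker), and all names in it (`Combiner`, `passthrough_generator`, `length_ranker`) are hypothetical.

```python
def passthrough_generator(translations):
    """Trivial generator: each distinct engine output is itself a hypothesis."""
    return list(dict.fromkeys(translations))


def length_ranker(hypothesis):
    """Toy ranker standing in for a language-model score: shorter is better."""
    return -len(hypothesis.split())


class Combiner:
    """Runs the stages in sequence; each stage is a swappable strategy."""

    def __init__(self, generator, ranker):
        self.generator = generator
        self.ranker = ranker

    def combine(self, translations):
        hypotheses = self.generator(translations)
        return max(hypotheses, key=self.ranker)


outputs = ["a small house", "a house that is small"]
best_short = Combiner(passthrough_generator, length_ranker).combine(outputs)
# Swapping the ranker changes the result without touching the other stage.
best_long = Combiner(passthrough_generator, lambda h: len(h.split())).combine(outputs)
print(best_short)
print(best_long)
```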
Scaling Apertium (vitaka)
The "Highly scalable web service architecture for Apertium" project has successfully improved the scalability of the Apertium platform and provides a clear JSON REST web service API based on the Google Translation API.
The project has reached its three main objectives:
Make Apertium work in daemon mode. The null flush option has been tested in all the modules of the Apertium pipeline, some patches have been submitted, and the option has been added to Constraint Grammar. Wrapper Java classes have been written, so it is possible to start/stop an Apertium daemon and translate with it from any Java program.
Create a REST web service interface. There is a request router that, when deployed on a web server, can process HTTP translation requests, send them to the right daemon and return a JSON object with the result. The API includes the appropriate options to allow JavaScript clients to bypass the browser same-origin policy. It is compatible with the Google Translation API. Here is the API Specification: [1]
Create a highly scalable architecture. The main elements of the architecture are the request router, which processes translation requests and sends them to the right translation server, and the translation servers, which run Apertium and perform the translation. Depending on the language pairs of the requests received and the server capacities, the request router decides which daemon must run on each server, and starts and stops daemons to make the system state fit its decision. Note that a daemon can translate only one language pair. The placement algorithm is described in [2]. Additionally, the system can be configured to add translation servers when load rises. These servers can be standard machines in a local network or Amazon EC2 instances.
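The routing idea from the objectives above (one daemon per language pair, JSON result) can be sketched as follows. This is a toy, in-process illustration and not the actual router, which is a Java web application. The class names are hypothetical, the daemon is a stub that does not call Apertium, and the JSON field names are an assumption modelled on Google-style responses; the API specification in [1] is authoritative.

```python
import json


class StubDaemon:
    """Hypothetical stand-in for an Apertium daemon serving one language pair."""

    def __init__(self, pair):
        self.pair = pair

    def translate(self, text):
        # A real daemon would pipe the text through the Apertium pipeline.
        return f"<{self.pair}> {text}"


class RequestRouter:
    """Keeps at most one daemon per language pair and dispatches requests."""

    def __init__(self):
        self.daemons = {}

    def start_daemon(self, pair):
        self.daemons[pair] = StubDaemon(pair)

    def stop_daemon(self, pair):
        self.daemons.pop(pair, None)

    def handle(self, text, langpair):
        daemon = self.daemons.get(langpair)
        if daemon is None:
            body = {"responseData": None,
                    "responseDetails": f"no daemon for {langpair}",
                    "responseStatus": 400}
        else:
            body = {"responseData": {"translatedText": daemon.translate(text)},
                    "responseDetails": None,
                    "responseStatus": 200}
        return json.dumps(body)


router = RequestRouter()
router.start_daemon("nb|nn")
ok = json.loads(router.handle("hei", "nb|nn"))
missing = json.loads(router.handle("hello", "en|es"))
print(ok["responseStatus"], missing["responseStatus"])
```

In the real system the placement algorithm decides when `start_daemon` and `stop_daemon` style operations happen on each server; the sketch leaves that decision to the caller.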
Although the three main objectives have been reached, the system has some limitations. The architecture can be improved because, once a certain number of servers are running, the request router becomes a bottleneck, and adding more servers won't make the system perform more translations. Also, computing the placement algorithm on the same machine that processes the requests limits the throughput. I am studying these limitations in the paper I am writing.
I tried to follow my initial schedule. Making Apertium work in daemon mode took less time than I expected, but implementing the placement algorithm took more. In the end, I did not implement a SOAP API.
Trigram Tagger (Zaid)
This project was aimed at implementing a trigram tagger for Apertium based on second-order Hidden Markov Models, along with supervised and unsupervised methods for training the tagger. The other objective was to extend the target language tagger training method to the trigram tagger. [3]
All of the above have already been implemented. The code for the trigram tagger and the various methods to train it can be found in svn. [4] Code for the TL-based (trigram) tagger training can be found here. [5]
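To give a feel for what decoding with a second-order HMM involves, here is a minimal Viterbi sketch in which each state is a pair of tags. It is a textbook-style illustration and not the Apertium tagger code: the function name, the probability tables, the tag names and the smoothing constant are all made up for the example.

```python
import math


def viterbi_trigram(words, tags, trans, emit, start="<s>"):
    """Most likely tag sequence under a second-order (trigram) HMM.

    trans[(t2, t1, t)] = P(t | t2, t1); emit[(t, w)] = P(w | t).
    Missing entries get a tiny probability instead of zero (crude smoothing).
    """
    def logp(table, key):
        return math.log(table.get(key, 1e-12))

    # best[(u, v)] = (log-probability, tags so far) of the best path
    # whose last two tags are u and v.
    best = {(start, start): (0.0, [])}
    for word in words:
        nxt = {}
        for (u, v), (score, path) in best.items():
            for t in tags:
                s = score + logp(trans, (u, v, t)) + logp(emit, (t, word))
                if (v, t) not in nxt or s > nxt[(v, t)][0]:
                    nxt[(v, t)] = (s, path + [t])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Toy model with invented probabilities.
tags = ["DET", "NOUN", "VERB"]
trans = {("<s>", "<s>", "DET"): 0.8,
         ("<s>", "DET", "NOUN"): 0.9,
         ("DET", "NOUN", "VERB"): 0.8}
emit = {("DET", "the"): 0.9,
        ("NOUN", "dog"): 0.8,
        ("VERB", "barks"): 0.7,
        ("NOUN", "barks"): 0.1}

tagged = viterbi_trigram(["the", "dog", "barks"], tags, trans, emit)
print(tagged)
```

Because the state space is pairs of tags, the cost per word grows with the cube of the tagset size, which is the main practical difference from a bigram tagger.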
The various algorithms for the training and decoding can be found here:
The evaluation of the new trigram tagger is currently under way (as of 11 Sept 2009).
Most of the above methods are based on the work of Felipe Sánchez Martínez, whose publications can be found here. The initial Google Summer of Code proposal for this project can be found here.