Google Summer of Code/Report 2009

From Apertium
Jump to navigation Jump to search

Norwegian Nynorsk and Bokmål (Unhammer)

apertium-nn-nb is now in a fairly usable state for translating both from Nynorsk to Bokmål and from Bokmål to Nynorsk.

The bidix currently has a little over 50000 entries (46000 discounting restrictions), and the dictionaries consistent, ie. all entries translate without #, /, or @ marks... from what I can tell ;-) The bidix initially contained about 36000 entries that had exact translations and almost no restrictions. I added some entries semi-automatically by changing substrings (ie. adjectives and adverbs ending in -lig in nb typically end in -leg in nn) and checking whether they existed in the other monodix, some by running poterminology and then ReTraTos/Giza++ on the KDE4 corpus of .po files, and some by ReTraTos/Giza++ on bitextor output. The rest were more or less manually added and checked.

I've pretty much stuck to my original GsoC week plan; converting the Oslo-Bergen Constraint Grammer disambiguator to Apertium tags went fairly easily, as did adding corrections where I found things to correct (these were of course reported "upstream"). I had to write the transfer rules from scratch, but this also went quite easily with some help from various Apertiumers and of course the Apertium documentation; nn-nb uses only one-stage transfer since the languages are quite closely related. The 33 rules correctly transfer both genitive noun phrase differences (with adjectives), and passive verb differences, in addition to adjective and determiner congruence. Of course, there is a lot more that can be done here... as well as with CG.

A side-effect of my Apertium work is dix-mode.el ( ), a minor mode for editing .dix files in Emacs. In case Apertium ever gets more than one contributor using Emacs.

Results of an initial WER test are at (comparing MT output to the post-edited version); 11% with 64% free rides mostly due to names and terminology/loan words. With the mediawiki formatters under way, I have hopes that Apertium could help catch up with the Bokmål version...

Swedish to Danish (mkrist)

Conversion of Anubadok (darthxaher)

Webservice (deadbeef)

Multi-engine machine translation (snippyhollow)

Scaling Apertium (vitaka)

The Highly scalable web service architecture for Apertium has successfully improved the Apertium platform scalability and provides a clear JSON REST web service API based on Google Translation API.

The project have reached its three main objectives:

Make Apertium work in daemon mode. The null flush option has been tested in all the modules of Apertium pipeline, some patches have been submitted and it has been added to Constraint Grammar. Wrapper Java classes have been written, so it is possible to start/stop an Apertium daemon and translate with it from any Java program.

Create a REST web service interface. There is a request router that, when deployed on a web server, can process HTTP translation requests, send them to the right daemon and return a JSON object with the result. The API includes the appropriate options to allow JavaScript clients bypass browser same origin policy. It is compatible with Google Translation API. Here is the API Specification: [1]

Create a highly scalable architecture. The request router, that processes translation requests and send them to the right translation server, and translation servers, that run Apertium and perform the translation, are the main elements of the architecture. Depending on the language pairs of the requests received and the server capacities, the request router decides which daemon must run on each server, and starts and stops them to make the system state fit its decision. Note that a daemon can translate with only one language pair. The placement is described in [2]. Additionally, the system can be configured to add translation servers when load rises. These servers can be standard machines in a local network or Amazon EC2 instances.

Although the three main objective have been reached, the system have some limitations. The architecture can be improved because the request router acts as a bottleneck, and adding more servers won't make the system perform more translations. Also, computing the placement algorithm on the same machine that processes the requests limits the throughput. I am studying these limitations in the paper I am writing.

I tried to follow my initial schedule. Making Apertium work in daemon mode took me less time than I expected, but implementing the placement algorithm took me more time. I finally didn't implement a SOAP API.

Trigram tagger (zaid)

Lttoolbox Java (raah)