Difference between revisions of "Google Summer of Code/Report 2009"

From Apertium
Jump to navigation Jump to search
Line 49: Line 49:
Using Anubadok as a reference, my primary project goal was to create a functional English to Bengali translation system. The project primarily consisted of three stages.: a. create a morphological Generator+Analyzer, b. create a Bdix and c. create a transfer system.
Using Anubadok as a reference, my primary project goal was to create a functional English to Bengali translation system. The project primarily consisted of three stages.: a. create a morphological Generator+Analyzer, b. create a Bdix and c. create a transfer system.


Creating a morphological analyzer/generator took most of my time. Anubadok uses 'Penn Tagset' but it is fairly aligned with the Tagset I'm using right now. However, as Apertium requires more information per word i.e. number, gender, animacy so I had to manually tag a lot of them. Anubadoks's dix is not frequency based, therefore a lot of the tagged words might be archaic/ not frequencly used. So I treid tagging the most frequency used missing words. Right now we have a roughly 68% coverage of the 20K most freq used words. At first I was strictly forcused on creating a generator, but later focused on the analyzer too (given the highly inflectional properties of Bengali, creating a analyzer is a bit harder for all the alternate forms for a word).
Creating a morphological analyzer/generator took most of my time. Anubadok uses 'Penn Tagset' but it is fairly aligned with the Tagset I'm using right now. However, as Apertium requires more information per word i.e. number, gender, animacy so I had to manually tag a lot of them. Anubadoks's dix is not frequency based, therefore a lot of the tagged words might be archaic/ not frequency used. So I tried tagging the most frequency used missing words. Right now we have a roughly 68% coverage of the 20K most freq used words. At first I was strictly focused on creating a generator, but later focused on the analyzer too (given the highly inflectional properties of Bengali, creating a analyzer is a bit harder for all the alternate forms for a word).


The Bdix has most of the entries inherited from the Bengali monodix, but some new ones are still pending.
The Bdix has most of the entries inherited from the Bengali monodix, but some new ones are still pending.

Revision as of 05:17, 10 September 2009

Norwegian Nynorsk and Bokmål (Unhammer)

apertium-nn-nb is now in a fairly usable state for translating both from Nynorsk to Bokmål and from Bokmål to Nynorsk.

The bidix currently has a little over 50000 entries (46000 discounting restrictions), and the dictionaries consistent, ie. all entries translate without #, /, or @ marks... from what I can tell ;-) The bidix initially contained about 36000 entries that had exact translations and almost no restrictions. I added some entries semi-automatically by changing substrings (ie. adjectives and adverbs ending in -lig in nb typically end in -leg in nn) and checking whether they existed in the other monodix, some by running poterminology and then ReTraTos/Giza++ on the KDE4 corpus of .po files, and some by ReTraTos/Giza++ on bitextor output. The rest were more or less manually added and checked.

I've pretty much stuck to my original GsoC week plan; converting the Oslo-Bergen Constraint Grammer disambiguator to Apertium tags went fairly easily, as did adding corrections where I found things to correct (these were of course reported "upstream"). I had to write the transfer rules from scratch, but this also went quite easily with some help from various Apertiumers and of course the Apertium documentation; nn-nb uses only one-stage transfer since the languages are quite closely related. The 33 nb=>nn rules correctly transfer both genitive noun phrase differences (with adjectives), and passive verb differences, in addition to adjective and determiner congruence. Of course, there is a lot more that can be done here... as well as with CG.

A side-effect of my Apertium work is dix-mode.el ( http://wiki.apertium.org/wiki/Emacs ), a minor mode for editing .dix files in Emacs. In case Apertium ever gets more than one contributor using Emacs.

Results of an initial WER test are at http://wiki.apertium.org/wiki/Norsk#WER-test_28.2F8_2009 (comparing MT output to the post-edited version); 11% with 64% free rides mostly due to names and terminology/loan words. With the mediawiki formatters under way, I have hopes that Apertium could help nn.wikipedia.org catch up with the Bokmål version...

Swedish to Danish (mkrist)

Conversion of Anubadok (darthxaher) [draft]

Using Anubadok as a reference, my primary project goal was to create a functional English to Bengali translation system. The project primarily consisted of three stages.: a. create a morphological Generator+Analyzer, b. create a Bdix and c. create a transfer system.

Creating a morphological analyzer/generator took most of my time. Anubadok uses 'Penn Tagset' but it is fairly aligned with the Tagset I'm using right now. However, as Apertium requires more information per word i.e. number, gender, animacy so I had to manually tag a lot of them. Anubadoks's dix is not frequency based, therefore a lot of the tagged words might be archaic/ not frequency used. So I tried tagging the most frequency used missing words. Right now we have a roughly 68% coverage of the 20K most freq used words. At first I was strictly focused on creating a generator, but later focused on the analyzer too (given the highly inflectional properties of Bengali, creating a analyzer is a bit harder for all the alternate forms for a word).

The Bdix has most of the entries inherited from the Bengali monodix, but some new ones are still pending.

The Transfer System is in very simple stage right now and will need a lot of tweaking.

Statistics

Webservice (deadbeef)

Multi-engine machine translation (snippyhollow)

Scaling Apertium (vitaka)

The Highly scalable web service architecture for Apertium has successfully improved the Apertium platform scalability and provides a clear JSON REST web service API based on Google Translation API.

The project have reached its three main objectives:

Make Apertium work in daemon mode. The null flush option has been tested in all the modules of Apertium pipeline, some patches have been submitted and it has been added to Constraint Grammar. Wrapper Java classes have been written, so it is possible to start/stop an Apertium daemon and translate with it from any Java program.

Create a REST web service interface. There is a request router that, when deployed on a web server, can process HTTP translation requests, send them to the right daemon and return a JSON object with the result. The API includes the appropriate options to allow JavaScript clients bypass browser same origin policy. It is compatible with Google Translation API. Here is the API Specification: [1]

Create a highly scalable architecture. The request router, that processes translation requests and send them to the right translation server, and translation servers, that run Apertium and perform the translation, are the main elements of the architecture. Depending on the language pairs of the requests received and the server capacities, the request router decides which daemon must run on each server, and starts and stops them to make the system state fit its decision. Note that a daemon can translate with only one language pair. The placement algorithm is described in [2]. Additionally, the system can be configured to add translation servers when load rises. These servers can be standard machines in a local network or Amazon EC2 instances.


Although the three main objectives have been reached, the system have some limitations. The architecture can be improved because, when a certain number of servers are running, the request router acts as a bottleneck, and adding more servers won't make the system perform more translations. Also, computing the placement algorithm on the same machine that processes the requests limits the throughput. I am studying these limitations in the paper I am writing.


I tried to follow my initial schedule. Making Apertium work in daemon mode took me less time than I expected, but implementing the placement algorithm took me more time. I finally didn't implement a SOAP API.

Trigram tagger (zaid)

Lttoolbox Java (raah)