Difference between revisions of "Google Summer of Code/Report 2009"

Revision as of 14:28, 6 September 2009

Norwegian Nynorsk and Bokmål (Unhammer)

apertium-nn-nb is now in a fairly usable state for translating both from Nynorsk to Bokmål and from Bokmål to Nynorsk.

The bidix currently has a little over 50000 entries (46000 discounting restrictions), and the dictionaries are consistent, i.e. all entries translate without #, /, or @ marks... from what I can tell ;-) The bidix initially contained about 36000 entries that had exact translations and almost no restrictions. I added some entries semi-automatically by changing substrings (e.g. adjectives and adverbs ending in -lig in nb typically end in -leg in nn) and checking whether they existed in the other monodix, some by running poterminology and then ReTraTos/Giza++ on the KDE4 corpus of .po files, and some by running ReTraTos/Giza++ on bitextor output. The rest were more or less manually added and checked.
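The substring approach can be sketched roughly as follows. This is only an illustration of the idea, not the script actually used: the function name `propose_pairs`, its parameters, and the tiny word lists are all made up for the example, and real input would come from the lemmas of the two monodixes.

```python
def propose_pairs(nb_lemmas, nn_lemmas, nb_suffix="lig", nn_suffix="leg"):
    """Return (nb, nn) candidate pairs for the bidix.

    A pair is proposed only if swapping the nb suffix for the nn suffix
    gives a lemma that already exists on the nn side.
    """
    nn_known = set(nn_lemmas)
    pairs = []
    for nb in nb_lemmas:
        if nb.endswith(nb_suffix):
            candidate = nb[: -len(nb_suffix)] + nn_suffix
            if candidate in nn_known:
                pairs.append((nb, candidate))
    return pairs


# Toy data standing in for lemmas extracted from the two monodixes (assumption).
nb_adjectives = ["vanlig", "farlig", "stor", "hyggelig"]
nn_adjectives = ["vanleg", "farleg", "stor", "triveleg"]

for nb, nn in propose_pairs(nb_adjectives, nn_adjectives):
    print(nb, "->", nn)
```

Candidates whose rewritten form is missing from the other monodix (here "hyggelig") are simply skipped, so they still have to be added by another route.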

I've pretty much stuck to my original GSoC week plan; converting the Oslo-Bergen Constraint Grammar disambiguator to Apertium tags went fairly easily, as did adding corrections where I found things to correct (these were of course reported "upstream"). I had to write the transfer rules from scratch, but this also went quite easily with some help from various Apertiumers and of course the Apertium documentation; nn-nb uses only one-stage transfer since the languages are quite closely related. The 33 rules correctly transfer both genitive noun phrase differences (with adjectives) and passive verb differences, in addition to adjective and determiner congruence. Of course, there is a lot more that can be done here... as well as with CG.

A side-effect of my Apertium work is dix-mode.el ( http://wiki.apertium.org/wiki/Emacs ), a minor mode for editing .dix files in Emacs. In case Apertium ever gets more than one contributor using Emacs.

Results of an initial WER test are at http://wiki.apertium.org/wiki/Norsk#WER-test_28.2F8_2009 (comparing MT output to the post-edited version): a WER of 11%, with 64% free rides, mostly due to names and terminology/loan words. With the mediawiki formatters under way, I have hopes that Apertium could help nn.wikipedia.org catch up with the Bokmål version...
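For readers unfamiliar with the metric, here is a minimal sketch of a word error rate calculation: the word-level edit distance between the MT output and the post-edited text, divided by the length of the post-edited text. It is not the evaluation script used for the test above; the function name `word_error_rate`, the whitespace tokenisation, and the example sentences are assumptions made for illustration.

```python
def word_error_rate(mt_output, post_edited):
    """Word-level Levenshtein distance divided by the reference length."""
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        raise ValueError("the post-edited reference must not be empty")

    # prev[j] is the edit distance between the hyp words seen so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # extra word in the MT output
                            curr[j - 1] + 1,      # word missing from the MT output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four in the reference gives a WER of 0.25.
print(word_error_rate("ho likar ikke boka", "ho likar ikkje boka"))
```

The free-rides figure is a separate measurement and is not computed by this sketch.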

Swedish to Danish (mkrist)

Conversion of Anubadok (darthxaher)

Webservice (deadbeef)

Multi-engine machine translation (snippyhollow)

Scaling Apertium (vitaka)

Trigram tagger (zaid)

Lttoolbox Java (raah)