User:Darshak/GSoC 2014 Report

Description

This project aimed to improve the quality of English-to-Esperanto translation. The start was a bit rough, but I caught up. Some parts still need work, but the overall translation has improved.

Supervised Tagger Training

The English corpora available on the SVN repo were used to train the tagger. One example of how this improved the translation:

Before

$ echo 'Darkness falls across the land.' | apertium -d . en-eo
Mallumaj faloj trans la lando.

After

$ echo 'Darkness falls across the land.' | apertium -d . en-eo
Mallumo falas trans la lando.

Vocabulary

Thanks to tagger training, many missing multiwords were identified and subsequently added. A number of proper names were also added, in particular:

  • 922 male given names
  • 933 female given names
  • 2000+ surnames
  • and the names of a few companies and products, most of which were likely to be mistranslated due to being dictionary words

Constraint Grammar

I used the English CG rules from the English-Kazakh language pair as a base. It contained 124 rules, 16 of which I modified. 15 new rules were added and 1 was removed (commented out, to be precise).

Accuracy was measured on a part of the corpus used for tagger training. On this corpus it increased from 73.67% to 76.19%.

Structural Transfer

Structural transfer rules were added for almost all possible date formats used in English. Moreover, some rules were added for cases where correctly tagging a token would not suffice.

Results

Coverage

With a Wikipedia corpus of 49,759,540 words, the coverage with the previous (trunk) version of the translator was 91.7% (4,123,893 unknown words). With the new version it improved to 92.0% (3,958,640 unknown words).
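The coverage figures follow directly from the word counts above. A minimal sketch of the arithmetic, assuming coverage is simply the share of corpus tokens that are not unknown (the function name `coverage` is mine, not part of any Apertium tool):

```python
def coverage(total_words: int, unknown_words: int) -> float:
    """Return coverage as the percentage of tokens that were recognised."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * (total_words - unknown_words) / total_words


# Counts taken from the paragraph above.
TOTAL = 49_759_540
trunk = coverage(TOTAL, 4_123_893)
new = coverage(TOTAL, 3_958_640)

print(f"trunk: {trunk:.1f}%  new: {new:.1f}%")
print(f"newly covered words: {4_123_893 - 3_958_640:,}")
```

The gain looks small in percentage terms, but on a corpus of this size it corresponds to 165,253 fewer unknown words.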

Translation quality

To measure translation quality, I took Wikipedia's featured articles from the last two days (August 18 and 19, 2014) as a random sample of Wikipedia texts, which are the translator's main target ( https://en.wikipedia.org/wiki/Episode_2_%28Twin_Peaks%29 and https://en.wikipedia.org/wiki/Leslie_Groves ). The first text has 267 words, the second one 501.

The word error rate (WER) for the first text was, in the initial version, 29.59%, and in the final version 27.72%.
The position-independent word error rate (PER) was, respectively, 25.47% and 22.47%.

The word error rate (WER) for the second text was, in the initial version, 38.52%, and in the final version 34.93%.
The position-independent word error rate (PER) was, respectively, 30.14% and 27.15%.
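For readers unfamiliar with the two metrics, here is a rough sketch of how WER and PER can be computed. It assumes whitespace tokenisation and one common definition of PER (bag-of-words matches, normalised by reference length); the tool actually used for the figures above may tokenise or normalise differently, so this is illustrative only. The function names are my own.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(reference: str, hypothesis: str) -> float:
    """Position-independent error rate in percent (word order is ignored)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # Multiset intersection counts words shared by both sides.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


# Toy example using the tagger sentences from this report as
# reference (after) and hypothesis (before).
reference = "Mallumo falas trans la lando."
hypothesis = "Mallumaj faloj trans la lando."
print(f"WER {wer(reference, hypothesis):.1f}%  PER {per(reference, hypothesis):.1f}%")
```

Because PER ignores word order, it is never higher than WER for the same pair of texts, which matches the pattern in the figures above.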

So translation quality seems to have improved by roughly 2 to 4 percentage points.
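As a quick check of that range, the differences can be worked out from the four pairs of figures reported above. The labels in this sketch are mine and only follow the order in which the figures appear:

```python
# (initial, final) error rates in percent, in the order reported above.
scores = {
    "WER, first pair": (29.59, 27.72),
    "PER, first pair": (25.47, 22.47),
    "WER, second pair": (38.52, 34.93),
    "PER, second pair": (30.14, 27.15),
}

# Improvement in percentage points, rounded to two decimals.
gains = {label: round(initial - final, 2)
         for label, (initial, final) in scores.items()}

for label, gain in gains.items():
    print(f"{label}: {gain:.2f} points lower")

print(f"range: {min(gains.values()):.2f} to {max(gains.values()):.2f} points")
```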

Future Work

Although there has been improvement, there's still quite some work to do before we can call it state-of-the-art.

  • Better interchunk rules, because many of the problematic sentences at English_and_Esperanto/Outstanding_tests can only be solved that way.
  • While the new proper names have significantly expanded the dix, they have also introduced some ambiguities. These need to be resolved.

Thanks

Jacob and Hector guided me through thick and thin during the trimester. Many others from the Apertium community also helped me whenever I was confused about something. I sincerely thank all of them.