User:Darshak/GSoC 2014 Report


Description

This project aimed to enhance the quality of English to Esperanto translation. The start was a bit rough but I caught up. There still remain some parts which need to be worked on, but the overall translation has improved.

Supervised Tagger Training

The English corpora available on the SVN repo were used to train the tagger. One example of how this improved the translation:

Before

$ echo 'Darkness falls across the land.' | apertium -d . en-eo
Mallumaj faloj trans la lando.

After

$ echo 'Darkness falls across the land.' | apertium -d . en-eo
Mallumo falas trans la lando.

Vocabulary

Thanks to tagger training, a lot of missing multiwords were identified and subsequently added. Moreover, a number of proper names were also added. In particular,

  • 922 male given names
  • 933 female given names
  • 2000+ surnames
  • and the names of a few companies and products, most of which were likely to be mistranslated due to being dictionary words

Constraint Grammar

I used the English CG rules from the English-Kazakh language pair as a base. That file contained 124 rules, 16 of which I modified. I added 15 new rules and removed 1 (or rather, commented it out).

Accuracy was measured on part of the corpus used for tagger training. On this corpus it increased from 73.67% to 76.19%.
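
As a rough illustration of how an accuracy figure of this kind can be computed, the sketch below compares tagger output against a hand-tagged reference, token by token. The function name, the flat list-of-tags input and the example tags are all assumptions made for this illustration; this is not the evaluation script used for the report.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Return the fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must be aligned")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Hypothetical tags for "Darkness falls across the land": here the tagger
# reads "falls" as a plural noun instead of a verb.
gold = ["n.sg", "vblex.pres.p3.sg", "pr", "det.def", "n.sg"]
predicted = ["n.sg", "n.pl", "pr", "det.def", "n.sg"]
print(f"{tagging_accuracy(gold, predicted):.2%}")
```

With one wrong tag out of five tokens, the example reports 80% accuracy.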

Structural Transfer

Structural transfer rules were added for almost all possible date formats used in English. Moreover, some rules were added for cases where correctly tagging a token would not suffice.

Results

Coverage

With a Wikipedia corpus of 49,759,540 words, the coverage with the previous (trunk) version of the translator was 91.7% (4,123,893 unknown words). With the new version it improved to 92.0% (3,958,640 unknown words).
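
The coverage percentages follow directly from the word counts quoted above: coverage is the share of corpus words that the translator recognises. A minimal sketch in Python, with helper names made up for this example; `count_unknown` relies on Apertium's default behaviour of prefixing unknown words with `*` in its output, and is only a naive whitespace-based count:

```python
def coverage(total_words, unknown_words):
    """Fraction of corpus words that were recognised."""
    return 1 - unknown_words / total_words


def count_unknown(translated_text):
    """Naively count tokens that carry the '*' unknown-word mark."""
    return sum(1 for tok in translated_text.split() if tok.startswith("*"))


TOTAL = 49_759_540  # corpus size reported above
print(f"trunk: {coverage(TOTAL, 4_123_893):.1%}")  # trunk: 91.7%
print(f"new:   {coverage(TOTAL, 3_958_640):.1%}")  # new:   92.0%
```

In absolute terms the new version leaves 165,253 fewer words unknown (4,123,893 minus 3,958,640).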

Translation quality

To measure translation quality, I took Wikipedia's featured articles from the last two days (18 and 19 August 2014) as a random sample of Wikipedia text, which is the translator's main target ( https://en.wikipedia.org/wiki/Episode_2_%28Twin_Peaks%29 and https://en.wikipedia.org/wiki/Leslie_Groves ). The first text has 267 words, the second one 501.

The word error rate (WER) for the first text was 29.59% in the initial version and 27.72% in the final version.
The position-independent word error rate (PER) was 25.47% and 22.47%, respectively.

The word error rate (WER) for the second text was 38.52% in the initial version and 34.93% in the final version.
The position-independent word error rate (PER) was 30.14% and 27.15%, respectively.

So translation quality appears to have improved by roughly 2 to 4 percentage points.
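
For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and PER are commonly defined. It assumes whitespace tokenisation and a single reference translation, and the function names are chosen for this example; it is not the evaluation tool that produced the figures above. WER is the word-level edit distance divided by the reference length, while PER ignores word order and only compares the two bags of words.

```python
from collections import Counter


def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop a reference word
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate, computed on bags of words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# The corrected sentence from the tagger example serves as a stand-in reference.
reference = "Mallumo falas trans la lando."
hypothesis = "Mallumaj faloj trans la lando."
print(wer(reference, hypothesis), per(reference, hypothesis))
```

Two of the five words differ, so both metrics come out at 0.4 for this pair. The metrics diverge once word order is involved: a hypothesis containing the right words in the wrong order has a PER of 0 but a non-zero WER.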

Future Work

Although there has been improvement, there's still quite some work to do before we can call it state-of-the-art.

  • Better interchunk rules, because many of the problematic sentences at English_and_Esperanto/Outstanding_tests can only be solved that way.
  • While the new proper names have significantly expanded the dix, they have also introduced some ambiguities. These need to be resolved.

Thanks

Jacob and Hector have guided me through the thick and thin of the trimester. Many others from the Apertium community also helped me whenever I was confused about something. I sincerely thank all of them.
