User:Darshak/GSoC 2014 Report

Description

This project aimed to enhance the quality of English to Esperanto translation. The start was a bit rough but I caught up. There still remain some parts which need to be worked on, but the overall translation has improved.

Supervised Tagger Training

The English corpora available on the SVN repo were used to train the tagger. One example of how this improved the translation:

Before

$ echo 'Darkness falls across the land.' | apertium -d . en-eo
Mallumaj faloj trans la lando.

After

$ echo 'Darkness falls across the land.' | apertium -d . en-eo
Mallumo falas trans la lando.

Vocabulary

Thanks to tagger training, a lot of missing multiwords were identified and subsequently added. A number of proper names were also added, in particular:

922 male given names
933 female given names
2000+ surnames
and the names of a few companies and products, most of which were likely to be mistranslated due to being dictionary words

Constraint Grammar

I used the English CG rules from the English-Kazakh language pair as a base. It contained 124 rules, 16 of which I modified. 15 new rules were added and 1 was removed (more precisely, commented out).

The accuracy was calculated on a part of the corpus used for tagger training. The accuracy on this corpus increased from 73.67% to 76.19%.

Structural Transfer

Structural transfer rules were added for almost all possible date formats used in English. Moreover, some rules were added for cases where correctly tagging a token would not suffice.

Results

Coverage

With a Wikipedia corpus of 49,759,540 words, the coverage with the previous (trunk) version of the translator was 91.7% (4,123,893 unknown words). With the new version it improved to 92.0% (3,958,640 unknown words).
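
The coverage figures follow directly from the word counts quoted above. As a minimal sketch (the function name `coverage` is only illustrative, and this is not the script that was actually used for the report), the arithmetic can be checked like this:

```python
def coverage(total_words: int, unknown_words: int) -> float:
    """Return the percentage of corpus words the translator recognised."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * (total_words - unknown_words) / total_words


# Word counts taken from the report.
TOTAL = 49_759_540
before = coverage(TOTAL, 4_123_893)  # previous (trunk) version
after = coverage(TOTAL, 3_958_640)   # new version

print(f"before: {before:.1f}%  after: {after:.1f}%")
# prints: before: 91.7%  after: 92.0%
```

In absolute terms, the new version leaves 165,253 fewer unknown words in this corpus.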

Translation quality

To measure translation quality, I took Wikipedia's featured articles from the last two days (August 18 and 19, 2014) as a random sample of Wikipedia texts, which are the translator's main target ( https://en.wikipedia.org/wiki/Episode_2_%28Twin_Peaks%29 and https://en.wikipedia.org/wiki/Leslie_Groves ). The first text has 267 words, the second one 501.

The word error rate (WER) for the first text was, in the initial version, 29.59%, and in the final version 27.72%.
The position-independent word error rate (PER) was, respectively, 25.47% and 22.47%.

The word error rate (WER) for the second text was, in the initial version, 38.52%, and in the final version 34.93%.
The position-independent word error rate (PER) was, respectively, 30.14% and 27.15%.

So translation quality seems to have improved by roughly 2 to 4 percentage points.
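
For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and PER can be computed. The report does not say which tool produced the figures above, so this is not that tool. It uses a word-level Levenshtein distance for WER and one common bag-of-words formulation for PER. The function names are illustrative, and treating the "After" output from the tagger section as the reference translation is an assumption made only for the sake of the example.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (word order is ignored)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


# Assumption: the "After" sentence stands in for a human reference.
reference = "Mallumo falas trans la lando.".split()
hypothesis = "Mallumaj faloj trans la lando.".split()
print(f"WER {wer(reference, hypothesis):.1f}%  PER {per(reference, hypothesis):.1f}%")
# prints: WER 40.0%  PER 40.0%

# A pure reordering is penalised by WER but not by PER.
reordered = "trans la lando. Mallumo falas".split()
print(f"WER {wer(reference, reordered):.1f}%  PER {per(reference, reordered):.1f}%")
# prints: WER 80.0%  PER 0.0%
```

Because PER ignores word order, it is never higher than WER for the same pair of texts, which matches the pattern in the figures reported above.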

Future Work

Although there has been improvement, there's still quite some work to do before we can call it state-of-the-art.

Better interchunk rules, because a lot of problematic sentences at English_and_Esperanto/Outstanding_tests can be solved only by that.
While the new proper names have significantly expanded the dix, they have brought with them some ambiguities. These need to be solved.

Thanks

Jacob and Hector guided me through thick and thin over the trimester. Many others from the Apertium community also helped me when I was confused about something. I sincerely thank all of them.
