User:Darshak/GSoC 2014 Report
Latest revision as of 18:11, 21 August 2014
Description
This project aimed to enhance the quality of English to Esperanto translation. The start was a bit rough but I caught up. There still remain some parts which need to be worked on, but the overall translation has improved.
Supervised Tagger Training
The English corpora available on the SVN repo were used to train the tagger. One example of how this improved the translation:
Before
$ echo 'Darkness falls across the land.' | apertium -d . en-eo
Mallumaj faloj trans la lando.
After
$ echo 'Darkness falls across the land.' | apertium -d . en-eo
Mallumo falas trans la lando.
Vocabulary
Thanks to tagger training, a lot of missing multiwords were identified and subsequently added. Moreover, a number of proper names were also added. In particular,
- 922 male given names
- 933 female given names
- 2000+ surnames
- and the names of a few companies and products, most of which were likely to be mistranslated due to being dictionary words
Constraint Grammar
I used the English CG rules from the English-Kazakh language pair as a base. That file contained 124 rules, 16 of which I modified. 15 new rules were added and 1 was removed (more precisely, commented out).
The accuracy was calculated on a part of the corpus used for tagger training. The accuracy on this corpus increased from 73.67% to 76.19%.
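The report does not say exactly how the accuracy figure was computed. A minimal sketch, assuming it is the share of tokens whose chosen tag matches a hand-tagged reference, could look like this. The function name and the toy tag sequences are hypothetical and only illustrate the arithmetic.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Percentage of tokens whose predicted tag equals the reference tag.

    Assumption: both sequences are aligned token by token.
    """
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("tag sequences must not be empty")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)


# Toy data for 'Darkness falls across the land' (hypothetical tags):
# the second token is a verb but is mis-tagged as a noun.
gold = ["n", "vblex", "pr", "det", "n"]
predicted = ["n", "n", "pr", "det", "n"]

accuracy = tagging_accuracy(gold, predicted)
print(f"accuracy: {accuracy:.2f}%")
```

Under this reading, the change from 73.67% to 76.19% is a gain of about 2.5 percentage points on the evaluated part of the corpus.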
Structural Transfer
Structural transfer rules were added for almost all possible date formats used in English. Moreover, some rules were added for cases where correctly tagging a token would not suffice.
Results
Coverage
With a Wikipedia corpus of 49,759,540 words, the coverage with the previous (trunk) version of the translator was 91.7% (4,123,893 unknown words). With the new version it improved to 92.0% (3,958,640 unknown words).
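The coverage percentages follow directly from the word counts above. The short sketch below reproduces the arithmetic, assuming coverage means the share of corpus words that are not reported as unknown. The function name is hypothetical.

```python
def coverage(total_words, unknown_words):
    """Percentage of corpus words that the translator recognises."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * (total_words - unknown_words) / total_words


TOTAL_WORDS = 49_759_540      # size of the Wikipedia corpus
UNKNOWN_BEFORE = 4_123_893    # unknown words with the trunk version
UNKNOWN_AFTER = 3_958_640     # unknown words with the new version

before = coverage(TOTAL_WORDS, UNKNOWN_BEFORE)
after = coverage(TOTAL_WORDS, UNKNOWN_AFTER)
newly_covered = UNKNOWN_BEFORE - UNKNOWN_AFTER

print(f"before: {before:.1f}%  after: {after:.1f}%")
print(f"newly covered words: {newly_covered:,}")
```

The gain of 0.3 percentage points corresponds to 165,253 fewer unknown words in this corpus.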
Translation quality
To measure translation quality I took Wikipedia's featured articles of the last two days (18 and 19 August 2014) as a random sample of Wikipedia texts, which are the translator's main target ( https://en.wikipedia.org/wiki/Episode_2_%28Twin_Peaks%29 and https://en.wikipedia.org/wiki/Leslie_Groves ). The first text has 267 words, the second one 501.
The word error rate (WER) for the first text was, in the initial version, 29.59%, and in the final version 27.72%.
The position-independent word error rate (PER) was, respectively, 25.47% and 22.47%.
The word error rate (WER) for the second text was, in the initial version, 38.52%, and in the final version 34.93%.
The position-independent word error rate (PER) was, respectively, 30.14% and 27.15%.
So translation quality seems to have improved by between 2 and 4 percentage points.
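For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and PER are commonly defined. It is not the evaluation tool used for the figures above, and the PER formula shown is one common bag-of-words formulation, chosen here as an assumption. The toy sentences reuse the tagger example from this report, with the corrected output treated as the reference.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (word order is ignored)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


reference = "Mallumo falas trans la lando .".split()
hypothesis = "Mallumaj faloj trans la lando .".split()
reordered = "falas Mallumo trans la lando .".split()

print(f"WER: {wer(reference, hypothesis):.2f}%  PER: {per(reference, hypothesis):.2f}%")
print(f"reordered WER: {wer(reference, reordered):.2f}%  PER: {per(reference, reordered):.2f}%")
```

Because PER ignores word order, it can never exceed WER under this formulation, which matches the pattern in the reported figures.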
Future Work
Although there has been improvement, there's still quite some work to do before we can call it state-of-the-art.
- Better interchunk rules are needed, because many of the problematic sentences at English_and_Esperanto/Outstanding_tests can only be fixed at that level.
- While the new proper names have significantly expanded the dix, they have also introduced some ambiguities. These need to be resolved.
Thanks
Jacob and Hector have guided me through thick and thin over the trimester. Many others from the Apertium community also helped me whenever I was confused about something. I sincerely thank all of them.