Difference between revisions of "User:Darshak/GSoC 2014 Report"

From Apertium

Revision as of 19:36, 18 August 2014

Description

This project aimed to improve the quality of English-to-Esperanto translation. The start was a bit rough, but I caught up. Some parts still need work, but the overall translation quality has improved.

Supervised Tagger Training

The English corpora available on the SVN repo were used to train the tagger. One example of how this improved the translation:

Before

$ echo 'Darkness falls across the land.' | apertium -d . en-eo
Mallumaj faloj trans la lando.

After

$ echo 'Darkness falls across the land.' | apertium -d . en-eo
Mallumo falas trans la lando.

Vocabulary

Tagger training exposed many missing multiwords, which were subsequently added. A number of proper names were also added, in particular:

  • 922 male given names
  • 933 female given names
  • 2000+ surnames
  • and the names of a few companies and products, most of which were likely to be mistranslated due to being dictionary words

Constraint Grammar

I used the English CG rules from the English-Kazakh language pair as a base. That rule set contained 124 rules, of which I modified 16 and removed 1. I also added 15 new rules.
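
For readers unfamiliar with Constraint Grammar, the idea behind a SELECT-style rule can be sketched in a few lines of Python. This is only a toy illustration: it is not CG-3 syntax, it is not one of the rules from this language pair, and the function name, tag names and rule condition are all assumptions made for the example. The sentence from the tagger section is reused purely as sample data.

```python
# Toy illustration of constraint-grammar-style disambiguation.
# NOT CG-3 syntax and NOT a rule from the en-eo pair: the tags ("n", "vblex",
# "pr") and the rule itself are hypothetical and chosen for illustration only.

def select_verb_after_noun(cohorts):
    """SELECT-style rule: if the previous token is unambiguously a noun and
    the current token has a verb reading among several, keep only the verb."""
    result = []
    for i, (form, readings) in enumerate(cohorts):
        prev_readings = result[i - 1][1] if i > 0 else None
        if prev_readings == ["n"] and "vblex" in readings and len(readings) > 1:
            readings = ["vblex"]
        result.append((form, list(readings)))
    return result


# Each token carries the list of readings still considered possible.
sentence = [
    ("Darkness", ["n"]),
    ("falls", ["n", "vblex"]),  # ambiguous between noun and verb
    ("across", ["pr"]),
]

disambiguated = select_verb_after_noun(sentence)
print(disambiguated)
```

A real CG rule set works on the analyser's full morphological readings and applies many such rules in sequence; the sketch only shows the core idea of discarding readings based on context.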

Structural Transfer

Structural transfer rules were added for almost all date formats used in English. Some further rules were added for cases where correctly tagging a token would not, on its own, suffice.
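
To give a rough idea of what handling several date formats involves, here is a minimal Python sketch that recognises two common English date layouts and reorders them into the usual Esperanto pattern ("la 18-a de aŭgusto 2014"). It is an assumption-laden illustration only: Apertium's structural transfer rules are written in its own XML format and operate on tagged lexical units, not on raw strings, and the function name `reorder_date` and the regular expressions are hypothetical.

```python
import re

# Hypothetical sketch: NOT how Apertium transfer rules are written.
# It only illustrates that different English layouts map to one target order.
MONTHS_EO = {
    "january": "januaro", "february": "februaro", "march": "marto",
    "april": "aprilo", "may": "majo", "june": "junio",
    "july": "julio", "august": "aŭgusto", "september": "septembro",
    "october": "oktobro", "november": "novembro", "december": "decembro",
}

MONTH = "|".join(MONTHS_EO)
PATTERNS = [
    # "August 18, 2014" or "August 18th 2014"
    re.compile(
        rf"(?P<month>{MONTH})\s+(?P<day>\d{{1,2}})(?:st|nd|rd|th)?,?\s+(?P<year>\d{{4}})",
        re.IGNORECASE,
    ),
    # "18 August 2014" or "18th of August 2014"
    re.compile(
        rf"(?P<day>\d{{1,2}})(?:st|nd|rd|th)?\s+(?:of\s+)?(?P<month>{MONTH})\s+(?P<year>\d{{4}})",
        re.IGNORECASE,
    ),
]


def reorder_date(text):
    """Return the date in day-month-year Esperanto order, or None if the
    text matches neither of the two layouts handled here."""
    for pattern in PATTERNS:
        match = pattern.fullmatch(text.strip())
        if match:
            month = MONTHS_EO[match.group("month").lower()]
            day = int(match.group("day"))
            return f"la {day}-a de {month} {match.group('year')}"
    return None


print(reorder_date("August 18, 2014"))
print(reorder_date("18th of August 2014"))
```

Both calls produce the same target string, which is the point: each source layout needs its own pattern, while the output order is fixed.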