Difference between revisions of "Ideas for Google Summer of Code/Make a language pair state-of-the-art"

{{TOCD}}
Take a released language pair and drastically improve its performance, both in coverage and in translation quality. This will involve working with dictionaries, transfer rules, scripting, and corpora. The objective is to make an Apertium language pair state-of-the-art, or close to state-of-the-art, in terms of translation quality. Concretely, this means improving coverage to 95-98% on a range of corpora and decreasing word error rate by 30-50% in relative terms. For example, if the current word error rate is 30%, it should be reduced to roughly 15-20%.
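The two targets above can be sketched in a few lines of Python. This is only an illustration, not Apertium's evaluation tooling: the function names are hypothetical, and the coverage estimate assumes that unknown words are marked with a leading <code>*</code> in the translator's output, which is a naive proxy for coverage measured with the morphological analyser.

```python
def coverage(translated_text: str) -> float:
    """Naive coverage: share of whitespace-separated tokens that are
    not flagged as unknown (assumed here to start with '*')."""
    tokens = translated_text.split()
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if not token.startswith("*"))
    return known / len(tokens)


def target_wer_range(current_wer: float,
                     low_cut: float = 0.30,
                     high_cut: float = 0.50) -> tuple[float, float]:
    """WER range left after a relative reduction between low_cut and high_cut."""
    best = current_wer * (1 - high_cut)   # largest reduction, lowest WER
    worst = current_wer * (1 - low_cut)   # smallest reduction, highest WER
    return best, worst


if __name__ == "__main__":
    print(f"coverage: {coverage('the *foo cat sat'):.2%}")
    best, worst = target_wer_range(0.30)
    print(f"target WER between {best:.2%} and {worst:.2%}")
```

With a starting word error rate of 30%, a 50% relative cut gives 15% and a 30% relative cut gives 21%, which matches the example figures above to within rounding.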
==Coding challenge==
 
* Find a language pair of your choice in Apertium and install it (see [[Minimal installation from SVN]]).
* Translate 2,000 words of text (e.g. four articles of 500 words each) using Apertium.
* Post-edit the translated text to make a reference translation.
* Use two of the articles (the input text and its post-edited translation) to improve the translator.
** Add all the missing words, and cover all the structures with transfer rules.
* [[Evaluation]]: calculate the improvement that you were able to make on these two articles, and on your two held-out articles.
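
The evaluation step can be sketched as follows. This is a minimal, self-contained word error rate calculation (word-level edit distance divided by the reference length), not the project's own tooling described on the [[Evaluation]] page; the function names and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance between the post-edited reference
    and the machine translation, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(wer_before: float, wer_after: float) -> float:
    """Relative WER reduction, e.g. 0.30 -> 0.15 is an improvement of 0.5."""
    if wer_before == 0:
        raise ValueError("wer_before must be greater than zero")
    return (wer_before - wer_after) / wer_before


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    before = word_error_rate(reference, "the cat sit on mat")
    after = word_error_rate(reference, "the cat sat on mat")
    print(f"WER before: {before:.2%}, after: {after:.2%}")
    print(f"relative improvement: {relative_improvement(before, after):.2%}")
```

Running the same calculation separately on the two articles used for development and on the two held-out articles shows how much of the improvement carries over to unseen text.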
   
 
==Frequently asked questions==
 
; What if my pair is composed of two popular languages, for instance two official languages of the EU? : Then this task will be hard. Pairs with huge parallel corpora, such as those among the 24 official EU languages or the 3 EU working languages, tend to have very good statistical machine translation systems already. Your work will be valuable, and accepted, only if you manage to reach comparable quality. For instance, if Europarl+Moses gets a [[WER]] of 15-20%, we'd be happy with 25%.
; What happens if I don't reach the expected results? : It's no big deal! GSoC is a scholarship, not a service contract. If you don't deliver what was agreed or expected, you'll fail the [http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2014/help_page#9._How_do_evaluations_work final or midterm evaluation] and [http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2014/help_page#1._How_do_payments_work lose the corresponding stipend installment(s)]. At least you tried! And, hopefully, you learnt a lot in the process.
More: ''[[contact|ask us]] something!'' :)
 
==See also==
   
* An example work plan for a language pair: [[Maltese_and_Arabic/Work_plan]]
   
 
[[Category:Ideas for Google Summer of Code|Make a language pair state-of-the-art]]
 

Revision as of 19:33, 15 March 2017
