Ideas for Google Summer of Code/Make a language pair state-of-the-art
Take a released language pair and drastically improve its performance, both in coverage and in translation quality. This will involve working with dictionaries, transfer rules, scripting and corpora. The objective is to make an Apertium language pair state-of-the-art, or close to state-of-the-art, in terms of translation quality. This means improving coverage to 95-98% on a range of corpora and decreasing word error rate (WER) by 30-50% relative to its current value. For example, if the current word error rate is 30%, then it should be reduced to 15-21%.
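As a rough illustration of the two targets above, here is a minimal sketch in Python. It assumes that unknown words appear in the output with a leading asterisk (`*`) and that tokens can be split on whitespace; the function names `naive_coverage` and `target_wer_range` are made up for this example and are not part of any Apertium tool.

```python
def naive_coverage(translated_text: str) -> float:
    """Fraction of whitespace-separated tokens not marked as unknown.

    Assumption: an unknown word is any token starting with '*'.
    """
    tokens = translated_text.split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    unknown = sum(1 for token in tokens if token.startswith("*"))
    return 1 - unknown / len(tokens)


def target_wer_range(current_wer: float) -> tuple[float, float]:
    """WER after a 50% and after a 30% relative reduction (best, worst)."""
    return current_wer * (1 - 0.50), current_wer * (1 - 0.30)


if __name__ == "__main__":
    sample = "the *gato sat on the mat"
    print(f"coverage: {naive_coverage(sample):.1%}")
    best, worst = target_wer_range(0.30)
    print(f"target WER: {best:.0%} to {worst:.0%}")
```

Whitespace tokenisation is crude, so treat the coverage figure as an estimate, and measure it on a large corpus rather than on a single sentence.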
- Find a language pair of your choice in Apertium and install it (see Minimal installation from SVN).
- Translate 2,000 words of text (e.g. four articles of 500 words) using Apertium.
- Postedit the translated text to make a reference translation.
- Use two of the articles (the input text and the post-edited translation) to improve the translator.
- Add all the missing words to the dictionaries, and cover all the structures with transfer rules.
- Evaluation: calculate the improvement that you were able to make on these two articles, and on the two held-out articles.
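The evaluation step can be sketched as follows. This is a minimal example, not the project's official evaluation script: it computes word error rate as the word-level edit distance between the machine translation and the post-edited reference, divided by the reference length, using whitespace tokenisation. The function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(wer_before: float, wer_after: float) -> float:
    """Relative WER improvement, e.g. 0.30 -> 0.15 gives 0.5."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return (wer_before - wer_after) / wer_before


if __name__ == "__main__":
    postedited = "the cat sat on the mat"
    before = word_error_rate(postedited, "the cat sit on mat")
    after = word_error_rate(postedited, "the cat sat on mat")
    print(f"WER before: {before:.1%}, after: {after:.1%}")
    print(f"relative reduction: {relative_reduction(before, after):.0%}")
```

Run it once on the two articles used for development and once on the two held-out articles, before and after your changes. The held-out figure shows how well the improvements generalise.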
Frequently asked questions
- What if my pair is composed of two popular languages, for instance two official languages of EU?
- Then this task will be hard. Pairs which have huge corpora of parallel texts, like the 24 official EU languages or the 3 EU working languages, tend to have very good statistical machine translation software already. Your work will be valuable, and accepted, only if you manage to reach a comparable quality. For instance, if europarl+moses gets a WER of 15-20%, we'd be happy with 25%.
- What happens if I don't reach the expected results?
- It's no big deal! GSoC is a scholarship, not a service contract. If you don't deliver what was agreed or expected, you'll fail the midterm or final evaluation and lose the corresponding stipend installment(s). At least you tried! And, hopefully, learnt a lot in the process.
More: ask us something! :)
- An example work plan for a language pair: Maltese_and_Arabic/Work_plan