Difference between revisions of "User:Marcriera/proposal"

From Apertium

Revision as of 11:23, 1 April 2017

Contact Information

Name: Marc Riera Irigoyen

Location: Barcelona, Spain

E-mail: marc.riera.irigoyen@gmail.com

IRC: mriera_trad

SourceForge: marcriera

Timezone: UTC+02:00

Why is it that you are interested in machine translation?

Before becoming interested in machine translation, I was already very interested in translation, so I decided to pursue a degree in Translation and Interpreting at university. After learning about computer-assisted translation, I became increasingly interested in machine translation, not as a replacement for human translation but as a way to improve the effectiveness and productivity of human translators.

Why is it that you are interested in Apertium?

The Apertium project is very interesting thanks to its open-source nature. It is an opportunity not only to build a robust and fairly good rule-based machine translator, but also to help develop translation pairs that would be extremely difficult to implement in statistical machine translators, such as minority language pairs. As a native speaker of one of these languages (Catalan), I find the fact that Apertium can offer better results than other types of translators very attractive.

Which of the published tasks are you interested in? What do you plan to do?

Currently, there is an English-Catalan language pair (en-ca) in trunk. However, this pair uses its own monolingual dictionaries, which makes future development more difficult. The aim of this project is to migrate the changes from this old pair to the new eng-cat pair under development so that en-ca can be retired, and then to expand the dictionaries and transfer rules of eng-cat. For this purpose, featured Wikipedia articles and public domain books will be used. The work will focus on translating from English to Catalan, but translation from Catalan to English will also benefit from the expanded dictionaries.

Title

Bringing the English-Catalan language pair to state-of-the-art quality

Reasons why Google and Apertium should sponsor it

Although there is already an English-Catalan language pair, the quality of its translations can be improved considerably. Other machine translators already offer English-Catalan translation with good results, but Apertium has the advantage of being a free and open-source project. This matches the philosophy of other internet projects such as Wikipedia, which could use Apertium by default to improve translation quality and reduce post-editing, especially in more specialized language contexts.

How and who it will benefit in society

Catalan is a language with only 10 million speakers, but a very lively one. An improved rule-based machine translator for English and Catalan will give Catalan speakers better English-to-Catalan translations than statistical translators currently produce. The open-source nature of Apertium will hopefully encourage online content creators to start offering Catalan translations or to improve their existing ones. English speakers will also benefit from the expanded bidix, and future Apertium developers will be able to develop the language pair further more easily thanks to the unification.

List your skills and give evidence of your qualifications

My native languages are Catalan and Spanish, and I also speak English and Japanese. I am currently pursuing a degree in Translation and Interpreting. I have collaborated with several open-source software projects as an English-to-Catalan translator, and during the previous term of the current academic year I translated a book for a publisher. In addition, I have been an active programmer and translator in the OpenBVE project, an open-source railway simulator, since 2015. I am an experienced Debian and Fedora user, and I know C# and XML.

List any non-Summer-of-Code plans you have for the Summer

I have final exams during the first two weeks of June, so I will not be able to work on the project as much as I would like during that period. After that, I will be able to spend at least 30 hours a week on Apertium.

My plan

Major goals

While the most important goal is to merge the old and new language pairs, most of the work during the summer will be related to dictionaries and transfer rules:

  • Good WER (~20%)
  • Excellent coverage (~92%)
  • Testvoc clean
  • New stems in bidix (~3,000 stems a week)
  • Additional transfer rules, lexical selection rules and, if necessary, CG.
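
To make the WER and coverage targets above concrete, here is a minimal sketch of how the two figures could be computed. It assumes WER is the word-level edit distance divided by the reference length, and that coverage is the share of corpus tokens found in a set of known word forms. The function names and the example sentences are hypothetical and are not part of the Apertium tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def naive_coverage(tokens, known_words) -> float:
    """Share of tokens whose lowercased form is in known_words (an assumption:
    a real measurement would ask the morphological analyser instead)."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("tokens must not be empty")
    known = sum(1 for token in tokens if token.lower() in known_words)
    return known / len(tokens)


if __name__ == "__main__":
    # Hypothetical post-edited reference and raw machine output.
    reference = "el gat dorm al sofà"
    hypothesis = "el gat dormir al sofà"
    print(word_error_rate(reference, hypothesis))
    print(naive_coverage(["The", "cat", "zzz", "sat"], {"the", "cat", "sat"}))
```

With one wrong word out of five, the example sentence scores a WER of 20%, which is the order of magnitude targeted here.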

Workplan

Week Dates Goals Bidix WER / PER Coverage
Post-application period 4 April - 29 May
  • Work on eng-cat to bring it to the same level as en-ca
  • Make pronouns and verbs work (they are currently broken)
~35,000 41.15%/29.34% (en-ca)
1 30 May - 4 June
  • Add new stems to dictionaries
  • Finish the transfer work from en-ca to eng-cat
  • Corpora research to get word frequency list
~38,000
2 5 June - 11 June
  • Add new stems to dictionaries
  • Final testing to retire en-ca for good
~41,000
3 12 June - 18 June
  • Add new stems to dictionaries
  • Begin analysis of translation error patterns to prepare rule priority list
~44,000
4 19 June - 25 June
  • Add new stems to dictionaries
~47,000
5 26 June - 2 July
  • Add new stems to dictionaries
  • Testvoc

Deliverable #1

~50,000
6 3 July - 9 July
  • Add new stems to dictionaries
~53,000
7 10 July - 16 July
  • Add new stems to dictionaries
~56,000
8 17 July - 23 July
  • Add new stems to dictionaries
  • Tagger training
~59,000
9 24 July - 30 July
  • Add new stems to dictionaries
  • Testvoc
  • Finish list of necessary rules by frequency

Deliverable #2

~62,000
10 31 July - 6 August
  • Add new stems to dictionaries
  • Add transfer rules (EN>CA)
~65,000
11 7 August - 13 August
  • Add new stems to dictionaries
  • Add transfer rules (EN>CA)
~67,000
12 14 August - 20 August
  • Add new stems to dictionaries
  • Add transfer rules (CA>EN)
~69,000
13 21 August - 27 August
  • Add new stems to dictionaries
  • Testvoc
  • Write documentation

Final evaluation

~70,000 ~20% ~92%

Final timeline to be determined
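
The corpus research planned for week 1 is meant to produce a word frequency list, so that the most frequent missing words are added to the dictionaries first. The following is a minimal sketch under stated assumptions: the tokenizer is a simple regular expression, `known_words` is a hypothetical set of surface forms already covered, and the function name is illustrative. It counts surface forms rather than stems, so a real list would still need lemmatisation.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by an apostrophe or hyphen (e.g. "l'home").
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def missing_words_by_frequency(text, known_words, top=10):
    """Return the `top` most frequent lowercased words absent from known_words."""
    counts = Counter(match.group(0).lower() for match in TOKEN_RE.finditer(text))
    # most_common() sorts by descending count; ties keep first-seen order.
    unknown = [(word, n) for word, n in counts.most_common() if word not in known_words]
    return unknown[:top]


if __name__ == "__main__":
    sample = "The train left. The train arrived; the station was empty."
    known = {"the", "was", "left"}  # hypothetical dictionary contents
    for word, n in missing_words_by_frequency(sample, known, top=3):
        print(word, n)
```

Sorting by frequency means that each batch of new stems covers as many running words of the corpus as possible, which is what moves the coverage figure fastest.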

Coding challenge

Currently working on it...