Contact Information
Name: Marc Riera Irigoyen
Location: Barcelona, Spain
E-mail: marc.riera.irigoyen@gmail.com
IRC: mriera_trad
SourceForge: marcriera
Timezone: UTC+02:00
Why is it that you are interested in machine translation?
Before becoming interested in machine translation, I was very interested in translation itself, and I decided to pursue a degree in Translation and Interpreting at university. After learning about computer-assisted translation, I became more and more interested in machine translation, not as a replacement for human translation, but as a way to improve the effectiveness and productivity of human translators.
Why is it that you are interested in Apertium?
The Apertium project is very interesting thanks to its open-source nature. It is an opportunity not only to build a robust and fairly good rule-based machine translation system, but also to help develop language pairs that would be extremely difficult to implement with statistical machine translation, such as pairs involving minority languages. As a native speaker of one of these languages (Catalan), I find it very attractive that Apertium can offer better results than other types of translators.
Which of the published tasks are you interested in? What do you plan to do?
Currently, there is an English-Catalan language pair (en-ca) in trunk. However, this pair uses its own monolingual dictionaries, which makes future development more difficult. The aim of this project is to migrate the changes from this old pair to the new one under development (eng-cat) in order to get rid of the en-ca pair, and then to expand the dictionaries and transfer rules of the new eng-cat pair. For this purpose, featured Wikipedia articles and public-domain books will be used. However, as the existing transfer rules lack any kind of organization, refactoring them will be absolutely necessary before they can be expanded.
Apertium now provides an English-Catalan language pair that has been developed enough to allow for assimilation (to a certain extent), but it is still very far from allowing dissemination. Furthermore, translation from Catalan to English still tends to fail, which prevents proper assimilation for potential users. Therefore, although the main purpose of this proposal is to improve the language pair in the EN>CA direction, some rules will also be added to make it work better in the opposite direction.
Tagger training should not be necessary (at least for English), but depending on the results when translating from Catalan to English, it may be required. For this reason, there is time assigned to this task in the work plan. The same applies to constraint grammar: most of the new rules will be transfer or lexical selection rules, but some CG rules might be needed.
Coverage will be calculated based on Wikipedia. Due to the enormous size of the English edition, a word frequency list will be generated from a large portion of it (at least 100 million words), and coverage will be measured against that list. This makes coverage testing far more efficient than analyzing the full corpus token by token every time.
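As an illustration, the sketch below shows the kind of frequency-weighted coverage check this makes possible. It is only an illustration, not the project's actual tooling: the file names (freq.txt, analyses.txt) and the pipeline that produces them are hypothetical, and the only Apertium-specific assumption is the usual convention that unknown words receive a '*' analysis from the morphological analyser.
<pre>
#!/usr/bin/env python3
"""Illustrative sketch of frequency-weighted coverage (not the project's actual tooling).

Hypothetical inputs:
  freq.txt      lines of "<count> <surface form>", e.g. produced with
                `tr ' ' '\n' < wiki.txt | sort | uniq -c` on the Wikipedia text
  analyses.txt  Apertium stream-format analyses of the distinct forms, one per line;
                unknown forms are assumed to carry a '*' analysis
"""

def load_freqs(path):
    """Return a {surface form: corpus frequency} mapping."""
    freqs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                count, form = parts
                freqs[form] = freqs.get(form, 0) + int(count)
    return freqs

def load_known_forms(path):
    """Return the set of surface forms that received a real analysis."""
    known = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not (line.startswith("^") and line.endswith("$")):
                continue                          # skip blanks and superblanks
            surface, _, analyses = line[1:-1].partition("/")
            if not analyses.startswith("*"):      # '*' marks an unknown word
                known.add(surface)
    return known

if __name__ == "__main__":
    freqs = load_freqs("freq.txt")
    known = load_known_forms("analyses.txt")
    total = sum(freqs.values())
    covered = sum(c for form, c in freqs.items() if form in known)
    print(f"Naive coverage: {100 * covered / total:.2f}% of {total} tokens")
</pre>
Weighting each distinct form by its corpus frequency means the corpus only has to be tokenized once; later coverage checks reduce to one dictionary lookup per form.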
Title
Adopting the English-Catalan language pair to bring it close to state-of-the-art quality
Reasons why Google and Apertium should sponsor it
While there is already an English-Catalan language pair, the quality of its translations can be improved a lot. Other machine translation systems already offer English-Catalan translation with good results, but Apertium has the advantage of being a free and open-source project. This matches the spirit of other internet projects such as Wikipedia, which could use Apertium by default to improve the quality of translations and reduce post-editing effort. This is especially the case for specialized texts, where Apertium already outperforms statistical machine translation in pairs such as en-es (see http://www.elespanol.com/ciencia/tecnologia/20160926/158484429_0.html).
How and whom it will benefit in society
Catalan is a language with only 10 million speakers, but a very lively one. An improved rule-based machine translation system for English and Catalan will allow Catalan speakers to benefit from better English-to-Catalan translations than the current output of statistical translators. The open-source nature of Apertium will hopefully encourage online content creators to start offering their content in Catalan or to improve their current translations. English speakers will also benefit from the expanded bilingual dictionary, and future Apertium developers will be able to keep developing the language pair more easily thanks to the unification.
List your skills and give evidence of your qualifications
My native languages are Catalan and Spanish, and I also speak English and Japanese. I am currently studying for a degree in Translation and Interpreting. I have collaborated with several open-source software projects as an English-to-Catalan translator, and during the previous term of the current academic year I translated a book for a publisher. In addition, since 2015 I have been an active programmer and translator in the OpenBVE project, an open-source railway simulator. I am an experienced Debian and Fedora user, and I know C# and XML.
List any non-Summer-of-Code plans you have for the Summer
I have final exams during the first week of June, so I will only be able to work around 20 hours that week. After that, I will be able to spend at least 30 hours a week on Apertium. I will put in some extra hours during the first month to compensate for the hours "lost" during the exam period.
My plan
Major goals
While the most important goal is to merge both (old and new) language pairs, most of the work during the summer will be related to dictionaries and transfer rules:
- Decent WER (~32%)
- Good coverage (~90%)
- Testvoc clean (a minimal check is sketched after this list)
- New stems in bidix (~2000 stems a week)
- Old rule refactoring
- Additional transfer rules, lexical selection rules and, if necessary, CG.
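Because a testvoc-clean pair is one of the recurring checkpoints in the plan below, here is a rough idea of what that check involves. It is a simplified sketch rather than the pair's actual testvoc scripts: it assumes a hypothetical file testvoc-output.txt holding the result of piping every expanded form of the source dictionary through the whole translation pipeline in debug mode, and it simply counts the standard Apertium debug marks.
<pre>
#!/usr/bin/env python3
"""Simplified testvoc-style scan (illustration only, not the real testvoc scripts)."""
import sys
from collections import Counter

# Debug marks that point to a gap somewhere in the pipeline:
#   *  unknown to the morphological analyser
#   @  lemma missing from the bilingual dictionary
#   #  the target-side generator could not produce the form
MARKS = "*@#"

def scan(path):
    hits = Counter()
    examples = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            for mark in MARKS:
                if mark in line:
                    hits[mark] += 1
                    examples.setdefault(mark, line.strip())
    if not hits:
        print("Testvoc clean: no debug marks found.")
        return 0
    for mark, count in hits.most_common():
        print(f"{count:6d} lines with '{mark}', e.g. {examples[mark]}")
    return 1

if __name__ == "__main__":
    sys.exit(scan(sys.argv[1] if len(sys.argv) > 1 else "testvoc-output.txt"))
</pre>
The pair counts as testvoc clean when a scan like this reports no '*', '@' or '#' marks at all.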
Workplan
| Week | Dates | Goals | Bidix | WER / PER | Coverage |
|---|---|---|---|---|---|
| Post-application period | 4 April - 29 May | Work on eng-cat to bring it to the same level as en-ca; make pronouns and verbs work (currently broken); corpora research to get a word frequency list; prepare word lists for semi-automatic addition to dictionaries; write documentation about the current state of transfer rules | ~35,000 | 41.15%/29.34% (en-ca), 47.63%/38.5% (eng-cat) | ~85.9% |
| 1 | 30 May - 4 June | Add new stems to dictionaries; finish the transfer work from en-ca to eng-cat; corpora research to get a word frequency list; write documentation about the current state of transfer rules | ~37,000 | ~45% | ~86.3% |
| 2 | 5 June - 11 June | Add new stems to dictionaries; final testing to definitively retire en-ca; write documentation about the current state of transfer rules | ~39,000 | ~43.5% | ~86.7% |
| 3 | 12 June - 18 June | Add new stems to dictionaries; transfer rule refactoring; begin analysis of translation error patterns to prepare a rule priority list | ~41,000 | ~42% | ~87.1% |
| 4 | 19 June - 25 June | Add new stems to dictionaries; improve semi-automation when adding new stems (especially proper nouns) | ~43,000 | ~40.5% | ~87.5% |
| 5 | 26 June - 2 July | Add new stems to dictionaries; testvoc; transfer rule refactoring. Deliverable #1 | ~45,000 | ~39% | ~87.8% |
| 6 | 3 July - 9 July | Add new stems to dictionaries; transfer rule refactoring; begin analysis of translation error patterns to prepare a rule priority list | ~47,000 | ~38% | ~88.1% |
| 7 | 10 July - 16 July | Add new stems to dictionaries | ~49,000 | ~37% | ~88.5% |
| 8 | 17 July - 23 July | Add new stems to dictionaries; tagger training (if necessary) | ~51,000 | ~36% | ~88.8% |
| 9 | 24 July - 30 July | Finish list of necessary rules by frequency. Deliverable #2 | ~53,000 | ~35% | ~89.1% |
| 10 | 31 July - 6 August | Add new stems to dictionaries; add transfer rules (EN>CA) | ~55,000 | ~34% | ~89.4% |
| 11 | 7 August - 13 August | Add new stems to dictionaries; add transfer rules (EN>CA) | ~56,500 | ~33.5% | ~89.6% |
| 12 | 14 August - 20 August | Add new stems to dictionaries; add transfer rules (CA>EN) | ~58,000 | ~33% | ~89.8% |
| 13 | 21 August - 27 August | Write documentation. Final evaluation | ~59,000 | ~32.5% | ~89.9% |
Coding challenge
During the application period, I decided to test my skills by improving the performance of the en-ca language pair. For this purpose, four 500-word fragments of Wikipedia featured articles were translated with Apertium and post-edited. Two of the texts were then analyzed, and new stems and transfer rules were added to improve their translation (the changes can be seen at https://patch-diff.githubusercontent.com/raw/xavivars/apertium-en-ca/pull/1.patch). Finally, the four articles were retranslated, and both the old and the new translations were evaluated with apertium-eval-translator. The combined results for the four texts (about 2,000 words) were the following:
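For reference, the two metrics reported below can be reproduced from their standard definitions: WER is the word-level edit distance divided by the number of reference words, and PER is the same idea with word order ignored (a bag-of-words comparison). The reported figures are consistent with these definitions (for instance, 946 / 2188 ≈ 43.24 %), although apertium-eval-translator's own implementation may differ in tokenization and in how the unknown-word stars are handled. The following is a minimal sketch:
<pre>
#!/usr/bin/env python3
"""Minimal sketch of WER and PER over whitespace-tokenized text."""
from collections import Counter

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def per(reference, hypothesis):
    """Position-independent error rate: word order is ignored."""
    ref, hyp = reference.split(), hypothesis.split()
    correct = sum((Counter(ref) & Counter(hyp)).values())
    return (len(ref) - correct) / len(ref)

if __name__ == "__main__":
    ref = "the cat is on the mat"
    hyp = "the cat sits on mat"
    print(f"WER: {wer(ref, hyp):.2%}  PER: {per(ref, hyp):.2%}")
</pre>
Since PER discards word order, the gap between WER and PER gives a rough idea of how many of the remaining errors are reordering problems rather than wrong word choices.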
Before the changes
<pre>
Statistics about input files
-------------------------------------------------------
Number of words in reference: 2188
Number of words in test: 2056
Number of unknown words (marked with a star) in test: 159
Percentage of unknown words: 7.73 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 946
Word error rate (WER): 43.24 %
Number of position-independent correct words: 1569
Position-independent word error rate (PER): 28.29 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 1006
Word Error Rate (WER): 45.98 %
Number of position-independent correct words: 1502
Position-independent word error rate (PER): 31.35 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 60
Percentage of unknown words that were free rides: 37.74 %
</pre>
After the changes
<pre>
Statistics about input files
-------------------------------------------------------
Number of words in reference: 2188
Number of words in test: 2067
Number of unknown words (marked with a star) in test: 86
Percentage of unknown words: 4.16 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 838
Word error rate (WER): 38.30 %
Number of position-independent correct words: 1660
Position-independent word error rate (PER): 24.13 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 873
Word Error Rate (WER): 39.90 %
Number of position-independent correct words: 1622
Position-independent word error rate (PER): 25.87 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 35
Percentage of unknown words that were free rides: 40.70 %
</pre>