Difference between revisions of "User:Mathematic-alpha/proposal"

Line 1: Line 1:
  +
== Contact Information ==
   
  +
'''Name:''' Ngadou Yopa Sylvestre Ronald
   
  +
'''Display name:''' Ngadou Yopa
   
  +
'''Location:''' [https://en.wikipedia.org/wiki/Buea Malingo Street, Buea, Cameroon]
Ngadou Yopa
 
Malingo Street
 
Buea, Cameroon
 
(237) 681-702-945
 
Project: Adopt an unreleased language pair with a minimal user interface
 
April 2019
 
   
  +
'''E-mail:''' [mailto:yopasylvestre@gmail.com yopasylvestre@gmail.com] ([mailto:mathalpha26@gmail.com mathalpha26@gmail.com])
Name: Ngadou Yopa Sylvestre Ronald
 
  +
IRC Nickname: math-alpha (m-alpha)
 
  +
'''IRC:''' math-alpha (m-alpha)
E-mail address: yopasylvestre@gmail.com (mathalpha26@gmail.com)
 
  +
Website: http://ngadou.me/portfolio
 
Github: https://github.com/math-alpha
+
'''GitHub:''' [https://github.com/math-alpha math-alpha]
  +
Gitlab: https://gitlab.com/mathematic-alpha
 
  +
'''Gitlab:''' [https://gitlab.com/mathematic-alpha mathematic-alpha]
Time Zone: UTC +1:00 (Central Africa)
 
  +
School/Degree: B.Eng. in Computer Engineering, Faculty of Engineering and Technology, Buea, Cameroon
 
  +
'''Telegram:''' [https://t.me/ngadou @ngadou]
  +
  +
'''Website:''' http://ngadou.me/portfolio
  +
  +
'''Time Zone:''' UTC +1:00 (Central Africa)
  +
  +
'''School/Degree:''' B.Eng. in Computer Engineering, Faculty of Engineering and Technology, Buea, Cameroon
 
Expected Graduation Year: December 2021
 
   
  +
== Why is it that I am interested in machine translation? ==
WHY IS IT THAT I AM INTERESTED IN MACHINE TRANSLATION?
 
  +
A language is a method of human communication, either spoken or written, consisting of the use of words in a structured and conventional way. For a community to integrate in this evolving world, it needs an interface to communicate with other cultures.
 
  +
A language is a method of human communication, either spoken or written, consisting of the use of words in a structured and conventional way. For a community to integrate into this evolving world, it needs an interface to communicate with other cultures.
 
I study computer engineering and I am highly interested in AI and mathematics. Machine translation is one of the branches of these sciences hence my interest.
 
  +
WHY IS IT THAT I AM INTERESTED IN APERTIUM?
 
  +
== Why is it that I am interested in Apertium? ==
  +
 
Apertium is an open source rule-based Machine Translation project and one of the very rare organizations working on NLP. I highly appreciate the community and developers who are doing great work in machine translation.
 
   
  +
*Because Apertium is free/open-source software.
THE PUBLISHED TASK I AM INTERESTED IN
 
  +
*Because its community is strongly committed to under-resourced and minoritised/marginalised languages.
Adopt an unreleased language pair : I'd like to develop the pairs Mə̀dʉ̂mbɑ̀-Français which is actually in the nursery plus a minimal user interface.
 
  +
*Because there is a lot of good work done and being done in it.
 
  +
*Because it is not only machine translation, but also free resources that can be used for other purposes: e.g. dictionaries, morphological analysers, spellcheckers, etc.
MY PROPOSAL
 
  +
Title
 
  +
== Which of the published tasks are you interested in? What do you plan to do? ==
  +
  +
'''Adopt an unreleased language pair:''' I'd like to develop the Mə̀dʉ̂mbɑ̀-Français pair, which is currently in the nursery, plus a minimal user interface.
  +
  +
== My proposal ==
  +
  +
=== Title ===
 
Adopt an unreleased language pair with a minimal user interface
 
 
Major goals
 
Improving the Mə̀dʉ̂mbɑ̀-Français language pair up to 91 % of publicly available Mə̀dʉ̂mbɑ̀ corpus
 
Mə̀dʉ̂mbɑ̀ to Français
 
Français to Mə̀dʉ̂mbɑ̀
 
Developing a minimal interface for adding words and transfer rules
 
 
 
Reasons why Google and Apertium should sponsor it?
 
There exist many local cultural movements in Africa with the goal of developing language and opening to the world. This project will definitely mark a starting point or proof of concept in Machine Translation in Cameroon.
 
   
  +
=== Major goals ===
  +
  +
*Improving the Mə̀dʉ̂mbɑ̀-Français language pair to reach about 91% coverage of the publicly available Mə̀dʉ̂mbɑ̀ corpus
  +
**Mə̀dʉ̂mbɑ̀ to Français
  +
**Français to Mə̀dʉ̂mbɑ̀
  +
*Developing a minimal interface for adding words and transfer rules
  +
  +
Unlike it happens with French, compared to other Romance languages, there are not big structural (syntax) differences between Catalan, Italian and Portuguese. If we improve the morphological disambiguation, add several thousands of words in the dictionaries, introduce lexical selection rules and create some more transfer rules, a low WER can be reached.
  +
  +
During the post-application period I plan to study in more detail the apertium-ambiguous package. Nevertheless as syntax in the three languages is practically the same, I think it would be useful only for introducing phrases. For the same reason, I do not think that apertium-separator would help here (unless it is indeed helpful for e.g. the French-Catalan pair). My current perception is that for the limited time of the GSoC it is better to invest in the expansion of dictionaries, the improvement of morphological disambiguation (especially for Portuguese, but also for Italian) and the introduction of a few more structural transfer rules. In any case I am totally open to suggestions and will be happy to try new modules if my potential mentors consider it preferable.
  +
  +
=== Reasons why Google and Apertium should sponsor it ===
  +
  +
As mentioned above, the Apertium community is strongly committed to under-resourced and minoritised/marginalised languages, and Google helps in its own way via programs like GSoC and GCI.
  +
There are many local cultural movements in Africa that aim to develop their languages and open them to the world, but they rarely rest on a scientific basis. This project will mark a starting point, or proof of concept, for machine translation in Cameroon and will have a strongly positive impact on language development.
  +
  +
==== Italian to Catalan ====
  +
*bidix: 9091 pairs (excluding proper names)
  +
*Coverage: 81.7% (calculated using a Wikipedia corpus with 3.0 M words)
  +
*Word error rate (WER): 30.0% (calculated using random Wikipedia texts with a total of 933 words)
  +
*Word error rate (WER) using Google Translator: 14.0% (calculated using the same test text)
  +
  +
==== Portuguese to Catalan ====
  +
*bidix: 7576 pairs (excluding proper names)
  +
*Coverage: 84.4% (calculated using a Wikipedia corpus with 3.1 M words)
  +
*Word error rate (WER): 28.4% (calculated using random Wikipedia texts with a total of 1648 words)
  +
*Word error rate (WER) using Google Translator: 21.6% (calculated using the same test text)
  +
  +
==== Catalan to Portuguese ====
  +
*bidix: 7576 pairs (excluding proper names)
  +
*coverage: 87.6% (calculated using a Wikipedia corpus with 2.6 million words)
  +
  +
==== apertium-por ====
  +
*There is not any released language pair working with apertium-por.
  +
*There is a basic morphological disambiguation using CG, but less developed than in apertium-cat, apertium-spa and apertium-fra.
  +
*The dictionary has mainly old-fashioned paradigms for proper names:
  +
**it uses "loc" instead of "top"
  +
**proper names do not have gender and number
  +
**there are a few "cog", but most of them are defined as "ant"
  +
  +
In my experience, bringing proper names in line with the current paradigm style significantly improves translation results between Romance languages on Wikipedia texts, where proper names are very numerous. This was an important part of the 2017 English-Catalan GSoC project, and the word lists created then by Marc Riera would greatly help the work on the Portuguese dictionary, which has only a little over 2,000 proper names.
  +
  +
=== Online translations ===
  +
  +
Some independent organisations have made efforts to develop dictionaries for Mə̀dʉ̂mbɑ̀, such as [https://glosbe.com/en/byv Glosbe], [https://translation.babylon-software.com/english/Medumba/ Babylon-Software], [https://resulam.com/ghomala-5/ Resulam] and a few others. The problem is that they use the "naive approach", in the sense that they neither do PoS tagging nor have transfer rules.
  +
  +
  +
=== Workplan ===
  +
  +
{|class="wikitable"
  +
! style="width: 10%" | Week
  +
! style="width: 15%" | Dates
  +
! style="width: 36%" | Goals
  +
! style="width: 13%" | Bidix<br>(excluding<br>proper names)
  +
! style="width: 13%" | WER
  +
! style="width: 13%" | Coverage
  +
|-
  +
! Post-application period
  +
| style="text-align:center" | 10 April - 26 May
  +
|
  +
* Find more language resources (Diktionary et al.)
  +
* Build frequency lists for Italian-Catalan
  +
* Build frequency lists for Portuguese-Catalan
  +
* Construct pending tests for the 4 directions
  +
* Study in more detail [[Using weights for ambiguous rules]]
  +
| style="text-align:center" | ~9,000 (cat-ita)<br>~7,500 (cat-por)
  +
| style="text-align:center" | ~30% (cat > ita)<br>~30% (cat > por)<br>~30% (por > cat)
  +
| style="text-align:center" | ~88% (cat > ita)<br>~82% (ita > cat)<br>~88% (cat > por)<br>~84% (por > cat)
  +
|-
  +
! 1
  +
| style="text-align:center" | 27 May - 2 June
  +
| style="text-align:center" | Improving Mə̀dʉ̂mbɑ̀ monodix<br/>Adding prn, pr, cnj*, basic adv to bidix
  +
| style="text-align:center" | ~6,000 (cat-ita)
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~85.5% (ita > cat)
  +
|-
  +
! 2
  +
| style="text-align:center" | 3 June - 9 June
  +
| style="text-align:center" | Adding n, adj, adv to the bidix from the French dictionary
  +
| style="text-align:center" | ~9,000 (cat-ita)
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~87.5% (ita > cat)
  +
|-
  +
! 3
  +
| style="text-align:center" | 10 June - 16 June
  +
| style="text-align:center" | Adding vblex to the bidix from the French dictionary<br/>Beginning to add missing words in decreasing order of frequency fra > byv
  +
| style="text-align:center" | ~14,000 (cat-ita)
  +
| style="text-align:center" | <20% (ita > cat)
  +
| style="text-align:center" | ~89% (ita > cat)
  +
|-
  +
! 4
  +
| style="text-align:center" | 17 June - 23 June
  +
| style="text-align:center" | Adding words<br/>Transfer rules fra > oci
  +
| style="text-align:center" | ~15,000 (cat-ita)
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~90% (cat > ita)<br>~90% (ita > cat)
  +
|-
  +
! 5
  +
| style="text-align:center" | 24 June - 30 June
  +
| style="text-align:center" | <b>Deliverable #1: Mə̀dʉ̂mbɑ̀ to French translator</b><br/>Adding words
  +
'''First evaluation''' (28 June)
  +
| style="text-align:center" | ~16,000 (cat-ita)
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~90.5% (cat > ita)<br>~90.5% (ita > cat)
  +
|-
  +
! 6
  +
| style="text-align:center" | 1 July - 7 July
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~17,000 (cat-ita)
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~91% (cat > ita)<br>~91% (ita > cat)
  +
|-
  +
! 7
  +
| style="text-align:center" | 8 July - 14 July
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~18,000 (cat-ita)
  +
| style="text-align:center" | <15% (cat > ita)<br><15% (ita > cat)
  +
| style="text-align:center" | ~91.5% (cat > ita)<br>~91.5% (ita > cat)
  +
|-
  +
! 8
  +
| style="text-align:center" | 15 July - 21 July
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~9,500 (cat-por)
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~87% (por > cat)
  +
|-
  +
! 9
  +
| style="text-align:center" | 22 July - 28 July
  +
| style="text-align:center" |
  +
'''Second evaluation''' (26 July)
  +
| style="text-align:center" | ~11,500 (cat-por)
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~89% (por > cat)
  +
|-
  +
! 10
  +
| style="text-align:center" | 29 July - 4 August
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~13,000 (cat-por)
  +
| style="text-align:center" | <20% (por > cat)
  +
| style="text-align:center" | ~89.5% (por > cat)
  +
|-
  +
! 11
  +
| style="text-align:center" | 5 August - 11 August
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~14,500 (cat-por)
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~90% (cat > por)<br>~90% (por > cat)
  +
|-
  +
! 12
  +
| style="text-align:center" | 12 August - 18 August
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~16,000 (cat-por)
  +
| style="text-align:center" |
  +
| style="text-align:center" | ~90.5% (cat > por)<br>~90.5% (por > cat)
  +
|-
  +
! 13
  +
| style="text-align:center" | 19 August - 25 August
  +
| style="text-align:center" |
  +
'''Final evaluation''' (26 August)
  +
| style="text-align:center" | ~17,000 (cat-por)
  +
| style="text-align:center" | <15% (cat > por)<br><15% (por > cat)
  +
| style="text-align:center" | ~91.0% (cat > por)<br>~91.0% (por > cat)
  +
|}
  +
  +
{|class=wikitable
  +
! Setmana !! Dates !! Descripció !! Bidix<br/>(sense np)<br/>previst !!(%) Cobertura<br/>prevista !! (%) WER<br/>previst !! Testvoc !! Avaluació !! Bidix<br/>real !! (%) Cobertura<br/>real !! (%) WER !! Err. !! Fet?
  +
|-
  +
| 0 || <b>français > occitan</b> || || ~5 700 || || || || || || || || ||
  +
|-
  +
| 1 || 14 mai&mdash;20 mai || Improving Occitan monodix<br/>Adding prn, pr, cnj*, basic adv to bidix || ~6,000 || ~84,0% || || || || 7643 || 77,1% || || || ½
  +
|-
  +
| 2 || 21 mai&mdash;27 mai || Adding n, adj, adv to the bidix from the French dictionary || ~12,000 || ~86,0% || || || || 12811 || 82,2% || || || ½
  +
|-
  +
| 3 || 28 mai&mdash;3 junh || Adding vblex to the bidix from the French Wiktionary<br/>Beginning to add missing words in decreasing order of frequency fra > oci || ~14,000 || ~88.0% || || || || 14452 || 85,1% || || || ½
  +
|-
  +
| 4 || 4 junh&mdash;10 junh || Adding words<br/>Transfer rules fra > oci || ~16,000 || ~89.0% || || || || 16745 || 89,2% || || || ✓
  +
|-
  +
| 5 || <b>11 junh&mdash;15 junh<br>Deliverable #1: French to Occitan translator</b> || Adding words<br/>Transfer rules fra > oci || <b>~18,000</b> || <b>~89.5%</b> || <b>~25%</b> || || || 19897 || 91,1% || (WP) 15,0% || || ✓
  +
|-
  +
| 6 || 18 junh&mdash;24 junh || Adding words<br/>Transfer rules fra > oci<br/>Begin testvoc fra > oci || ~20,000 || ~90.0% || || pr, cnj*, adv, prn, det || || 20581 || 91,5% || (WP) 12,3% || 0 || ✓
  +
|-
  +
| 7 || 25 junh&mdash;1 julhet || Adding words<br/>Transfer rules fra > oci<br/>Testvoc fra > oci || ~21,000 || ~90.5% || || vblex || || 21823 || 91,8% || (Euro- News) 18,0% || 0 || ✓
  +
|-
  +
| 8 || 2 julhet&mdash;8 julhet || Adding words<br/>Transfer rules fra > oci<br/>Testvoc fra > oci || ~22,000 || ~91.0% || || adj || || 22609 || 91,9% || || 0 || ✓
  +
|-
  +
| 9 || <b>9 julhet&mdash;13 julhet<br>Deliverable #2: French to Occitan translator</b> || Transfer rules fra > oci<br/>Testvoc fra > oci || <b>~22,000</b> || <b>~91.0%</b> || <b>~15%</b> || n || || 25045 || 92,1% || (WP) 7,2% || 0 || ✓
  +
|-
  +
| 0 || <b>occitan > français</b> || || ~22,000 || || || || || || || || ||
  +
|-
  +
| 10 || 16 julhet&mdash;22 julhet || Adding missing words in decreasing order of frequency oci > fra<br/>Transfer rules oci > fra<br/>Testvoc oci > fra || ~22,500 || ~88.0% || || pr, cnj*, adv, prn, det || || 25161 || 91,7% || fra>oci (Euro- News) 6,6% || 10 || ✓
  +
|-
  +
| 13 || 23 julhet&mdash;29 julhet || Adding words<br/>Transfer rules oci > fra <br/>Testvoc oci > fra || ~23,000 || ~89.0% || || n, adj || || 25504 || 92,1% || || 1 || ✓
  +
|-
  +
| 11 || 30 julhet&mdash;5 agost || Adding words<br/>Transfer rules oci > fra <br/>Testvoc oci > fra || ~23,500 || ~90.0% || || vblex || || 26908 || oci>fra 92,9% fra>oci 92,3% || fra>oci (WP) 10,0% || 0 || ½
  +
|-
  +
| 12* || 6 agost&mdash;9 agost || Final improvements || || || || || || || || || ||
  +
|-
  +
| 12** || <b>10 agost&mdash;14 agost<br>Deliverable #3: Occitan to French translator</b> || Final evaluation || <b>~23,500</b> || <b>~90.0%</b> || <b>~30%</b> || || || || || || ||
  +
|-
  +
|}
  +
  +
  +
=== List your skills and give evidence of your qualifications ===
  +
  +
I am a level 2 computer engineering student and I have the necessary skills needed to work on a software project.
  +
  +
Mə̀dʉ̂mbɑ̀ is my mother tongue. I am fluent in Français and English (owing to the bilingual nature of my country and my training in a special bilingual setting). I am also a student of the Kǔm Vʉ̌ Mə̀dʉ̂mbɑ̀ (CEPOM: Comité d'Etude et de Production des Œuvres Bamiléké Mə̀dʉ̂mbɑ̀), so I have sufficient skills in the Mə̀dʉ̂mbɑ̀ language.
  +
I’ve been working on Apertium since 2016, though with some breaks due to school. In 2016 I created the Mə̀dʉ̂mbɑ̀-French pair, which I worked on during GCI 2016 (I was selected as a finalist). I was a mentor in the 2018 edition of GCI and was strongly involved in it.
  +
  +
Catalan is my mother tongue, and I’ve been studying it at the university. I'm a fluent speaker of Spanish and French. I read fluently in Italian and Portuguese, among other Romance languages, but my knowledge of them is mainly passive and linguistic. That’s why I’ll work mainly translating from Italian and Portuguese into Catalan, but my knowledge of Italian is good enough to create the first version of a translator from Catalan to Italian.
  +
  +
== List any non-Summer-of-Code plans you have for the Summer ==
  +
  +
I can guarantee at least 70 hours of work per week from the end of June onwards. As I love this kind of work, I'm sure I'll put in even more. Before then, I will be able to commit only 35 hours of work per week due to the second-semester exams.
   
May 27
 
Coding officially begins!
 
June 24 18:00 UTC
 
Mentors and students can begin submitting Phase 1 evaluations
 
June 28 18:00 UTC
 
Phase 1 Evaluation deadline
 
Work Period
 
Students work on their project with guidance from Mentors
 
July 22 18:00 UTC
 
Mentors and students can begin submitting Phase 2 evaluations
 
July 26 18:00 UTC
 
Phase 2 Evaluation deadline
 
Work Period
 
Students continue working on their project with guidance from Mentors
 
August 19 - 26 18:00 UTC
 
Final week: Students submit their final work product and their final mentor evaluation
 
   
 
[[Category:GSoC 2019 student proposals]]
 

Revision as of 11:35, 7 April 2019

Contact Information

Name: Ngadou Yopa Sylvestre Ronald

Display name: Ngadou Yopa

Location: Malingo Street, Buea, Cameroon

E-mail: yopasylvestre@gmail.com (mathalpha26@gmail.com)

IRC: math-alpha (m-alpha)

GitHub: math-alpha

Gitlab: mathematic-alpha

Telegram: @ngadou

Website: http://ngadou.me/portfolio

Time Zone: UTC +1:00 (Central Africa)

School/Degree: B.Eng. in Computer Engineering, Faculty of Engineering and Technology, Buea, Cameroon

Expected Graduation Year: December 2021

Why is it that I am interested in machine translation?

A language is a method of human communication, either spoken or written, consisting of the use of words in a structured and conventional way. For a community to integrate into this evolving world, it needs an interface to communicate with other cultures. I study computer engineering and I am highly interested in AI and mathematics. Machine translation is a branch of these sciences, hence my interest.

Why is it that I am interested in Apertium?

Apertium is an open-source, rule-based machine translation project and one of the few organizations I know of working on NLP. I highly appreciate the community and developers who are doing great work in machine translation.

  • Because Apertium is free/open-source software.
  • Because its community is strongly committed to under-resourced and minoritised/marginalised languages.
  • Because there is a lot of good work done and being done in it.
  • Because it is not only machine translation, but also free resources that can be used for other purposes: e.g. dictionaries, morphological analysers, spellcheckers, etc.

Which of the published tasks are you interested in? What do you plan to do?

Adopt an unreleased language pair: I'd like to develop the Mə̀dʉ̂mbɑ̀-Français pair, which is currently in the nursery, plus a minimal user interface.

My proposal

Title

Adopt an unreleased language pair with a minimal user interface

Major goals

  • Improving the Mə̀dʉ̂mbɑ̀-Français language pair to reach about 91% coverage of the publicly available Mə̀dʉ̂mbɑ̀ corpus
    • Mə̀dʉ̂mbɑ̀ to Français
    • Français to Mə̀dʉ̂mbɑ̀
  • Developing a minimal interface for adding words and transfer rules
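
As a rough illustration of what the "adding words" part of such an interface would have to produce, here is a minimal sketch that builds one bilingual dictionary (bidix) entry in the usual Apertium `<e><p><l>…</l><r>…</r></p></e>` shape. The function name `make_bidix_entry` and the placeholder lemma `LEMMA_BYV` are hypothetical and not taken from the existing pair; multi-word lemmas (which need `<b/>` for spaces) are not handled.

```python
import xml.etree.ElementTree as ET


def make_bidix_entry(left_lemma, right_lemma, tags):
    """Return one bidix <e> entry as an XML string.

    left_lemma / right_lemma: lemmas for the left and right language.
    tags: symbol names (e.g. ["n"]) attached to both sides as <s n="..."/>.
    This is an illustrative helper, not part of any Apertium tool.
    """
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma in (("l", left_lemma), ("r", right_lemma)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        for tag in tags:
            ET.SubElement(node, "s", {"n": tag})
    return ET.tostring(entry, encoding="unicode")


# Placeholder left-side lemma; a real entry would use an actual Mə̀dʉ̂mbɑ̀ word.
print(make_bidix_entry("LEMMA_BYV", "maison", ["n"]))
```

A user interface would wrap a call like this in a form and append the result to the pair's bidix file after validation.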

Unlike French, which differs structurally from the other Romance languages, Catalan, Italian and Portuguese show no big structural (syntactic) differences among themselves. If we improve the morphological disambiguation, add several thousand words to the dictionaries, introduce lexical selection rules and create some more transfer rules, a low WER can be reached.

During the post-application period I plan to study the apertium-ambiguous package in more detail. Nevertheless, as the syntax of the three languages is practically the same, I think it would be useful only for introducing phrases. For the same reason, I do not think that apertium-separator would help here (unless it is indeed helpful for e.g. the French-Catalan pair). My current view is that, given the limited time of GSoC, it is better to invest in expanding the dictionaries, improving morphological disambiguation (especially for Portuguese, but also for Italian) and introducing a few more structural transfer rules. In any case I am totally open to suggestions and will be happy to try new modules if my potential mentors consider it preferable.

Reasons why Google and Apertium should sponsor it

As mentioned above, the Apertium community is strongly committed to under-resourced and minoritised/marginalised languages, and Google helps in its own way via programs like GSoC and GCI. There are many local cultural movements in Africa that aim to develop their languages and open them to the world, but they rarely rest on a scientific basis. This project will mark a starting point, or proof of concept, for machine translation in Cameroon and will have a strongly positive impact on language development.

Italian to Catalan

  • bidix: 9091 pairs (excluding proper names)
  • Coverage: 81.7% (calculated using a Wikipedia corpus with 3.0 M words)
  • Word error rate (WER): 30.0% (calculated using random Wikipedia texts with a total of 933 words)
  • Word error rate (WER) using Google Translator: 14.0% (calculated using the same test text)
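
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate like the figures above can be computed: the word-level edit distance between a machine translation and a post-edited reference, divided by the number of reference words. The proposal does not say which tool produced its figures, so the function `word_error_rate` below is only an illustration under that assumption, with plain whitespace tokenisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a 4-word reference.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```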

Portuguese to Catalan

  • bidix: 7576 pairs (excluding proper names)
  • Coverage: 84.4% (calculated using a Wikipedia corpus with 3.1 M words)
  • Word error rate (WER): 28.4% (calculated using random Wikipedia texts with a total of 1648 words)
  • Word error rate (WER) using Google Translator: 21.6% (calculated using the same test text)

Catalan to Portuguese

  • bidix: 7576 pairs (excluding proper names)
  • coverage: 87.6% (calculated using a Wikipedia corpus with 2.6 million words)
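
Coverage here means the share of corpus tokens that the analyser recognises. A minimal sketch of that calculation, assuming the corpus has already been run through the morphological analyser and is in the Apertium stream format, where each token appears as `^surface/analyses$` and unknown tokens are marked with a leading `*` in the analysis. The sample stream and the function name `coverage` are illustrative, and escaped characters inside tokens are ignored for simplicity.

```python
import re

# One lexical unit in stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed: str) -> float:
    """Fraction of lexical units whose analysis is not marked unknown (*)."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        raise ValueError("no lexical units found in input")
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return known / len(units)


# Hand-written sample: two known tokens and one unknown token.
sample = "^la/le<det><def><f><sg>$ ^maison/maison<n><f><sg>$ ^xyzzy/*xyzzy$"
print(f"{coverage(sample):.1%}")
```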

apertium-por

  • There is not any released language pair working with apertium-por.
  • There is a basic morphological disambiguation using CG, but less developed than in apertium-cat, apertium-spa and apertium-fra.
  • The dictionary has mainly old-fashioned paradigms for proper names:
    • it uses "loc" instead of "top"
    • proper names do not have gender and number
    • there are a few "cog", but most of them are defined as "ant"

In my experience, bringing proper names in line with the current paradigm style significantly improves translation results between Romance languages on Wikipedia texts, where proper names are very numerous. This was an important part of the 2017 English-Catalan GSoC project, and the word lists created then by Marc Riera would greatly help the work on the Portuguese dictionary, which has only a little over 2,000 proper names.

Online translations

Some independent organisations have made efforts to develop dictionaries for Mə̀dʉ̂mbɑ̀, such as Glosbe, Babylon-Software, Resulam and a few others. The problem is that they use the "naive approach", in the sense that they neither do PoS tagging nor have transfer rules.


Workplan

{| class="wikitable"
! Week !! Dates !! Goals !! Bidix<br>(excluding<br>proper names) !! WER !! Coverage
|-
! Post-application period
| 10 April - 26 May
|
* Find more language resources (Diktionary et al.)
* Build frequency lists for Italian-Catalan
* Build frequency lists for Portuguese-Catalan
* Construct pending tests for the 4 directions
* Study in more detail Using weights for ambiguous rules
| ~9,000 (cat-ita)<br>~7,500 (cat-por)
| ~30% (cat > ita)<br>~30% (cat > por)<br>~30% (por > cat)
| ~88% (cat > ita)<br>~82% (ita > cat)<br>~88% (cat > por)<br>~84% (por > cat)
|-
! 1
| 27 May - 2 June || Improving Mə̀dʉ̂mbɑ̀ monodix<br/>Adding prn, pr, cnj*, basic adv to bidix || ~6,000 (cat-ita) || || ~85.5% (ita > cat)
|-
! 2
| 3 June - 9 June || Adding n, adj, adv to the bidix from the French dictionary || ~9,000 (cat-ita) || || ~87.5% (ita > cat)
|-
! 3
| 10 June - 16 June || Adding vblex to the bidix from the French dictionary<br/>Beginning to add missing words in decreasing order of frequency fra > byv || ~14,000 (cat-ita) || <20% (ita > cat) || ~89% (ita > cat)
|-
! 4
| 17 June - 23 June || Adding words<br/>Transfer rules fra > oci || ~15,000 (cat-ita) || || ~90% (cat > ita)<br>~90% (ita > cat)
|-
! 5
| 24 June - 30 June || Deliverable #1: Mə̀dʉ̂mbɑ̀ to French translator<br/>Adding words<br/>'''First evaluation''' (28 June) || ~16,000 (cat-ita) || || ~90.5% (cat > ita)<br>~90.5% (ita > cat)
|-
! 6
| 1 July - 7 July || || ~17,000 (cat-ita) || || ~91% (cat > ita)<br>~91% (ita > cat)
|-
! 7
| 8 July - 14 July || || ~18,000 (cat-ita) || <15% (cat > ita)<br><15% (ita > cat) || ~91.5% (cat > ita)<br>~91.5% (ita > cat)
|-
! 8
| 15 July - 21 July || || ~9,500 (cat-por) || || ~87% (por > cat)
|-
! 9
| 22 July - 28 July || '''Second evaluation''' (26 July) || ~11,500 (cat-por) || || ~89% (por > cat)
|-
! 10
| 29 July - 4 August || || ~13,000 (cat-por) || <20% (por > cat) || ~89.5% (por > cat)
|-
! 11
| 5 August - 11 August || || ~14,500 (cat-por) || || ~90% (cat > por)<br>~90% (por > cat)
|-
! 12
| 12 August - 18 August || || ~16,000 (cat-por) || || ~90.5% (cat > por)<br>~90.5% (por > cat)
|-
! 13
| 19 August - 25 August || '''Final evaluation''' (26 August) || ~17,000 (cat-por) || <15% (cat > por)<br><15% (por > cat) || ~91.0% (cat > por)<br>~91.0% (por > cat)
|}

{| class="wikitable"
! Week !! Dates !! Description !! Bidix<br/>(without np)<br/>planned !! Coverage (%)<br/>planned !! WER (%)<br/>planned !! Testvoc !! Evaluation !! Bidix<br/>actual !! Coverage (%)<br/>actual !! WER (%) !! Err. !! Done?
|-
| 0 || '''French > Occitan''' || || ~5,700 || || || || || || || || ||
|-
| 1 || 14 May - 20 May || Improving Occitan monodix<br/>Adding prn, pr, cnj*, basic adv to bidix || ~6,000 || ~84.0% || || || || 7643 || 77.1% || || || ½
|-
| 2 || 21 May - 27 May || Adding n, adj, adv to the bidix from the French dictionary || ~12,000 || ~86.0% || || || || 12811 || 82.2% || || || ½
|-
| 3 || 28 May - 3 June || Adding vblex to the bidix from the French Wiktionary<br/>Beginning to add missing words in decreasing order of frequency fra > oci || ~14,000 || ~88.0% || || || || 14452 || 85.1% || || || ½
|-
| 4 || 4 June - 10 June || Adding words<br/>Transfer rules fra > oci || ~16,000 || ~89.0% || || || || 16745 || 89.2% || || || ✓
|-
| 5 || '''11 June - 15 June<br/>Deliverable #1: French to Occitan translator''' || Adding words<br/>Transfer rules fra > oci || '''~18,000''' || '''~89.5%''' || '''~25%''' || || || 19897 || 91.1% || (WP) 15.0% || || ✓
|-
| 6 || 18 June - 24 June || Adding words<br/>Transfer rules fra > oci<br/>Begin testvoc fra > oci || ~20,000 || ~90.0% || || pr, cnj*, adv, prn, det || || 20581 || 91.5% || (WP) 12.3% || 0 || ✓
|-
| 7 || 25 June - 1 July || Adding words<br/>Transfer rules fra > oci<br/>Testvoc fra > oci || ~21,000 || ~90.5% || || vblex || || 21823 || 91.8% || (Euro-News) 18.0% || 0 || ✓
|-
| 8 || 2 July - 8 July || Adding words<br/>Transfer rules fra > oci<br/>Testvoc fra > oci || ~22,000 || ~91.0% || || adj || || 22609 || 91.9% || || 0 || ✓
|-
| 9 || '''9 July - 13 July<br/>Deliverable #2: French to Occitan translator''' || Transfer rules fra > oci<br/>Testvoc fra > oci || '''~22,000''' || '''~91.0%''' || '''~15%''' || n || || 25045 || 92.1% || (WP) 7.2% || 0 || ✓
|-
| 0 || '''Occitan > French''' || || ~22,000 || || || || || || || || ||
|-
| 10 || 16 July - 22 July || Adding missing words in decreasing order of frequency oci > fra<br/>Transfer rules oci > fra<br/>Testvoc oci > fra || ~22,500 || ~88.0% || || pr, cnj*, adv, prn, det || || 25161 || 91.7% || fra>oci (Euro-News) 6.6% || 10 || ✓
|-
| 13 || 23 July - 29 July || Adding words<br/>Transfer rules oci > fra<br/>Testvoc oci > fra || ~23,000 || ~89.0% || || n, adj || || 25504 || 92.1% || || 1 || ✓
|-
| 11 || 30 July - 5 August || Adding words<br/>Transfer rules oci > fra<br/>Testvoc oci > fra || ~23,500 || ~90.0% || || vblex || || 26908 || oci>fra 92.9%<br/>fra>oci 92.3% || fra>oci (WP) 10.0% || 0 || ½
|-
| 12* || 6 August - 9 August || Final improvements || || || || || || || || || ||
|-
| 12** || '''10 August - 14 August<br/>Deliverable #3: Occitan to French translator''' || Final evaluation || '''~23,500''' || '''~90.0%''' || '''~30%''' || || || || || || ||
|}
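
The workplan repeatedly calls for adding missing words "in decreasing order of frequency". A minimal sketch of how such a list could be built, assuming an analysed corpus in Apertium stream format in which unknown tokens look like `^word/*word$`. The function name `unknown_word_frequencies` and the sample stream are illustrative only.

```python
import re
from collections import Counter

# Matches only lexical units whose analysis is marked unknown with '*'.
UNKNOWN_UNIT = re.compile(r"\^([^/$]*)/\*[^$]*\$")


def unknown_word_frequencies(analysed: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs for unknown tokens, most frequent first."""
    counts = Counter(word.lower() for word in UNKNOWN_UNIT.findall(analysed))
    return counts.most_common()


# Hand-written sample: "foo" is unknown twice, "bar" once, "le" is known.
sample = "^foo/*foo$ ^bar/*bar$ ^Foo/*Foo$ ^le/le<det>$"
for word, count in unknown_word_frequencies(sample):
    print(count, word)
```

Working down such a list first means each new dictionary entry buys the largest possible gain in coverage.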


List your skills and give evidence of your qualifications

I am a level 2 computer engineering student and I have the necessary skills needed to work on a software project.

Mə̀dʉ̂mbɑ̀ is my mother tongue. I am fluent in Français and English (owing to the bilingual nature of my country and my training in a special bilingual setting). I am also a student of the Kǔm Vʉ̌ Mə̀dʉ̂mbɑ̀ (CEPOM: Comité d'Etude et de Production des Œuvres Bamiléké Mə̀dʉ̂mbɑ̀), so I have sufficient skills in the Mə̀dʉ̂mbɑ̀ language. I’ve been working on Apertium since 2016, though with some breaks due to school. In 2016 I created the Mə̀dʉ̂mbɑ̀-French pair, which I worked on during GCI 2016 (I was selected as a finalist). I was a mentor in the 2018 edition of GCI and was strongly involved in it.

Catalan is my mother tongue, and I’ve been studying it at the university. I'm a fluent speaker of Spanish and French. I read fluently in Italian and Portuguese, among other Romance languages, but my knowledge of them is mainly passive and linguistic. That’s why I’ll work mainly translating from Italian and Portuguese into Catalan, but my knowledge of Italian is good enough to create the first version of a translator from Catalan to Italian.

List any non-Summer-of-Code plans you have for the Summer

I can guarantee at least 70 hours of work per week from the end of June onwards. As I love this kind of work, I'm sure I'll put in even more. Before then, I will be able to commit only 35 hours of work per week due to the second-semester exams.