Hectoralos/GSOC 2019 proposal: Catalan-Italian and Catalan-Portuguese

From Apertium

Contact Information

Name: Hèctor Alòs i Font

Location: Shupashkar, Chuvashia, Russia

E-mail: hectoralos@gmail.com

IRC: hector2

GitHub: hectoralos

Telegram: hectoralos

Skype: hectoralos

Why is it you are interested in machine translation?

I’m a sociolinguist. I'm very interested in creating resources for minoritised languages.

Why is it that you are interested in Apertium?

  • Because Apertium is free/open-source software.
  • Because its community is strongly committed to under-resourced and minoritised/marginalised languages.
  • Because a lot of good work has been done, and is being done, in it.
  • Because it is not only machine translation, but also a set of free resources that can be used for other purposes: e.g. dictionaries, morphological analysers, spellcheckers, etc.

Which of the published tasks are you interested in? What do you plan to do?

1.2 Bring a released language pair up to state-of-the-art quality: I'd like to improve the Catalan-Italian and Catalan-Portuguese pairs. For Catalan-Italian, only the Italian to Catalan direction is available. I plan to create the opposite direction, Catalan to Italian.

My proposal

Title

Improving the Catalan-Italian and Catalan-Portuguese language pairs

Major goals

  • Improving existing translators, obtaining a WER below 15% and a Wikipedia coverage above 91%:
    • Italian to Catalan
    • Portuguese to Catalan
    • Catalan to Portuguese
  • Developing a translator from Catalan to Italian with a WER below 15% and a Wikipedia coverage above 91%

Reasons why Google and Apertium should sponsor it

The Italian-Catalan and Portuguese-Catalan pairs give unsatisfactory results (WER ≈ 30%). Both were published in 2009 and, apparently, no one has worked on them since then. In addition, the Catalan to Italian direction has not been developed. Below are some data on the three existing directions:

Italian to Catalan

  • bidix: 9091 pairs (without proper names)
  • Word error rate (WER): 29.6% (calculated from random Wikipedia texts with a total of 933 words)

Test file: 'it_wikipedia_ca_ini.txt'
Reference file 'it_wikipedia_ca_fin.txt'

Statistics about input files



Number of words in reference: 1009
Number of words in test: 998
Number of unknown words (marked with a star) in test: 176
Percentage of unknown words: 17.64 %

Results when removing unknown-word marks (stars)



Edit distance: 299
Word error rate (WER): 29.63 %
Number of position-independent correct words: 775
Position-independent word error rate (PER): 23.19 %

Portuguese to Catalan

  • bidix: 7576 pairs (without proper names)
  • Word error rate (WER): 26.9% (calculated from random Wikipedia texts with a total of 1648 words)

Test file: 'pt_wikipedia_ca_ini.txt'
Reference file 'pt_wikipedia_ca_fin.txt'

Statistics about input files



Number of words in reference: 1780
Number of words in test: 1780
Number of unknown words (marked with a star) in test: 294
Percentage of unknown words: 16.52 %

Results when removing unknown-word marks (stars)



Edit distance: 479
Word error rate (WER): 26.91 %
Number of position-independent correct words: 1427
Position-independent word error rate (PER): 19.83 %
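
The WER and PER figures reported above follow from the listed counts. The sketch below shows one way to compute them, assuming the usual definitions: WER is the word-level edit distance divided by the number of reference words, and PER counts the reference words that have no position-independent match in the test. The function names and the toy sentences are illustrative assumptions, and this is not the evaluation script that produced the output above.

```python
from collections import Counter


def word_edit_distance(reference, test):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(test) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, test_word in enumerate(test, start=1):
            cost = 0 if ref_word == test_word else 1
            curr.append(min(prev[j] + 1,          # word missing from test
                            curr[j - 1] + 1,      # extra word in test
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def rate(errors, reference_words):
    """Error count as a percentage of the reference length."""
    return 100.0 * errors / reference_words


def wer(reference, test):
    """Word error rate in percent."""
    return rate(word_edit_distance(reference, test), len(reference))


def per(reference, test):
    """Position-independent word error rate in percent."""
    correct = sum((Counter(reference) & Counter(test)).values())
    errors = len(reference) - correct + max(0, len(test) - len(reference))
    return rate(errors, len(reference))


# Toy example (made-up sentences): one substitution and one missing word.
reference = "a b c d".split()
test = "a x c".split()
print(wer(reference, test), per(reference, test))

# Counts taken from the Italian to Catalan statistics above.
print(round(rate(299, 1009), 2))         # WER from the edit distance
print(round(rate(1009 - 775, 1009), 2))  # PER from the correct-word count
```

With the counts reported above, `rate(299, 1009)` and `rate(479, 1780)` give the listed WER values, and the PER values follow in the same way from the numbers of position-independent correct words.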

Catalan to Portuguese

  • bidix: 7576 pairs (without proper names)
  • coverage: 87.6% (calculated using Wikipedia corpus with 2.6 million words)
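
Coverage here means the share of corpus tokens that the analyser recognises. A minimal sketch of that calculation follows, assuming the corpus has already been run through a morphological analyser and is in the Apertium stream format, where each lexical unit is written `^surface/analysis$` and an unknown word carries an asterisk (`^word/*word$`). The function name and the sample line are hypothetical, and escaped `^` or `$` characters in the text are not handled.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis...$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def coverage(analysed_text):
    """Percentage of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        raise ValueError("no lexical units found")
    # Unknown words are marked with an asterisk right after the slash.
    unknown = sum(1 for unit in units if "/*" in unit)
    return 100.0 * (len(units) - unknown) / len(units)


# Made-up sample: two known words and one unknown word.
sample = "^el/el<det><def><m><sg>$ ^gat/gat<n><m><sg>$ ^xyzzy/*xyzzy$"
print(round(coverage(sample), 2))
```

The same count of starred words underlies the "Percentage of unknown words" lines in the statistics above, for example 176 unknown words out of 998 in the Italian to Catalan test.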

Workplan

| Week | Dates | Goals | Bidix (excl. PN) | WER | Coverage |
|---|---|---|---|---|---|
| Post-application period | 10 March - 26 May | Find language resources (Wiktionary et al.); Build frequency lists for Italian-Catalan; Build frequency lists for Portuguese-Catalan; Construct pending tests for the 4 directions | ~9,000 (cat-ita); ~7,500 (cat-por) | ~30% (cat > ita); ~30% (cat > por); ~30% (por > cat) | ~89% (cat > ita); ~88% (cat > por); ~88% (por > cat) |
| 1 | 27 May - 2 June | ita > cat; Expand bilingual dictionary cat-ita; Transfer rules (ita > cat) | ~11,000 (cat-ita) | | ~90% (ita > cat) |
| 2 | 3 June - 9 June | Expand bilingual dictionary cat-ita; Transfer rules (ita > cat); Testvoc: adj | ~13,000 (cat-ita) | | ~90.5% (ita > cat) |
| 3 | 10 June - 16 June | Expand bilingual dictionary cat-ita; Transfer rules (ita > cat) | ~14,000 (cat-ita) | <20% (ita > cat) | ~91% (ita > cat) |
| 4 | 17 June - 23 June | cat > ita; Expand bilingual dictionary cat-ita; Transfer rules (cat > ita); Testvoc cat-ita: closed categories | ~15,000 (cat-ita) | | ~90.0% (cat > ita); ~91.5% (ita > cat) |
| 5 | 24 June - 30 June | Expand bilingual dictionary cat-ita; Transfer rules (cat > ita); Testvoc cat-ita: vblex; First evaluation (28 June) | ~16,000 (cat-ita) | | ~90.5% (cat > ita); ~91.8% (ita > cat) |
| 6 | 1 July - 7 July | Expand bilingual dictionary cat-ita; Transfer rules (cat > ita); Testvoc cat-ita: adj, adv, np | ~17,000 (cat-ita) | | ~91.0% (cat > ita); ~92.0% (ita > cat) |
| 7 | 8 July - 14 July | Transfer rules (cat > ita); Testvoc cat-ita: n; Write documentation | ~18,000 (cat-ita) | <15% (cat > ita); <15% (ita > cat) | ~91.5% (cat > ita); ~92.0% (ita > cat) |
| 8 | 15 July - 21 July | por > cat; Expand bilingual dictionary cat-por; Disambiguation rules (por > cat); Transfer rules (por > cat) | ~9,500 (cat-por) | | ~88% (por > cat) |
| 9 | 22 July - 28 July | Expand bilingual dictionary; Disambiguation rules (por > cat); Transfer rules (por > cat); Second evaluation (26 July) | ~11,500 (cat-por) | | ~89% (por > cat) |
| 10 | 29 July - 4 August | Expand bilingual dictionary; Disambiguation rules (por > cat); Transfer rules (por > cat) | ~13,000 (cat-por) | <20% (por > cat) | ~90% (por > cat) |
| 11 | 5 August - 11 August | cat > por; Expand bilingual dictionary; Transfer rules (cat > por); Testvoc cat-por: closed categories, vblex | ~14,500 (cat-por) | | ~90.5% (por > cat); ~90.0% (por > cat) |
| 12 | 12 August - 18 August | Expand bilingual dictionary; Transfer rules (cat > por); Testvoc cat-por: adj, adv, np | ~16,000 (cat-por) | | ~90.8% (por > cat); ~90.5% (por > cat) |
| 13 | 19 August - 25 August | Expand bilingual dictionary; Transfer rules (cat > por); Testvoc cat-por: n; Final evaluation (26 August) | ~17,000 (cat-por) | <15% (cat > por); <15% (por > cat) | ~91.0% (por > cat); ~91.0% (por > cat) |

List your skills and give evidence of your qualifications

I earned a degree in computer engineering (Universitat Politècnica de Catalunya, 1988), but I’ve forgotten almost everything about programming. I also hold a BA in linguistics (Universitat de Barcelona, 2008). I’ve been working on Apertium for several years. In 2011 I created the Esperanto-French pair and prepared new releases of the Esperanto-Catalan and Esperanto-Spanish pairs. I’ve also been working on new releases of the French-Catalan pair (2017, 2018, and a third one is currently being packaged). I’ve mentored the GSoC projects on Italian-Sardinian (2016), Catalan-Sardinian (2017) and French-Occitan (2018). In all cases we released new language pairs just after the end of the GSoC.

Catalan is my mother tongue, and I have studied it at university. I'm a fluent speaker of Spanish and French. I read Italian and Portuguese fluently, but my knowledge of them is passive and linguistic. That’s why I’ll work mainly on translating from Italian and Portuguese into Catalan, but my knowledge of Italian is good enough to create a first version of a translator from Catalan to Italian.

List any non-Summer-of-Code plans you have for the Summer

I can guarantee at least 30 hours per week of work during the whole Summer.

I have to submit the last two short papers for two subjects of the master's degree I am studying for, on 5 and 11 June. I may travel on vacation with my family for one or two weeks in July. In the unlikely event that I do not reach 30 hours of work in a given week, I would make up those hours (I have had this workload in other years and there have been no problems).

Coding Challenge