Hectoralos/GSOC 2019 proposal: Catalan-Italian and Catalan-Portuguese


Contact Information

Name: Hèctor Alòs i Font

Location: Shupashkar, Chuvashia, Russia

E-mail: hectoralos@gmail.com

IRC: hector2

GitHub: hectoralos

Telegram: hectoralos

Skype: hectoralos

Why is it you are interested in machine translation?

I’m a sociolinguist working on language maintenance and shift. I'm very interested in creating resources for minoritised languages.

Why is it that you are interested in Apertium?

  • Because Apertium is free/open-source software.
  • Because its community is strongly committed to under-resourced and minoritised/marginalised languages.
  • Because a lot of good work has been done, and is being done, in it.
  • Because it is not only machine translation, but also free resources that can be used for other purposes: e.g. dictionaries, morphological analysers, spellcheckers, etc.

Which of the published tasks are you interested in? What do you plan to do?

1.2 Bring a released language pair up to state-of-the-art quality: I'd like to improve the pairs Catalan-Italian and Catalan-Portuguese. In the first case, only the Italian to Catalan direction is available. I plan to create the other direction as part of the project.

My proposal

Title

Improving the Catalan-Italian and Catalan-Portuguese language pairs

Major goals

  • Improving the existing translators, obtaining a WER below 15% and a Wikipedia coverage above 91%:
    • Italian to Catalan
    • Portuguese to Catalan
    • Catalan to Portuguese
  • Developing a translator from Catalan to Italian with a WER below 15% and a Wikipedia coverage above 91%.

Unlike French and Romanian, which differ more markedly from the other Romance languages, Catalan, Italian and Portuguese show no major structural (syntactic) differences. If we improve the morphological disambiguation, add several thousand words to the dictionaries, introduce lexical selection rules and create some more transfer rules, a low WER can be reached.

During the post-application period I plan to study the apertium-ambiguous package in more detail. Nevertheless, as the syntax of the three languages is practically the same, I think it would be useful only for introducing phrases. For the same reason, I do not think that apertium-separator would significantly help here (although it is indeed helpful for e.g. the French-Catalan pair). My current view is that, given the limited time of GSoC, it is better to invest in expanding the dictionaries, improving morphological disambiguation (especially for Portuguese, but also for Italian) and introducing a few more structural transfer rules. In any case, I am totally open to suggestions and will be happy to try new modules if my potential mentors consider it preferable.

Reasons why Google and Apertium should sponsor it

The Italian-Catalan and Portuguese-Catalan pairs give unsatisfactory results (WER ≈ 30%, coverage below 85% on Wikipedia corpora). Both were published in 2009 and, apparently, no one has worked on them since (except for porting ca-it to cat-ita and pt-ca to por-cat). In addition, translation from Catalan to Italian has not been developed.

Current situation of the language pairs

Below are some data on the three existing directions in their current state of development on GitHub. Note that the versions on the web are still those of the apertium-ca-it and apertium-pt-ca pairs, while I have made the calculations from apertium-cat-ita and apertium-por-cat.

Italian to Catalan

  • bidix: 9091 pairs (excluding proper names)
  • Coverage: 81.7% (calculated using a Wikipedia corpus with 3.0 M words)
  • Word error rate (WER): 30.0% (calculated using random Wikipedia texts with a total of 933 words)
  • Word error rate (WER) using Google Translator: 14.0% (calculated using the same test text)
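For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate of this kind can be computed: the word-level edit distance between the machine translation and its post-edited version, divided by the length of the post-edited text. The function name and the whitespace tokenisation are assumptions made for illustration; this is not the tool used to obtain the figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    `reference` is the post-edited translation, `hypothesis` the raw
    machine translation. Tokenisation is plain whitespace splitting,
    which is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because punctuation and apostrophes stay attached to the words with this tokenisation, a dedicated evaluation tool may count errors slightly differently.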

Portuguese to Catalan

  • bidix: 7576 pairs (excluding proper names)
  • Coverage: 84.4% (calculated using a Wikipedia corpus with 3.1 M words)
  • Word error rate (WER): 28.4% (calculated using random Wikipedia texts with a total of 1648 words)
  • Word error rate (WER) using Google Translator: 21.6% (calculated using the same test text)

Catalan to Portuguese

  • bidix: 7576 pairs (excluding proper names)
  • coverage: 87.6% (calculated using a Wikipedia corpus with 2.6 million words)

apertium-por

  • No released language pair works with apertium-por.
  • Morphological disambiguation with CG exists but is basic, and much less developed than in apertium-cat, apertium-spa and apertium-fra.
  • The dictionary has mainly old-fashioned paradigms for proper names:
    • it uses "loc" instead of "top"
    • proper names do not have gender and number
    • there are a few "cog", but most of them are defined as "ant"

In my experience, bringing proper names in line with the current Apertium paradigm style significantly improves translation between Romance languages on Wikipedia texts, where proper names are very numerous. This was an important part of the 2017 English-Catalan GSoC project, and the word lists created then by Marc Riera would greatly help the work on the Portuguese dictionary, which has only a little over 2,000 proper names.

How about Brazilian vs. Iberian portuguese? At the moment I'm not sure if it's supported by apertium-por - Francis Tyers (talk) 11:51, 9 April 2019 (CEST)

apertium-por has some support for both Brazilian and Iberian Portuguese, e.g. "amazónico" vs. "amazônico", but the generation of one or the other is currently available only for es-pt (not for ca-pt or gl-pt). I don't see any big difficulty to add the generation of both varieties of Portuguese in cat-por. Eventually, we could do the same for the 6 (!) varieties of Catalan, although it probably makes less sense --Hèctor Alòs i Font (talk) 12:21, 9 April 2019 (CEST)

Google translation

Google and Yandex offer all three languages in question, so one can translate between them using these services. I have analysed Google's translations into Catalan (in my experience, Yandex gives worse results for Catalan). The numerical results can be seen in the previous sections (ita-cat: 14.0% WER, por-cat: 21.6% WER).

The results are far from wonderful, especially for the Portuguese-Catalan pair. Google seemingly translates using both Spanish and English as bridge languages, as can be seen, for example, from words in these two languages that appear in the final text (supposedly in Catalan) and were not in the original Italian or Portuguese text. The use of English as an intermediate language between Romance languages causes problems known to all users, such as the translation of p2.pl verb forms with elided subject as p2.sg, the incorrect choice of past tenses and the disappearance of some pronouns. Here is an example of the last case, from the (randomly obtained) Italian test text:

Original text (bold mine):

altri invece ne hanno apprezzato la spontaneità, la tenacia e l'affettuosità

Google translation:

altres han apreciat la seva espontaneïtat, tenacitat i afecte

Post-edited translation:

altres n'han apreciat l'espontaneïtat, tenacitat i afecte

It should be added that, although Google translations tend to be more idiomatic than those obtained by rules, they are also much more difficult to post-edit. The reason is that, while rule-based translation often makes evident and even expected errors, neural translation significantly changes the text, reordering parts of the sentence, removing or adding words, changing singular to plural or plural to singular (!), and modifying expressions. Evaluating whether the meaning is the same as in the original requires much more time. This became quite clear when I post-edited both the Apertium and the Google translations of the Italian and Portuguese texts.

Workplan

| Week | Dates | Goals | Bidix (excluding proper names) | WER | Coverage |
|---|---|---|---|---|---|
| Post-application period | 10 March - 26 May | Find language resources (Wiktionary et al.); build frequency lists for Italian-Catalan; build frequency lists for Portuguese-Catalan; construct pending tests for the 4 directions; study "Using weights for ambiguous rules" in more detail | Current situation: ~9,000 (cat-ita), ~7,500 (cat-por) | Current situation: ~30% (cat > ita), ~30% (cat > por), ~30% (por > cat) | Current situation: ~88% (cat > ita), ~82% (ita > cat), ~88% (cat > por), ~84% (por > cat) |
| 1 | 27 May - 2 June | ita > cat: expand bilingual dictionary cat-ita; transfer and lexical selection rules (ita > cat) | ~11,000 (cat-ita) | | ~85.5% (ita > cat) |
| 2 | 3 June - 9 June | Expand bilingual dictionary cat-ita; transfer and lexical selection rules (ita > cat) | ~13,000 (cat-ita) | | ~87.5% (ita > cat) |
| 3 | 10 June - 16 June | Expand bilingual dictionary cat-ita; transfer and lexical selection rules (ita > cat) | ~14,000 (cat-ita) | <20% (ita > cat) | ~89% (ita > cat) |
| 4 | 17 June - 23 June | cat > ita: expand bilingual dictionary cat-ita; transfer and lexical selection rules (cat > ita); testvoc cat-ita, ita-cat: closed categories | ~15,000 (cat-ita) | | ~90% (cat > ita), ~90% (ita > cat) |
| 5 | 24 June - 30 June | Expand bilingual dictionary cat-ita; transfer and lexical selection rules (cat > ita); testvoc cat-ita, ita-cat: vblex; First evaluation (28 June) | ~16,000 (cat-ita) | | ~90.5% (cat > ita), ~90.5% (ita > cat) |
| 6 | 1 July - 7 July | Expand bilingual dictionary cat-ita; transfer and lexical selection rules (cat > ita); testvoc cat-ita, ita-cat: adj, adv, np | ~17,000 (cat-ita) | | ~91% (cat > ita), ~91% (ita > cat) |
| 7 | 8 July - 14 July | Transfer and lexical selection rules (cat > ita); testvoc cat-ita, ita-cat: n; write documentation | ~18,000 (cat-ita) | <15% (cat > ita), <15% (ita > cat) | ~91.5% (cat > ita), ~91.5% (ita > cat) |
| 8 | 15 July - 21 July | por > cat: expand bilingual dictionary cat-por; disambiguation rules (por > cat); work on Portuguese proper names | ~9,500 (cat-por) | | ~87% (por > cat) |
| 9 | 22 July - 28 July | Expand bilingual dictionary; disambiguation rules (por > cat); work on Portuguese proper names; transfer and lexical selection rules (por > cat); testvoc cat-por, por-cat: np; Second evaluation (26 July) | ~11,500 (cat-por) | | ~89% (por > cat) |
| 10 | 29 July - 4 August | Expand bilingual dictionary; disambiguation rules (por > cat); transfer and lexical selection rules (por > cat) | ~13,000 (cat-por) | <20% (por > cat) | ~89.5% (por > cat) |
| 11 | 5 August - 11 August | cat > por: expand bilingual dictionary; transfer and lexical selection rules (cat > por); testvoc cat-por, por-cat: closed categories, vblex | ~14,500 (cat-por) | | ~90% (cat > por), ~90% (por > cat) |
| 12 | 12 August - 18 August | Expand bilingual dictionary; transfer and lexical selection rules (cat > por); testvoc cat-por, por-cat: adj, adv | ~16,000 (cat-por) | | ~90.5% (cat > por), ~90.5% (por > cat) |
| 13 | 19 August - 25 August | Expand bilingual dictionary; transfer and lexical selection rules (cat > por); testvoc cat-por, por-cat: n; Final evaluation (26 August) | ~17,000 (cat-por) | <15% (cat > por), <15% (por > cat) | ~91.0% (cat > por), ~91.0% (por > cat) |
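The frequency lists mentioned for the post-application period could be built along the following lines. This is only a sketch under stated assumptions: the function name, the regular expression and the idea of passing the known vocabulary as a plain set are all hypothetical; in practice the known words would come from the existing dictionaries.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by a middle dot (Catalan "l·l") or a hyphen.
WORD = re.compile(r"[^\W\d_]+(?:[·-][^\W\d_]+)*")


def unknown_word_frequencies(corpus_text, known_words, top_n=10):
    """Return the most frequent corpus words missing from `known_words`.

    `known_words` is assumed to be a set of lower-cased word forms.
    """
    counts = Counter(word.lower() for word in WORD.findall(corpus_text))
    missing = [(word, n) for word, n in counts.most_common()
               if word not in known_words]
    return missing[:top_n]
```

Sorting the missing words by frequency makes it possible to add first the entries that raise coverage the most, which matches the weekly coverage targets in the table.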

List your skills and give evidence of your qualifications

I earned a degree in computer engineering (Universitat Politècnica de Catalunya, 1988), but I’ve forgotten almost everything about programming (except writing short Perl scripts). I also earned a BA in linguistics (Universitat de Barcelona, 2008). I’ve been working on Apertium for several years. In 2011 I created the Esperanto-French pair and prepared new releases of the Esperanto-Catalan and Esperanto-Spanish pairs. I’ve also been working on new releases of the French-Catalan pair (2017, 2018 and 2019). I mentored and was strongly involved in the GSoC projects on Italian-Sardinian (2016), Catalan-Sardinian (2017) and French-Occitan (2018). In all these cases we released new one-way language pairs just after the end of the GSoC.

Catalan is my mother tongue, and I have studied it at university. I'm a fluent speaker of Spanish and French. I read Italian and Portuguese fluently, among other Romance languages, but my knowledge of them is mainly passive and linguistic. That’s why I’ll work mainly on translating from Italian and Portuguese into Catalan, but my knowledge of Italian is good enough to create a first version of a translator from Catalan to Italian.

List any non-Summer-of-Code plans you have for the Summer

I can guarantee at least 30 hours of work per week during the whole summer. As I love this kind of work, I'm sure I'll put in considerably more.

I have to submit the last two short assignments for two subjects of the master's degree I am studying for on June 5 and 11. I may travel on vacation with my family for one or two weeks in July. In the unlikely event that I do not reach 30 hours of work in a given week, I will make up those hours (I have had this workload in other years and there have been no problems).

Coding challenge

I have been working on a short text from the Italian Wikipedia (the beginning of the article on Rome). The initial translation can be found here, while the current one is here.

I have also been working on a short text in the other direction. The source text in Catalan, from the Catalan Wikipedia, can be found here and the original Wikipedia article here. The current translation is here. (No initial translation is given, since little work had been done in this direction apart from the bilingual dictionary.)