Talk:Google Summer of Code/Application 2019


Discussion: Apertium strategy and the GSoC

--Mlforcada (talk) 11:10, 26 January 2019 (CET), 21/01/2019, to be discussed

Does Apertium as a project need more direction? Where would Apertium stand on the cathedral–bazaar spectrum? Most work in Apertium is bazaar-like (people work on the languages and platform features they want), but we have some cathedral-style elements, such as the elected PMC, and we vote on GSoC projects, etc. But ---and this is mainly my fault as long-standing president--- we have not devoted enough effort to planning. One could say that Apertium basically flies on autopilot and that, every now and then, some Apertiumers correct the course. I think we need to decide and clarify how to implement our mission (see below), and we need to give ourselves a method to do so.

Desired impact, intended mission. We need to collectively decide what is important and concentrate our efforts there. Our declared mission is laid out in the bylaws as follows:

The mission of the Apertium project is to collaboratively develop free/open-source machine translation for as many languages as possible, and in particular:

  • To give everyone free, unlimited access to the best possible machine-translation technologies.
  • To maintain a modular, documented, open platform for machine translation and other human language processing tasks.
  • To favour the interchange and reuse of existing linguistic data.
  • To make integration with other free/open-source technologies easier.
  • To radically guarantee the reproducibility of machine translation and natural language processing research.

We have some interesting data collected in our wiki, but I think we need to evaluate impact better. We would need a more complete survey about where Apertium is used (in research, commercially, etc.). I know GSoC does not pay for this, but we do have some money in our kitty (€35,000 or more).

Languages. I believe we need to be more specific, particularly about languages. Our mission does not say much about where to concentrate our effort. Let me give some ideas about different kinds of languages (the list is not exhaustive).

  1. Languages that do not have (other) machine translation, free/open-source or otherwise, and that are unlikely to be tackled with the new neural approach, as the available parallel corpora are too small. Nice examples are Occitan, Sardinian and Breton. We do have some Apertium support for these languages, but their situations differ.
    1. Occitan is now much better connected to French, Catalan, and Spanish, its big sisters in the area, but improving this translator would make it more useful. Because these languages are closely related, post-editable output may be obtained if time is invested in them. The case of Sardinian is similar.
    2. Breton→French is a much harder language pair, but it could be used for gisting (for French speakers who do not understand Breton). How its usefulness evolves during development should be evaluated.
  2. Languages that do have commercial machine translation available but are important enough to be tackled by Apertium. I am thinking of languages such as Swahili, Hausa, Gujarati, Igbo, or Yoruba (the Universitat d'Alacant is currently involved in a project called GoURMET that will build MT for some of these languages; neural MT will be hard, as corpora are small, so there could be some interesting synergies). These languages have rather large speaker communities and are served by Google and other MT providers, but they do not have free/open-source MT. Apertium's mission commits it to "give everyone free, unlimited access to the best possible machine-translation technologies". When the only MT available for a language is a set of neural or statistical black boxes trained on corpora that are not publicly available, its language community is disempowered and vendor-locked. We need to generate free/open-source technology for these languages, perhaps for gisting purposes.
  3. Languages that may be well supported by more than one MT vendor and have good or very good Apertium engines: Catalan–Spanish, Portuguese–Spanish, French–Spanish, etc. Their impact is large because they are widely used. Improving their quality may increase their impact and give Apertium some prestige.

Engine: I will not be too specific here, as I am not that familiar with this area and may be wrong. For instance, one could:

  1. improve the Java engine and, consequently, the Android app (remember that it can be used offline, as with the Kurdish app developed by Translators without Borders) and the OmegaT plugin
  2. improve the CG3 processor (it is too slow now, I gather, and is heavily used in many languages).

Evaluation: we have some automatic evaluation (word-error rate, naïve coverage) in our wiki, but we should select some language pairs for thorough evaluation. Gap-filling (see this recent paper and references therein) could be used when we expect a language pair to be used for gisting. For language pairs which may be used for dissemination, we could pay professional post-editors; the Universitat d'Alacant has experience with this (we had four people post-editing for a week, and we spent about €3000), and tools like PET could be used. But there are also some unexplored usages of Apertium, such as interactive machine translation (see also this paper). Things we could spend our money on:

  1. Gisting evaluation, via gap-filling of selected languages. The technique is mature enough to detect usefulness.
  2. Post-editing evaluation of selected languages, with comparative evaluation where other MT systems are available.
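The automatic metrics mentioned above (word-error rate and naïve coverage) are simple enough to sketch. The following is a minimal illustration only, not Apertium's actual tooling (apertium-eval-translator and the morphological analysers do this properly); the `known` lexicon here is a hypothetical stand-in for a real analyser:

```python
def wer(reference, hypothesis):
    """Word-error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def naive_coverage(tokens, known):
    """Fraction of tokens the analyser recognises; `known` is a toy
    stand-in (an assumption) for a real morphological lexicon."""
    return sum(1 for t in tokens if t.lower() in known) / len(tokens)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # ≈ 0.167
```

Note that WER measures closeness to a single reference translation, which is why it suits post-editing scenarios, while naïve coverage only tells us how much of a text the analyser can handle at all.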

GSoC proposals. Very specific proposals that should be part of GSoC 2019 if we can find interested people ("Mikel's list"), to be fleshed out.

  1. Improve the quality of Occitan pairs. Contact Occitanists so that they get involved and spread the word.
  2. Resume work on Breton.
  3. Start languages in the Google-served, FOSS-orphan set (e.g. Swahili, Hausa, Igbo, Yoruba, or Gujarati).
  4. Update the functionality of the Java engine: port a CG3 engine, and port (or adapt an existing) HFST engine.