Talk:Google Summer of Code/Application 2019
Discussion: Apertium strategy and the GSoC
--Mlforcada (talk) 11:10, 26 January 2019 (CET), 21/01/2019, to be discussed
Does Apertium as a project need more direction? Where would Apertium stand in the cathedral–bazaar line? Most work in Apertium is bazaar-like (people work on the languages and platform features they want) but we have some cathedral-style action such as this elected PMC. We also vote on GSoC projects, etc. But ---and this is mainly my fault as long-standing president--- we have not devoted enough effort to planning. One could say that Apertium basically flies on autopilot, and that, every now and then, some Apertiumers correct the course. I think we need to decide and clarify how to implement our mission (see below). And we need to give ourselves a method to do so.
Desired impact, intended mission. We need to collectively decide what is important and concentrate our efforts there. Our declared mission is laid out in the bylaws as follows:
The mission of the Apertium project is to collaboratively develop free/open-source machine translation for as many languages as possible, and in particular:
- To give everyone free, unlimited access to the best possible machine-translation technologies.
- To maintain a modular, documented, open platform for machine translation and other human language processing tasks.
- To favour the interchange and reuse of existing linguistic data.
- To make integration with other free/open-source technologies easier.
- To radically guarantee the reproducibility of machine translation and natural language processing research.
We have some interesting data collected in our wiki, but I think we need to evaluate impact better. We would need a more complete survey about where Apertium is used (in research, commercially, etc.). I know GSoC does not pay for this, but we do have some money in our kitty (€35,000 or more).
Languages. I believe we need to be more specific, particularly about languages. Our mission does not say much about where to concentrate our effort. Let me give some ideas about different kinds of languages (the list is not exhaustive).
- Languages that do not have (other) machine translation, free/open-source or otherwise, and for which there is little chance that they will be tackled with the new neural approach, as the available parallel corpora are too small. Nice examples are Occitan, Sardinian or Breton. We do have some Apertium support for these languages, but their situations differ.
- Occitan is now much better connected to French, Catalan, and Spanish, its big sisters in the area, but improving this translator would make it more useful. Post-editable output may be obtained if time is invested in these languages, as they are closely related. The case of Sardinian is similar.
- ftyers: Agree
- Breton→French is a much harder language pair, but it could be used for gisting (for French speakers who do not understand Breton). The evolution of its usefulness during development should be evaluated.
- ftyers: The gisting case is also useful yes.
- Languages that do have commercial machine translation available but are important enough to be tackled by Apertium. I am thinking about languages such as Swahili, Hausa, Gujarati, Igbo, or Yoruba (the Universitat d'Alacant is currently involved in a project called GoURMeT that will build MT for some of these languages; neural MT will be hard, as the corpora are small, so there could be some interesting synergies). These languages are rather large and served by Google and other MT providers, but do not have free/open-source MT. Apertium, as a mission, has to "give everyone free, unlimited access to the best possible machine-translation technologies". When the only MT systems available for a language are neural or statistical black boxes trained on corpora which are not publicly available, their language communities are disempowered and vendor-locked. We need to generate free/open-source technology for them, perhaps for gisting purposes.
- ftyers: Tentatively agree, but we should do a careful study of where the best impact can be found before doing e.g. Swahili–English.
- Languages that may be well supported by more than one MT vendor and have good or very good Apertium engines: Catalan–Spanish, Portuguese–Spanish, French–Spanish, etc. Their impact is large because they are widely used. Improving their quality may increase their impact and give Apertium some prestige.
- ftyers: Agree; some of these don't even use the latest code from -lex-tools and -separable, and a substantial amount of work could be done to improve them based on these. Of the three: Portuguese–Spanish: coverage is the main problem as far as I can tell; we have 15k entries and need 50k or so, i.e. >95% coverage instead of >90%. French–Spanish: needs more rules and much better disambiguation.
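For readers who have not met the metric: naïve coverage, as used on the wiki, is just the share of tokens for which the analyser returns at least one analysis; in the Apertium stream format, lt-proc flags unknown words with an asterisk. A minimal sketch (the token pattern is a simplification of the real stream format, which also defines escapes and superblanks):

```python
import re

def naive_coverage(analysed: str) -> float:
    """Naive coverage: share of tokens that received at least one
    analysis.  In the Apertium stream format, unknown words carry an
    asterisk before the surface form, e.g. ^zorp/*zorp$."""
    tokens = re.findall(r'\^([^$]*)\$', analysed)
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if '/*' not in t)
    return known / len(tokens)

# Toy analysed stream: three tokens, one unknown.
stream = "^la/la<det>$ ^casa/casa<n>$ ^zorp/*zorp$"
print(naive_coverage(stream))  # 2 of 3 tokens analysed -> 0.666...
```

This would give a quick way to track the >90% vs >95% targets mentioned above over a test corpus.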
Engine: I will not be too specific here, as I am not that familiar with the engine internals and may be wrong. For instance, one could:
- improve the Java engine and, consequently, the Android app (remember it can be used offline, as in the Kurdish pair developed by Translators without Borders) and the OmegaT plugin
- ftyers: Here I disagree. I think that having a Java port is a dead end: it is inevitable that it will get out of date very quickly. I think that a better strategy is to (sentence incomplete in Fran's message)
- improve the CG3 processor (it is too slow now, I gather, and is heavily used in many languages).
- ftyers: About CG3, it is rarely the bottleneck until you get up to thousands of rules. The transfer component is usually a bigger bottleneck, although I would welcome hard facts about that. In any case, I think Apertium is reasonably fast most of the time, though improvements are always possible.
- ftyers: For me, there are a number of things that should be improved in the engine.
- The first is that each component should have the potential to be weighted, and these weights should be able to be learnt jointly. This is a big project, probably at least a PhD's worth of work, but it has the potential to offer very big improvements.
- Modules that have been developed for specific purposes (lexical selection, apertium-separable) should be merged into the main apertium code tree. I might even include lttoolbox there, as people can hardly use it without the apertium package too (because of the deformatters); alternatively, move the deformatters to the lttoolbox package.
- The format handling should be reworked; we really need it, and it's so close. Unfortunately, it is not something that could be done in a GSoC project by a new Apertiumer.
- A more expressive and more linguist-friendly transfer formalism. The problem is that this is a non-trivial task, and one that is also hard to find money for; when I mentioned it to someone recently, they said "that sounds like something the EU would have funded in the early 2000s".
- An interesting research idea that people have discussed with me would be to have Apertium generate data for neural MT, but with a more expressive formalism. We have done some work on this already, but not with a very generative system. I have some grant ideas about that, e.g. from field linguistics -> NMT; I can expound on them if people are interested.
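For readers unfamiliar with the deformatters mentioned in the list above: in the Apertium pipeline, the deformatter hides document markup inside superblanks, [...], so that the linguistic modules only ever see plain text plus opaque blanks that the reformatter later restores. A rough sketch of that split (ignoring the escape sequences and empty superblanks the real stream format also allows):

```python
import re

# Alternation: either a superblank "[...]" (group 1) or a run of
# plain text up to the next bracket (group 2).
TOKEN = re.compile(r'\[([^\]]*)\]|([^\[]+)')

def split_stream(stream: str):
    """Split a deformatted stream into (kind, content) pieces,
    where kind is 'format' for superblanks and 'text' otherwise."""
    pieces = []
    for fmt, txt in TOKEN.findall(stream):
        if fmt:
            pieces.append(('format', fmt))
        else:
            pieces.append(('text', txt))
    return pieces

print(split_stream('hello [<b>]world[</b>]'))
# [('text', 'hello '), ('format', '<b>'), ('text', 'world'), ('format', '</b>')]
```

This is why lttoolbox is hard to use on real documents without the apertium package: something has to produce and consume these superblanks.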
Evaluation: we have some automatic evaluation (word-error rate, naïve coverage) in our wiki, but we should select some language pairs for evaluation. Gap-filling (see this recent paper and references therein) could be used when we expect a language pair to be used for gisting. For language pairs which may be used for dissemination, we could pay professional post-editors; the Universitat d'Alacant has experience with this (we had four people post-editing for a week, and we spent about €3000), and tools like PET could be used. There are also some unexplored usages of Apertium, such as interactive machine translation (see also this paper). Things we can spend our money on:
- Gisting evaluation, via gap-filling of selected languages. The technique is mature enough to detect usefulness.
- Post-editing evaluation of selected languages, comparative where other MT systems are available.
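For reference, the word-error rate mentioned above is just a word-level edit distance between the MT output and a reference (e.g. a post-edited version of it), normalised by the reference length. A self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance between the
    reference (e.g. a post-edited translation) and the MT
    hypothesis, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution out of six reference words.
print(wer("the cat sat on the mat", "the cat sat in the mat"))  # 1/6
```

The same distance, applied to post-edited output, approximates the post-editing effort that the paid evaluations above would measure directly.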
GSoC proposals. Very specific proposals that should be part of GSoC 2019 if we can find interested people ("Mikel's list"), to be fleshed out.
- Improve the quality of Occitan pairs. Contact Occitanists so that they get involved and spread the word.
- Take up Breton again.
- Start languages in the Google-served, FOSS-orphan set (e.g. Swahili, Hausa, Igbo, Yoruba, Gujarati, etc.)
- Update the functionality of the Java engine. Port a CG3 engine. Port (or adapt an existing) HFST engine.