Talk:Google Summer of Code/Application 2019
Discussion: Apertium strategy and the GSoC
--Mlforcada (talk) 11:10, 26 January 2019 (CET), 21/01/2019, to be discussed
Does Apertium as a project need more direction? Where would Apertium stand on the cathedral–bazaar line? Most work in Apertium is bazaar-like (people work on the languages and platform features they want), but we have some cathedral-style elements, such as this elected PMC. We also vote on GSoC projects, etc. But (and this is mainly my fault as long-standing president) we have not devoted enough effort to planning. One could say that Apertium basically flies on autopilot and that, every now and then, some Apertiumers correct the course. I think we need to decide and clarify how to implement our mission (see below). And we need to give ourselves a method to do so.
Desired impact, intended mission. We need to collectively decide what is important and concentrate our efforts there. Our declared mission is laid out in the bylaws as follows:
The mission of the Apertium project is to collaboratively develop free/open-source machine translation for as many languages as possible, and in particular:
- To give everyone free, unlimited access to the best possible machine-translation technologies.
- To maintain a modular, documented, open platform for machine translation and other human language processing tasks.
- To favour the interchange and reuse of existing linguistic data.
- To make integration with other free/open-source technologies easier.
- To radically guarantee the reproducibility of machine translation and natural language processing research.
We have some interesting data collected in our wiki, but I think we need to evaluate impact better. We would need a more complete survey about where Apertium is used (in research, commercially, etc.). I know GSoC does not pay for this, but we do have some money in our kitty (€35,000 or more).
Languages. I believe we need to be more specific, particularly about languages. Our mission does not say much about where to concentrate our effort. Let me give some ideas about different kinds of languages (the list is not exhaustive).
- Languages that do not have (other) machine translation, free/open-source or otherwise, and for which there is little chance that they will be tackled with the new neural approach, as the parallel corpora available are too small. Nice examples are Occitan, Sardinian or Breton. We do have some Apertium language pairs for these languages, but their situations differ.
- Occitan is now much better connected to French, Catalan, and Spanish, its big sisters in the area, but improving this translator would make it more useful. Post-editable output may be obtained if time is invested in these languages, as they are closely related. The case of Sardinian is similar.
- ftyers: Agree
- Breton→French is a much harder language pair, but it could be used for gisting (French speakers that do not understand Breton). The evolution of its usefulness during development should be evaluated.
- ftyers: The gisting case is also useful yes.
- Languages that do have commercial machine translation available but are important enough to be tackled by Apertium. I am thinking about languages such as Swahili, Hausa, Gujarati, Igbo, or Yoruba. (The Universitat d'Alacant is currently involved in a project called GoURMeT that will build MT for some of these languages. Neural will be hard, as corpora are small. There could be some interesting synergies.) These languages are rather large and served by Google and other MT providers but do not have free/open-source MT. Apertium's mission is to "give everyone free, unlimited access to the best possible machine-translation technologies". When the only MT systems available for a language are neural or statistical black boxes trained on corpora which are not publicly available, its language community is disempowered and vendor-locked. We need to generate free/open-source technology for them, perhaps for gisting purposes.
- ftyers: Tentatively agree, but we should do a careful study of where the best impact can be found before doing e.g. Swahili--English.
- Languages that may be well supported by more than one MT vendor and have good or very good Apertium engines: Catalan–Spanish, Portuguese–Spanish, French–Spanish, etc. Their impact is large because they are widely used. Improving their quality may increase their impact and give Apertium some prestige.
- ftyers: Agree, some of these don't even use the latest code from -lex-tools and -separable, and a substantial amount of work could be done to improve them based on these. Out of the three: Portuguese-Spanish = coverage is the main problem as far as I can tell, we have 15k entries, we need 50k or so, i.e. >95% coverage instead of >90%; French--Spanish = needs more rules, and much better disambiguation.
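To make the coverage figures above concrete, here is a minimal sketch of how naïve coverage and a list of top unknown words could be computed from analyser output. It assumes the usual Apertium stream format, where each lexical unit looks like ^surface/analysis$ and an unknown word is output as ^surface/*surface$. The function name naive_coverage is hypothetical, and the sketch ignores escaped characters and multiword or compound details, so treat it as an illustration rather than the project's actual tooling.

```python
import re
from collections import Counter

# One lexical unit in the assumed stream format: ^surface/analysis1/analysis2$
# (escaped characters such as \^ or \/ are NOT handled in this sketch)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return (coverage, unknown_words) for analyser output.

    coverage is the fraction of lexical units that received at least one
    analysis; unknown_words is a Counter of lowercased unknown surface forms.
    """
    known = 0
    unknown_words = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        if analyses.startswith("*"):
            # The analyser marks words it does not know with a leading '*'
            unknown_words[surface.lower()] += 1
        else:
            known += 1
    total = known + sum(unknown_words.values())
    coverage = known / total if total else 0.0
    return coverage, unknown_words


if __name__ == "__main__":
    # Invented sample; "zorbled" stands in for an out-of-vocabulary word
    sample = (
        "^The/the<det><def><sp>$ ^cat/cat<n><sg>$ ^zorbled/*zorbled$ "
        "^the/the<det><def><sp>$ ^zorbled/*zorbled$"
    )
    coverage, unknown_words = naive_coverage(sample)
    print(f"naive coverage: {coverage:.1%}")
    print("top unknown words:", unknown_words.most_common(5))
```

Run over a large corpus, the most frequent unknown words give a rough priority list for new dictionary entries. Note that this only measures whether a word gets some analysis, not whether the analysis or the translation is correct.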
Engine: I will not be too specific here as I am not that familiar and I may be wrong. For instance we could
- improve the Java engine and, consequently, the Android app (remember it can be used offline, as in the Kurdish language developed by Translators without Borders) and the OmegaT plugin
- ftyers: Here I disagree. I think that having a Java port is a dead end. It is inevitable that it will get out of date very quickly. I think that a better strategy is to (sentence incomplete in Fran's message).
- Tino Didriksen: My very strong opinion about Java is to broadly speaking forget about Java. Maintaining Java ports is quite frankly a waste of time, and such ports would too quickly lag behind the C++ codebases. All the C++ code works cross-platform, also on Android - it just needs scripts to be built and bundled correctly, which is a mostly one-time setup task perfectly suitable for GSoC. What we may need is a JNI layer and better C++ APIs so that Java can call the native backends without spawning separate piped processes - this would also be a good GSoC project. An alternative if one really wants non-native libraries is something like Emscripten that can compile the C++ code to JavaScript, but that's just yet another build target similar to Android.
- improve the CG3 processor (it is too slow now, I gather, and is heavily used in many languages).
- ftyers: About CG3, it is rarely the bottleneck until you get up to thousands of rules. The transfer component is usually a bigger bottleneck, although I would welcome hard facts about that. In any case, I think Apertium is reasonably fast, most of the time. Improvements are always possible, though.
- Tino Didriksen: it's been tried and mostly failed. To get truly explosive performance boosts, the corresponding limitations in rule number and rule complexity one has to live with are stifling. But I'm always open for new ideas, and I hope to be proven wrong - while I don't think CG-3 is too slow, faster is better.
- ftyers: For me, there are a number of things that should be improved in the engine.
- The first is that each component should have the potential to be weighted, and these weights should be able to be learnt jointly. This is a big project, probably at least a PhD's worth of work, but it has the potential to offer very big improvements.
- Modules that have been developed for specific purposes (lexical selection, apertium-separable) should be merged into the main apertium code tree. I might even include lttoolbox there, as people can hardly use it without the apertium package too (because of the deformatters), alternatively move the deformatters to the lttoolbox package.
- The format handling should be reworked. We really need it and it's so close, but unfortunately it is not something that could be done in a GSoC project by a new Apertiumer.
- A more expressive and more linguist-friendly transfer formalism. The problem is that this is a non-trivial task, and one that is also hard to find money for: when I mentioned it to someone recently, they said "that sounds like something the EU would have funded in the early-2000s".
- An interesting research idea that people have discussed with me would be to have Apertium be able to generate data for neural MT, but with a more expressive formalism. We have done some work on this already, but not with a very generative system. I have some grant ideas about that, e.g. from field linguistics -> NMT; I can expound on them if people are interested.
Evaluation: we have some automatic evaluation (word-error rate, naïve coverage) in our wiki but we should select some language pairs for evaluation. Gap-filling (see this recent paper and references therein) could be used when we expect a language pair to be used for gisting. For language pairs which may be used for dissemination, we could pay professional post-editors. The Universitat d'Alacant has experience with this (we had four people post-editing for a week, and we spent about €3000). Tools like PET could be used. But there are also some unexplored usages of Apertium such as interactive machine translation (see also this paper). Things we can spend our money on:
- Gisting evaluation, via gap-filling of selected languages. The technique is mature enough to detect usefulness.
- Post-editing evaluation of selected languages. Comparative where other MT systems are available.
- ftyers: There should also be a comparative evaluation with other systems/engines, such as Mikel Artetxe's monoses. In terms of language coverage, I'd like to make it wider rather than shallower, even if it means that there is less evaluation material. If we have a lot of languages, we could release it/publish it as an MT evaluation corpus, like the newscommentary test set. Another approach would be to define an "Apertium-standard" evaluation that we could then pay people to do as and when, e.g.
- Naïve coverage, number of entries, top unknown words (by domain)
- Translation quality (postedit WER) 2000 words[?]
- Comparison with other system (postedit WER) 2000 words[?]
- Gap filling success
- Comparison with other system
- Classification of translation errors by module
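For the post-edit WER items in the list above, a minimal sketch of the computation could look like this. It assumes whitespace tokenisation and the common definition of WER as word-level edit distance divided by the length of the reference (here, the post-edited text). The function name word_error_rate is hypothetical, and this is not necessarily how existing Apertium evaluation scripts define or normalise the score.

```python
def word_error_rate(hypothesis, reference):
    """Word-level edit distance between MT output and its post-edited
    version, divided by the number of words in the post-edited version."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        # Undefined for an empty reference; this sketch returns 0.0 or infinity
        return 0.0 if not hyp else float("inf")

    # previous[j] holds the distance between the hypothesis words seen so far
    # (excluding the current one) and the first j reference words
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (0 if hyp_word == ref_word else 1)
            deletion = previous[j] + 1      # drop a hypothesis word
            insertion = current[j - 1] + 1  # add a reference word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Invented example: the post-editor inserted one word
    mt_output = "the cat sat on mat"
    post_edited = "the cat sat on the mat"
    print(f"post-edit WER: {word_error_rate(mt_output, post_edited):.3f}")
```

For a comparison with another system, the same function would be applied to each system's output against its own post-edited version, over the same source sample (for instance the 2000 words suggested above).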
GSoC proposals. Very specific proposals that should be part of GSoC 2019 if we can find interested people ("Mikel's list"), to be fleshed out.
- Improve the quality of Occitan pairs. Contact Occitanists so that they get involved and spread the word.
- Retake Breton.
- ftyers: Good idea, but I'm not sure/would have to think where to start.
- Start languages in the Google-served, FOSS-orphan set (e.g. Swahili, Hausa, Igbo, Yoruba, Gujarati, etc.)
- ftyers: Agree with this.
- Update the functionality of the Java engine. Port a CG3 engine. Port (or adapt existing) HFST engine.
- ftyers: Disagree with this one. Perhaps I would rephrase it "get pipeline for C++ to Android" or "get pipeline for generating from C++ to usage in OmegaT". I would like to add a section on integration/interoperability with other language data. Currently the UD project has freely-licensed annotated corpora for a good many languages in Apertium. Using them in Apertium is problematic because of different tokenisation standards and different tagsets. There should be some effort to make these more interoperable so that we can avoid duplicating effort (having people annotate English text).
- ftyers: *MEETING PROPOSAL*: If many people are going to be in Dublin this year for the MT Summit, how about having an Apertium meeting/workshop where we flesh this stuff out and try to gain some direction? It won't be in time for GSoC this year, but it's definitely nearly ten years since we had a real planning meeting.