PMC proposals/Move Apertium to Github

Revision as of 14:06, 12 March 2018 by Bech (talk | contribs) (→‎Categories: Other previous discussions about that were in these two categories)

Note that this discussion is happening in apertium-stuff but not in apertium-pmc, where it should also take place.

Summary

The issue of moving Apertium code partially or wholly to Github has been in debate for a long time. Previous proposals (PMC proposals/Move apertium to github and PMC proposals/Allow some code under github.com/apertium) met with a number of objections and eventually expired. This proposal attempts to address those issues and outline a modern, updated plan to move all Apertium code to Github.

Bech (talk) 12:10, 7 February 2018 (CET) This proposal, like the previous ones, includes two distinct changes in the same package:
  1. Moving from sourceforge.net to github.com
  2. Moving from Subversion to git.
The merits of each of the two changes could also be examined separately.

Plan in brief:

  • Individual repos for each pair, language module, and tool (preserving all commit history).
  • A couple of "meta-repos" that contain submodules pointing to collections of repos.
    • e.g. apertium-staging would contain ~8 submodules pointing to each of the pairs in SVN's /staging and apertium-all would have submodules to apertium-staging, apertium-incubator, apertium-languages, etc.
    • Hierarchy maintained automatically via simple scripts and Github API.
    • Simple hierarchical interface on top of Github for easy use. (Sushain's demo version)
  • Resources such as Using git (in progress), the linked tutorials, and the Github tutorial for existing SVN developers.
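
The meta-repo idea in the plan can be illustrated with a minimal sketch that renders the `.gitmodules` file a meta-repo such as apertium-staging would carry. The organisation name, the pair names, and the helper `render_gitmodules` are assumptions made for illustration, not part of the proposal. In practice `git submodule add <url> <path>` writes this file and also records the pinned commit, so the sketch only shows what the result would look like.

```python
def render_gitmodules(org, repo_names):
    """Return the text of a .gitmodules file with one submodule per repo.

    `org` and `repo_names` are illustrative inputs; the real list would come
    from whatever the maintenance scripts decide belongs in the meta-repo.
    """
    sections = []
    for name in sorted(repo_names):
        sections.append(
            f'[submodule "{name}"]\n'
            f"\tpath = {name}\n"
            f"\turl = https://github.com/{org}/{name}.git\n"
        )
    return "".join(sections)


if __name__ == "__main__":
    # Hypothetical pair names standing in for the contents of SVN's /staging.
    print(render_gitmodules("apertium", ["apertium-xxx-yyy", "apertium-aaa-bbb"]))
```

Sorting the names keeps the generated file stable between runs, so a regenerated meta-repo only changes when the set of repos changes.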


Proposed by: Shardulc (talk) 00:52, 4 February 2018 (CET)
Seconded by: Sushain (talk) 00:55, 4 February 2018 (CET)

In detail

As a FOSS project, the main benefits to Apertium are:

  • Github gives Apertium higher visibility: according to people who attended the GSoC mentor summit, people often search for "github apertium" to try to find Apertium's code and are unsuccessful
    • --Mlforcada (talk) 10:09, 4 February 2018 (CET) An apology of laziness?
    • --Sushain (talk) 17:59, 4 February 2018 (CET) To some extent I agree. However, for younger individuals that have "open source === github" in their mind, it makes sense.
  • Github makes it very easy for new people to contribute because:
    • More people outside Apertium are far more familiar with Git vs. SVN (especially younger folks such as GCI/GSoC students)
      Bech (talk) 13:04, 7 February 2018 (CET) To install Apertium and one or two language pairs, you (just) have to follow a few wiki pages, which give you the single command line needed to download. One more command has to be learned to commit changes. People who do more specialised things will need to know a total of 4 or 5 Subversion commands. In any case, people who know Subversion will have to learn at least as many commands if Apertium is moved to git.
      Shardulc (talk) 19:48, 7 February 2018 (CET) First, if we shift to Github then of course the new commands will be documented on a wiki page. Second, please see the second-from-last objection in the Caveats section.
    • To start contributing, a new user just has to fork the repo and send a pull request (a 'pull request' is a request to merge a patch), without requesting access on the mailing list etc.
      Bech (talk) 13:04, 7 February 2018 (CET) To start contributing to Apertium (or any other free software), there are previous steps :
      1. discovering this software exists
      2. downloading and installing it
      3. using it
      4. finding things to improve, doing these changes on his computer, testing them
      5. giving these changes to the world.
      For step 2, no account is required with Subversion, just as with git, and no account is required on sourceforge.net, just as on github.com. That is good.
      For step 5, it is normal to allow only people who would do serious work to commit changes, and this depends neither on the storage software used nor on the storage provider.
      Shardulc (talk) 19:48, 7 February 2018 (CET) For step 2, git and Github do not require an account to download and install anything.
      For step 5, submitting a pull request does not mean that the code is committed, only that it is sent for review by the contributors who have been involved for a longer time. It is very easy to do this because the contributor only has to click a few buttons on the Github interface. Currently, a new contributor has to:
      1. find and join the mailing list
      2. send an email requesting commit access
      3. wait for the request to be approved (often no confirmation about this is received)
      4. contribute directly to the central repository without review, or manually make a personal copy of the repository
  • Github encourages better quality code because:
    • Github would provide an excellent issue tracker for each pair/language/tool (an example and a description)
      Bech (talk) 18:09, 7 February 2018 (CET) Is this linked to using github.com as a repository or to using git software? In any case, I am not sure it will change anything for those who just keep a pair on their computer and sometimes make changes and commit them. If github.com or git software has interesting tools, you will have to write documentation to get them used. Having a text file with all the commit comments of a language pair or an Apertium tool may be just as informative 6 months or 2 years later.
      Shardulc (talk) 19:52, 7 February 2018 (CET) The issue tracker is a feature of Github, not git. We will not have to write documentation because Github already has very extensive and complete documentation of these features, much better than SourceForge. It may not make a difference for pairs as you mentioned but it will make a big difference for pairs in active development.
    • Github has excellent pull request review tools (shardulc notes that during GCI 2017, he asked students to make dummy pull requests in dummy Github repositories on at least three different occasions, just because line-by-line code reviews are so good)
    • Each pair/language/tool can have different package maintainers with commit access, so that pull requests are reviewed and merged by people best able to judge their quality, before giving commit access to new contributors
      • --Mlforcada (talk) 10:09, 4 February 2018 (CET) This is *very* important
      Bech (talk) 18:09, 7 February 2018 (CET) This contradicts the "To start contributing ..." advantage. It may make sense to give separate commit access for language pairs and for Apertium tools. But a normal human (not coming here just with the intention to destroy) will not spend time on a language pair if he knows nothing of either of its two languages. In addition, there are presently about 280 language pairs, but this wiki page is very far from reaching 280 lines.
      Shardulc (talk) 20:01, 7 February 2018 (CET) Even though a contributor knows both languages in a pair, that does not mean they will produce the highest quality code! There is always a learning period for software and new contributors especially should not directly commit to the pair, but their changes should be reviewed by an experienced contributor. It is very easy to make changes and submit a pull request because that is all on the personal repository of the contributor, not on the central Apertium repository, until it gets merged.
      Also it is not a good thing that there isn't a maintainer for every package. Moving to Github will ensure that there are one or more people who are responsible for the code because they have written most of it.


Additional benefits:

  • Github provides webhooks for automatic actions, like begiak reporting on new issues; reliable APIs for scripts (more about this in Caveats); and email notifications for repos that you follow.
  • All the benefits of git for those who agree that they are benefits (lightweight and plentiful branches, better support for merging and merge-based workflows, offline commits, complete offline history), and for those who don't agree, there is the Github SVN bridge
  • Granular permissions: not every contributor needs access to literally everything, which is especially useful for GCI/GSoC students (more about this in Caveats)
  • Github's web interface has a feature set that eclipses SourceForge's interface especially when it comes to navigating code and it receives frequent improvements/updates
    • --Mlforcada (talk) 10:09, 4 February 2018 (CET) It would be nice to give more objective wording for this.
    • --Sushain (talk) 17:59, 4 February 2018 (CET) My apologies; these are mostly my words here. I have revised them to be a bit more objective.
  • Each repo has its own version history and releases
  • Package maintainers can add continuous integration tools or enforce workflows for specific pairs/languages/tools as required; these decisions can be taken independently for each pair/language/tool by contributors involved with it
    • --Mlforcada (talk) 10:09, 4 February 2018 (CET) Please explain.
    • --Sushain (talk) 17:59, 4 February 2018 (CET) GitHub provides support for lots of tooling and settings. For example, I can easily ensure that a repo's master branch doesn't get pull requests merged until status checks pass. These status checks can be anything from code linting to unit/integration tests and coverage tools (i.e. continuous integration). These tools have to be configured on a per-repo basis and would be a nightmare for a monorepo. Just consider how large the configuration would get to test everything and how many [programming] languages we'd need to install to test all of Apertium's different tools, core, modules, etc. For an example of continuous integration in action, see: https://github.com/goavki/apertium-html-tools/pull/253. There are lots of more complex uses, e.g. a UI library could make it so that every pull request builds a version of the docs with that pull's changes and hosts it online. For an Apertium module, the build tool could be easily configured to emit a simple HTML page artifact that contains a bunch of test phrases that could be verified for accuracy before merging the code (unit tests are sometimes < eyeballs).
    • --Shardulc (talk) 18:59, 4 February 2018 (CET) About "workflows": if all the active contributors to a package agree, then more 'git-like' development workflows could be used. For example, the 'apertium/apertium-xxx' repo has a 'master' branch and a 'development' branch. Each developer, e.g. 'shardulc', forks the repo, makes feature-specific branches, and opens pull requests from 'shardulc/apertium-xxx/new-feature' to 'apertium/apertium-xxx/development'. Finally, every once in a while, the features in 'development' are reviewed and merged into 'master'.
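
The fork-and-branch workflow described in the last comment could look roughly like the following sketch, which only builds the list of git commands a developer would run. The helper name `feature_branch_commands` and the remote name `upstream` are assumptions chosen for illustration. Opening the pull request itself happens on the Github web interface and is not shown.

```python
def feature_branch_commands(upstream_org, developer, repo, feature, base="development"):
    """Return the git commands for one feature branch in a forked repo.

    The fork 'developer/repo' is assumed to exist already on Github.
    """
    fork_url = f"https://github.com/{developer}/{repo}.git"
    upstream_url = f"https://github.com/{upstream_org}/{repo}.git"
    return [
        # Clone the personal fork; it becomes the 'origin' remote.
        ["git", "clone", fork_url],
        # Register the central repo so its branches can be fetched.
        ["git", "-C", repo, "remote", "add", "upstream", upstream_url],
        ["git", "-C", repo, "fetch", "upstream"],
        # Start the feature branch from the central 'development' branch.
        ["git", "-C", repo, "checkout", "-b", feature, f"upstream/{base}"],
        # (edit and commit here), then publish the branch on the fork.
        ["git", "-C", repo, "push", "origin", feature],
    ]


if __name__ == "__main__":
    for cmd in feature_branch_commands("apertium", "shardulc", "apertium-xxx", "new-feature"):
        print(" ".join(cmd))
```

The pull request would then go from 'shardulc/apertium-xxx/new-feature' to 'apertium/apertium-xxx/development', as in the comment above.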


Communicating the change:

  • Thanks to a recent PMC election, we have a list of contributors and email addresses (which is complete as far as we know). This can be used to:
    • announce the change to all contributors
    • provide links to help pages, documentation of the change, etc.
    • provide email addresses and IRC nicks to contact if any further help is needed
    • README in the SVN repo explaining the transition
    • manually changing the 7 references to svn.code.sf.net on the Wiki
      • There are 100s of links, I suggest you use Google instead of Wiki search. - Francis Tyers (talk) 11:29, 13 February 2018 (CET)
  • Limited backwards-compatibility:
    • the SVN repo can be read only: anybody trying to commit will be presented with a message notifying them of this
    • the SVN repo directories will be replaced with svn:externals pointing to the Github repos (see Caveats)


Miscellaneous concerns:

  • Mailing lists: should probably be preserved on SourceForge for now until/unless we choose to switch to another solution or self-host them.
  • Existing issues: Sushain volunteers to move our existing issues (which are not too many), either manually or with an automatic solution.

Caveats

Objection: It is harder to work with the meta-repos. git submodule commands are gnarly.
Response:

  • Aliases and cheatsheets will remedy this effectively.
  • Very few people check out e.g. the entire staging directory at once anyway.


Objection: If X number of pairs are changed at once, it will create X different commits.
Response:

  • Already happens for most developers anyway, who have specific pairs/languages checked out and not the whole tree.
  • (arguable) Is this really a bad thing? Doesn't it make sense that each pair/language/tool can stand on its own with its own history, with no connections to others?
    • It would be nice if pairs/languages were independent, but it's not the reality. If I change, say, apertium-nob's proper noun pardefs to support dashes, I now have to ensure all pairs connected to -nob (nno, swe, dan, sme, …) don't get testvoc problems, and possibly change their transfer rules/bidixes (and in some cases the easiest way is to update all the other monolingual sides with the same change). OTOH, I realise most people don't care. --unhammer (talk) 12:44, 9 February 2018 (CET)


Objection: Selective granular permissions are Bureaucratic and Bad.
Response:

  • They are not compulsory. If wanted, everyone could have access to everything with "organization permissions" instead of "repository permissions", but this is not desirable because:
  • Most developers work with specific languages and pairs and do not need access to everything. A single compromised account should not threaten all the code.


Objection: How will the meta-repos be kept up-to-date with the latest versions of pairs/languages/tools?
Response:

  • With scripts! Github provides a clean, reliable API for doing actions like these. The required scripts are very simple.
  • Sushain is willing to write the scripts and Tino is willing to host them (and perhaps do code review). The scripts themselves can be on Github so that if required, others can maintain them.
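
As a rough idea of what such a script might do, the sketch below compares the submodules a meta-repo currently has with the repos that should be in it, and turns the difference into git commands. How the "wanted" list is obtained (for instance from the Github API) is left out, and the function names are hypothetical. This is not the script Sushain proposes to write.

```python
def plan_submodule_sync(current_submodules, wanted_repos):
    """Return which submodules to add to and remove from a meta-repo."""
    current = set(current_submodules)
    wanted = set(wanted_repos)
    return {
        "add": sorted(wanted - current),
        "remove": sorted(current - wanted),
    }


def sync_commands(org, plan):
    """Translate a sync plan into git commands run inside the meta-repo."""
    commands = []
    for name in plan["add"]:
        url = f"https://github.com/{org}/{name}.git"
        commands.append(["git", "submodule", "add", url, name])
    for name in plan["remove"]:
        # 'git rm' on a submodule path also drops its .gitmodules entry.
        commands.append(["git", "rm", name])
    # Move every remaining submodule to the tip of its tracked branch.
    commands.append(["git", "submodule", "update", "--remote"])
    return commands


if __name__ == "__main__":
    plan = plan_submodule_sync(["apertium-aaa-bbb"], ["apertium-aaa-bbb", "apertium-xxx-yyy"])
    for cmd in sync_commands("apertium", plan):
        print(" ".join(cmd))
```

A scheduled job or a webhook handler could run these commands and then commit and push the meta-repo.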


Objection: GitHub doesn’t provide a nice interface to view repos in a tree format like SourceForge.
Response:

  • (as mentioned in summary) Sushain will! See this demo page which is a simple, elegant, single HTML file.
  • The page is automatically generated from tags on repos which makes it trivial to extend to all existing repos and new ones.
  • As before, this code can be on Github so others can review and maintain it if required.
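
To make the "generated from tags" idea concrete, here is a small sketch that groups repos by tag and renders a nested HTML list. It assumes each repo is described by a dict with `name` and `topics` keys, which is how the Github REST API describes repository topics. The topic names and helper functions are invented for illustration and are not taken from Sushain's demo.

```python
from collections import defaultdict
from html import escape


def group_by_topic(repos, topics_of_interest):
    """Map each topic of interest to the sorted names of repos carrying it."""
    groups = defaultdict(list)
    for repo in repos:
        for topic in repo.get("topics", []):
            if topic in topics_of_interest:
                groups[topic].append(repo["name"])
    return {topic: sorted(names) for topic, names in sorted(groups.items())}


def render_tree(org, groups):
    """Render the grouped repos as a two-level HTML list of links."""
    lines = ["<ul>"]
    for topic, names in groups.items():
        lines.append(f"  <li>{escape(topic)}<ul>")
        for name in names:
            url = f"https://github.com/{org}/{name}"
            lines.append(f'    <li><a href="{escape(url)}">{escape(name)}</a></li>')
        lines.append("  </ul></li>")
    lines.append("</ul>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up repo descriptions; real ones would come from the API.
    repos = [
        {"name": "apertium-xxx-yyy", "topics": ["apertium-staging"]},
        {"name": "lttoolbox", "topics": ["apertium-core"]},
    ]
    groups = group_by_topic(repos, {"apertium-staging", "apertium-core"})
    print(render_tree("apertium", groups))
```

Because the grouping depends only on the tags, adding a new repo to the page would just mean tagging it.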


Objection: Existing developers will be inconvenienced.
Response:

  • (as mentioned in summary) Resources such as Using git (in progress), the linked tutorials, and the Github tutorial can help existing developers.
  • (as mentioned in details) The Github SVN bridge can be used by those not comfortable with git.
  • In the long run, 'new' developers will outnumber the current existing developers.
  • For backwards-compatibility, the SVN repo can be populated with svn:externals that point to Github's SVN bridge. Anyone checking out SVN repos will be effectively using the Github SVN bridge instead.
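
A possible shape for the svn:externals definition is sketched below. It assumes that the Github SVN bridge is available and exposes each repo's default branch under a `/trunk` path, and it uses the "URL, then local directory" form of svn:externals from Subversion 1.5 and later. The repo names and the helper `render_externals` are hypothetical.

```python
def render_externals(org, repo_names):
    """Return an svn:externals definition with one line per Github repo.

    Each line has the form '<URL> <local directory>'. The '/trunk' suffix
    is an assumption about how the SVN bridge lays out a repo.
    """
    lines = [
        f"https://github.com/{org}/{name}/trunk {name}"
        for name in sorted(repo_names)
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical pairs standing in for one SVN directory such as /staging.
    print(render_externals("apertium", ["apertium-xxx-yyy", "apertium-aaa-bbb"]), end="")
```

The output could be saved to a file and applied to a directory with `svn propset svn:externals -F <file> .` before the SVN repo is made read-only.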


Objection: There will be ~500 repositories under Apertium!
Response:

  • Nobody will have to navigate those repositories directly. An interface like Sushain's demo interface solves that problem.
  • The philosophy of git and Github supports large numbers of repositories, each serving distinct purposes, over the alternative.
  • It is better than having one single repository for all the code, or single repositories for each subtree, etc. for many reasons as mentioned in Details.


Objection: We will lose the download counter. This information is useful to show the importance of Apertium when asking for public money to fund new developments.
Response:

  • SourceForge download stats were never representative of actual users. Instead, we have package download stats and Debian popcon. Tino Didriksen (talk) 11:47, 15 February 2018 (CET)
  • It looks like it's possible to get the number of downloads for each release via the GitHub API. If the latest count of downloads on SourceForge is kept in a file, maybe it would be possible to show the sum of these two on Sushain's repo interface? selimcan (talk) 22:33, 13 February 2018 (CET)
  • Shardulc (talk) 01:30, 14 February 2018 (CET) Yes, as selimcan said, it is possible to get the number of downloads for each release (an example) via the API and it would be easy to display such counts on the web interface too. A few details are worth mentioning:
    • Unfortunately, the download count is not public as on SourceForge.
    • The download count is not for a release, but for an asset of the release. For example, if a release contains a .tar.gz file and also a .zip file, then the download counts for those two will be separately provided.
    • Apart from releases, GitHub publicly displays traffic for cloning and page visits.
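
The per-asset counting described above, combined with selimcan's idea of adding a stored SourceForge figure, could be sketched as follows. The input is assumed to be the parsed JSON list of a repo's releases from the GitHub API, where each release has a `tag_name` and a list of `assets` with `name` and `download_count`. The sample numbers and the function name are made up for illustration.

```python
def summarise_downloads(releases, sourceforge_baseline=0):
    """Return (per-asset counts, grand total including the SourceForge baseline)."""
    per_asset = {}
    for release in releases:
        for asset in release.get("assets", []):
            # Counts are reported per asset, so key them by release and file name.
            key = f'{release["tag_name"]}/{asset["name"]}'
            per_asset[key] = asset["download_count"]
    total = sourceforge_baseline + sum(per_asset.values())
    return per_asset, total


if __name__ == "__main__":
    # Made-up data in the shape of the releases API response.
    sample = [
        {"tag_name": "v1.0", "assets": [
            {"name": "pkg.tar.gz", "download_count": 10},
            {"name": "pkg.zip", "download_count": 5},
        ]},
        {"tag_name": "v1.1", "assets": [
            {"name": "pkg.tar.gz", "download_count": 7},
        ]},
    ]
    per_asset, total = summarise_downloads(sample, sourceforge_baseline=1000)
    print(per_asset)
    print(total)
```

Keeping the SourceForge figure as a fixed baseline means the displayed total never drops after the move.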

Comments

  • Some of the pros look like cons to me. I don't quite see the big win for language data, since we have git-svn, but I would very much like having the core tools (lttoolbox, apertium, …) in git. OTOH, having some tools on github and some in Sourceforge SVN is perhaps confusing to people. There are enough problems with the current setup that I'm fine either way. --unhammer (talk) 18:45, 9 February 2018 (CET)

Non-PMC signatories:

Voting

Agree

  • Tino Didriksen (talk) 13:32, 4 February 2018 (CET)
  • Xavi Ivars (talk) 14:04, 4 February 2018 (CET)
  • Firespeaker (talk) 17:28, 4 February 2018 (CET)
    • Mikel's comments and questions are good—I'd like to see the questions addressed, but am otherwise on board. Also, would it make sense to host the repository organising page on github.io? I think I saw something like this recently that some other org had set up, but I don't remember where.
      • This would work. I'm happy to set that up. Presumably the repo with this page would also have the scripts to keep the submodules up-to-date, etc. We can have a redirect or reverse proxy of it from the apertium.org site. I have also responded to Mikel's comments. -- Sushain (talk) 17:45, 4 February 2018 (CET)
      • I also think the github.io page is a good idea. Github Pages makes it so that the organization 'apertium' gets the domain 'apertium.github.io' to host anything we want. Also, we can have unlimited pages for 'projects', such as 'apertium.github.io/tools' if needed. Shardulc (talk) 18:50, 4 February 2018 (CET)
  • Mikel Forcada

Disagree

Abstain