Talk:PMC proposals/Allow some code under github.com/apertium

From Apertium

Inventory of things currently in svn

If done in a systematic manner, moving parts of Apertium to a git repository involves deciding what to move and what to keep only in svn.

To help dear PMC members decide, here is an inventory of things currently found in the svn repository:

  • engine (apertium, lttoolbox, apertium-lex-tools)
  • linguistic data for the engine
    • languages
    • incubator
    • nursery
    • staging
    • trunk
  • tools for creating/learning/managing/serving linguistic data
  • documentation
    • official documentation
    • papers
    • courses
    • Apertium's wiki dump can be archived here as well if it's not enormously big :)
  • experimental/playground/sandbox content (not related to linguistic data, otherwise it would be in incubator; many things in branches/ fall into this category)
  • language data not directly used by the engine (freely distributable corpora, including small parallel ones, wordlists, etc.)

It looks like Apertium has developed several machine learning tools, but, as always, there is not enough data for every language pair -- having an explicit languageData/ directory might be a step towards systematising what we have and creating some more. --selimcan (talk) 06:32, 23 February 2015 (CET)

I was thinking perhaps a "corpus" module on the same level as trunk/incubator/staging/nursery would be nice to have (a bit off-topic from the whole git thing though, I don't see any advantage to keeping corpora in git vs svn). --unhammer (talk) 12:34, 24 February 2015 (CET)

In git, maybe the following structure would make more sense:

  • Engine
    • lttoolbox.git
    • apertium.git
    • apertium-lex-tools.git
  • Linguistic data
    • Monolingual or "Languages"
      • apertium-kaz.git
    • Multilingual or "Translators"
      • apertium-kaz-tat.git
      • apertium-turkic.git
  • Tools
    • apertium-html-tools.git
    • apertium-quality.git
  • Corpora
  • Builds
  • Documentation

Then, instead of moving monolingual or bilingual packages from one directory to another (from incubator to nursery and so forth), we would just tag them as such (an "annotated tag", created by one of the PMC members, would then stand for moving the package to a higher level).

trunk/languages/nursery/… also lets you easily show, browse and check out all packages at that level, though. We'd still need a way to do that, e.g. some homegrown web interface that shows all github.com/apertium packages with the tag "nursery" when you look at http://apertium.org/packages/nursery (and gives you a clone URL for them).
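
To make the tagging idea a bit more concrete, here is a rough Python sketch of what such a listing could do. Everything in it is an assumption for illustration: the Package class, the LEVELS ordering (incubator < nursery < staging < trunk), the packages_at helper and the clone URL pattern are hypothetical, and a real interface would have to read the tags from the actual repositories.

```python
from dataclasses import dataclass, field

# Assumed status tags, ordered from least to most mature (hypothetical ordering).
LEVELS = ["incubator", "nursery", "staging", "trunk"]


@dataclass
class Package:
    name: str                                  # e.g. "apertium-kaz-tat"
    tags: list = field(default_factory=list)   # tag names found in the repository

    def level(self):
        """Return the highest status level this package was tagged with, or None."""
        reached = [t for t in self.tags if t in LEVELS]
        return max(reached, key=LEVELS.index) if reached else None

    def clone_url(self):
        # Assumes every package lives directly under github.com/apertium.
        return f"https://github.com/apertium/{self.name}.git"


def packages_at(level, packages):
    """List (name, clone URL) for every package whose current level is `level`."""
    return [(p.name, p.clone_url()) for p in packages if p.level() == level]
```

A page like http://apertium.org/packages/nursery would then only need to call packages_at("nursery", ...) over whatever package list it has collected. Treating the highest tag as the current level means a package never has to be moved, only tagged again.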

The proposal says that

Situations where git shines include
<...>
* using branches for development of new features while keeping the main branch stable 
<...>


This is valuable in many contexts (less terrifying to experiment/refactor, easier to collaborate etc). I don't want to go any further into this, only to relate it to the development of linguistic data (although I am aware that migrating linguistic data is not planned as of now).


In my opinion, branching, among other things, would be especially useful in cases where several people working on different translators rely on the same languages/ sub-module. Imagine that something in that monolingual language package is wrong and should be fixed. There are two options:

  • a) you can do that right away, without caring about the consequences for the translators which, very likely, will break (even if you do care, it might be hard to amend all translators at once) or
  • b) you can make the necessary changes on a development branch, and thereby do the right thing for the monolingual package while giving other developers a chance to adapt their translators (maybe in a 'development' branch as well) at the same time.


Of course, we can just do a) and say that the developers of translators affected by the change should adapt to the only branch of the monolingual package there is, but that would mean that for a (little) while the "nightly" version of the translator performs worse than its latest released version (which is currently the case with the Kazakh-Tatar pair on apertium.org vs turkic.apertium.org). I think this is counter-intuitive and not something we want.


Having a separate languages/ module solved the problem of duplicated code/effort, but without easy branching facilities, it just doesn't scale very well. Easy branching would allow us to collaborate without stepping on each other's toes. --selimcan (talk) 06:31, 23 February 2015 (CET)