The extraordinary success of git(hub)

The previous post, How I developed my git aversion, talked about things that happened in the middle of 2007.

That was nearly a year before github launched publicly, in April 2008. I know that because I just looked it up. I’m not sure I would have believed it otherwise: git without github seems like an alien idea.

Still, it must be true that github didn’t exist then, because it would have solved the two problems that I had with git. It answers the question of where to push to, when you’re using a random collection of computers that aren’t always on; and it provides the community of people you can ask questions of when you find yourself baffled.

And that community is? All developers. Or at least, all those who ever work in the open.

The amazing success of github (facilitated by the architecture of git, if not by the syntax of its tools) is to have produced a level of public use of version control software that is completely out of proportion to the number of developers who ever cared about it before.

That’s partly because github has so many users who are not classic “software developers”, but I suspect that it’s also because so many software developers would never otherwise use version control at all. I can’t believe that very many of github’s current users are there for the version control. They’re there for lightweight and easy code sharing. Version control is a happy accident, a side-effect of using this social site for your code.

I still don’t really use github myself, partly because I don’t really use git and partly because of an antipathy to social networks. (I don’t use Facebook or Google+ either.) But it’s a truly extraordinary thing that they’ve done.

How I developed my git aversion

In the summer of 2007, I switched some of my personal coding projects from the Subversion version control system to git.

Git was especially appealing because the network of computers I regularly worked on was quite flat. I did some work on laptops and some on desktops at home and in the office, but for projects like these I didn’t have an always-on central server that I could rely on pushing to, as one really needs to have with Subversion.

So even though I was working alone, I was eager for a distributed, peer-to-peer process: commit, commit, commit on one machine; push to the other machine or a temporary staging post online; pick up later when I’m at the keyboard of the other machine.

Git wasn’t the first distributed version control system I’d had installed—that would be darcs—but it was the first that looked like it might have a popular future, and the first one I was excited about being able to use.

My excitement lasted about a week, and then I lost some code.

I lost it quite hard: I knew I’d written it, I knew I’d committed it, and I knew I’d pushed it somewhere other than the machine I’d written it on, although I couldn’t remember which machine that had originally been. But I couldn’t find it anywhere.

It wasn’t visible on the machine I’d pushed it to, and I couldn’t find it on any of the machines I might have pushed from. In fact, I never did find it. I’d managed to get my code enfolded into the system so that I could no longer get back to where I’d left it. I didn’t know anyone else who used git at the time, so there was nobody I could ask for help. I’d fallen for a program that was cleverer than me and that wasn’t afraid to show it.

And as a long-time user of centralised version control systems, the idea of losing code after you checked it in was really a bit shocking. That shouldn’t ever happen, no matter how dumb you are.

So I went back to Subversion. Technically-better is not always better.

The reason for my confusion was that in my excitement I’d been imagining I could freely do peer-to-peer pushes and end up with the same repository state at both ends, which any fool could tell you is just not the way it works. (I probably lost the code by doing a push to a non-bare repository, one with its own checked-out working copy, something that is now harder to do by accident because git refuses by default to push to a checked-out branch.)

As it happens, though, the way I had imagined it would work… is the way it works with Mercurial.

So when I found Mercurial I became happy again, and I’ve been happily using that ever since. Of course git has become so popular that you can’t really avoid it, and I know it well enough now that I wouldn’t make those mistakes again. Still, how much happier to use a system that actually does work the way you expect it to.

Rules of thumb for functional APIs

I’ve been trying to get to grips with what makes an API clean and pleasing to use in a functional programming language. (In my case this language has been Yeti, an attractive language that uses the Java virtual machine.)

Here are some notes, for my own reference as much as anyone’s. Some of this may be particular to Yeti, but most of it is (hopefully, if I have things right) going to be obvious stuff to any functional programmer. Please leave a comment if you have any more or better suggestions!

Avoid mutable state where practical

  • It’s easier to reason correctly about the behaviour of a series of functions that each take an object and return a new object derived from it, leaving the original one unaffected, than a series of functions that can each change some hidden state in the object and so affect the behaviour of all subsequent functions in an invisible way.
  • This is the nub of practical functional programming: pure functions—functions without hidden state—are predictable, testable, and easier to understand within a wider system.
  • Objects with mutable state should be limited to things that really do have some conceptual internal state, such as database handles or streams.
  • So where in an object-oriented language you may have a class with internal state plus a set of methods that act on it, organise this as a module in which the functions (named somewhat like methods) accept state and return some new state. The state is most likely a struct, maybe with getters but not setters.
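
As a rough illustration of the last point, here is a minimal sketch in Python rather than Yeti (chosen only because it is easy to run). The Counter type and the functions with_step and increment are hypothetical names invented for this example. The state is a frozen record, and each function returns a new record instead of changing the one it was given.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Counter:
    """Immutable state: fields can be read but not reassigned."""
    count: int = 0
    step: int = 1


def with_step(counter, step):
    """Return a new Counter with a different step; the input is untouched."""
    return replace(counter, step=step)


def increment(counter):
    """Return a new Counter advanced by one step."""
    return replace(counter, count=counter.count + counter.step)


start = Counter()
later = increment(increment(with_step(start, 5)))

print(start)  # Counter(count=0, step=1)
print(later)  # Counter(count=10, step=5)
```

Because start is never modified, any code still holding it sees the same values no matter what was derived from it afterwards.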

Give functions names relative to their modules

  • A Yeti module is a file whose top-level code evaluates to something. Typically it contains one or more bindings (function declarations usually) which are returned within a struct at the end of the file’s top-level code.
  • (Modules can be loaded either with a plain load expression or in a binding: load my.module versus m = load my.module. In the first case a function func within the module would be referred to after loading simply as func; in the second, it would be m.func.)
  • I think it’s best to expect that everyone will be using the second form to load your module, and to name functions so that they make sense when they have the module name immediately before them. You don’t need to worry about name collisions with other modules or the standard library. As a programmer used to object-oriented languages, I think this helps when structuring code so as to avoid too much mutable state, because it means the module can take on the namespacing role that would otherwise be carried out by the object class.

Distinguish between curried arguments and other ways of packaging

  • The obvious way to write a function that takes more than one argument in Yeti is using what are known as “curried” arguments, with syntax: f a b = a + b
  • This allows partial application: with two arguments f 2 3 is 5, but with only one, f 2 makes a new function that takes another argument and adds 2 to it.
  • Curried arguments are useful where callers might actually want to bind the first argument and then reuse the function. As an extreme example, one might introduce a second argument for a function that in theory only needs one, like an FFT: in a function declared fft size data, the size argument is redundant because the size can be queried from the data array, but it allows the FFT tables to be precomputed when the first argument is bound. So fft 1024, leaving the second argument unbound, becomes a bit like an object constructor.
  • But it’s easy to get used to the idea that this is just how multiple arguments are passed. Many functions won’t have callers that want to do partial application and won’t benefit from knowing some arguments before the rest. And the disadvantage of curried arguments is that the caller needs to remember what order they appear in.
  • So, functions that take a set of related arguments at once, and can’t benefit from knowing one argument in advance of the rest, should accept them as named values in a struct instead. It makes for a more discoverable API.
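
The points above can be sketched in Python (again, not Yeti), with currying written out by hand as nested functions. Everything here is an assumption made for illustration. The function f mirrors the f a b example. The function dft is a naive discrete Fourier transform standing in for a real FFT, and it only shows how binding the size first lets a table be precomputed once. Grid and grid_points are hypothetical names showing related arguments passed as named fields.

```python
import cmath
from dataclasses import dataclass


def f(a):
    """Curried addition: f(2)(3) mirrors the Yeti call f 2 3."""
    return lambda b: a + b


def dft(size):
    """Bind the size first so the twiddle table is built only once."""
    table = [cmath.exp(-2j * cmath.pi * k / size) for k in range(size)]

    def transform(data):
        # The size is redundant with len(data), so check that they agree.
        if len(data) != size:
            raise ValueError("data length does not match the bound size")
        return [
            sum(x * table[(k * n) % size] for n, x in enumerate(data))
            for k in range(size)
        ]

    return transform


@dataclass(frozen=True)
class Grid:
    """Related arguments passed together by name instead of by position."""
    start: float
    step: float
    count: int


def grid_points(grid):
    """Return the evenly spaced points described by the grid."""
    return [grid.start + i * grid.step for i in range(grid.count)]


add_two = f(2)      # partial application: only the first argument is bound
print(add_two(3))   # 5

dft4 = dft(4)       # a bit like an object constructor; the table is built here
spectrum = dft4([1, 1, 1, 1])

print(grid_points(Grid(start=0.0, step=0.5, count=4)))  # [0.0, 0.5, 1.0, 1.5]
```

The caller of grid_points never has to remember whether start or step comes first, which is the discoverability benefit described in the last point.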