Mercurial vs Git: Scaling and Architecture

A long, somewhat technical post on why mercurial is better at scaling than git. Some history, too.

Everywhere I look, I see people using git. From experienced programmers to newbie programming enthusiasts, everyone seems to be using git. And GitHub, for that matter. In fact, we may even conclude that the popularity enjoyed by git is in large part due to the popularity of GitHub. I mean, we can all admit that all the cool new things in FOSS are being published on GitHub. This just means that anyone willing to contribute needs to know git, at least at a beginner level. So, it’s obvious to anyone that learning git as your first version control system is a good investment.

Now, let me make it clear that despite all appearances to the contrary, I am in no way criticising git or GitHub. In fact, I love git. Seriously. People may blame git (no pun intended) for a steep learning curve but I don’t. It’s a great piece of software and I thank Linus Torvalds for coming up with the idea, much as I thank him for linux. I also love GH a lot. So, don’t go hating, now.

Alright, let’s get back to the topic. First, a touch of history.

How it all began

I’m sure most of you know (especially since I just mentioned it) that Linus Torvalds created git. What you might not know, though, is why he did it. So, let’s walk down memory lane for a bit.

The linux kernel, often affectionately just called ‘the kernel’ by the FOSS community, has been in active development since 1991. Initially, the developers passed patches around by email, reviewed them by email, and pretty much did everything via email. There was no version control to speak of, as such, and it was said that Linus Torvalds didn’t scale. Later, the kernel started to be version controlled on a proprietary DVCS called BitKeeper. But, it was not meant to be and they broke up in 2005. Awww…

This was when two people decided to make a completely new DVCS, both building on principles established by BitKeeper. These two people were Linus Torvalds, as you already know, and Matt Mackall. Both were adamant on creating a perfect - at least their vision of perfect - DVCS. And, almost ten years later, we have both of them to thank for two absolutely fantastic version control systems - systems so good that they blew away all the competition, commercial or otherwise.

Linus decided to go with C, for reasons of speed, I should think. Matt, however, saw another language, one that was maybe not quite so fast as C, but with tremendous potential.

He decided to develop in Python.

The problem was well-known - Python was slow. However, with a clever bit of architecture, Matt made sure that this wasn’t too big of a problem initially. The advantage, however, was great. Python brought with it an ease of development that drew in lots of new people as volunteers - Python is just that easy. And this decision paid off well, as we will see.

Now let’s talk some basics. Then, we’ll move on to the scaling problems.

The Differences: Interface

First things first - mercurial is easy for beginners. This, quite mistakenly, makes people believe that it is a dumbed down version of git with less functionality, less… depth. Not so. Mercurial is every bit as powerful as git and anything that you can do with git, you can almost always do with mercurial. Most of mercurial’s interface is pretty much the same as git’s anyway, just with different terminology. And so it was that the first time I used mercurial, I found it extremely easy to use.

But we’re not here to talk of the difference in the interface. So, let’s talk behind-the-scenes.

The Differences: Architecture

Before we start, let me warn you - this is not a detailed article for reading about the architecture of mercurial or git. I’ll mostly just touch the surface of things.

Git

First, let’s talk about git. If you’ve ever tried to look under-the-covers of a git repository, you’ll know of git objects. Git and mercurial both use a directed acyclic graph (DAG) to represent history with commits - or changesets, as they are called in mercurial - as nodes of the graph. Also, if you’ve read the first chapter of the Pro Git book, you will know that git stores a snapshot of your repository whenever you make a commit instead of the usual delta (or diff) that other systems store. To proceed, we need to know what exactly a commit is. Well, this diagram may help (image taken from http://skookum.com):

[Image: diagram of a git commit and the objects it references]

So, a commit is a file that stores a bunch of data (author, commit message, parent commit) along with a reference to a tree object. A tree object stores references to other tree objects and blob objects (analogous to the usual directory/file paradigm in a traditional file system). These blobs are the real building blocks of a git repository, actually. Any modified/new file is added to the .git/objects directory as a blob object. So, what of unchanged files? Well, the tree object that contains the file simply stores a reference to the existing blob object, as opposed to creating a new one. Again, an image might help:

[Image: two tree objects referring to the same blobs]

Here, you can see that two tree objects refer to the same two blobs.
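
To make the object model a little more concrete, here is a minimal Python sketch of a content-addressed store. This is not git’s actual code: the names store, put, put_blob, put_tree and put_commit are made up for illustration, the tree only handles a flat directory of regular files, and the commit leaves out the author and committer lines that real git records. What it does show is that an object’s id is a hash of its content, so unchanged files end up as the same blobs in both trees.

```python
import hashlib

# Object id -> raw object bytes. Stands in for the .git/objects directory.
store = {}


def put(kind: str, body: bytes) -> str:
    """Store an object under the SHA-1 of '<kind> <size>\\0<body>'."""
    raw = kind.encode() + b" " + str(len(body)).encode() + b"\0" + body
    oid = hashlib.sha1(raw).hexdigest()
    store[oid] = raw  # storing identical content twice is a no-op
    return oid


def put_blob(content: bytes) -> str:
    return put("blob", content)


def put_tree(entries: dict) -> str:
    """entries maps file name -> blob id (flat directory, regular files only)."""
    body = b"".join(
        b"100644 " + name.encode() + b"\0" + bytes.fromhex(oid)
        for name, oid in sorted(entries.items())
    )
    return put("tree", body)


def put_commit(tree: str, parent, message: str) -> str:
    """Simplified commit: no author or committer lines (an assumption)."""
    lines = ["tree " + tree]
    if parent:
        lines.append("parent " + parent)
    lines += ["", message]
    return put("commit", "\n".join(lines).encode())


# First commit: three files.
readme = put_blob(b"hello\n")
licence = put_blob(b"MIT\n")
main_v1 = put_blob(b"print('v1')\n")
tree1 = put_tree({"README": readme, "LICENSE": licence, "main.py": main_v1})
c1 = put_commit(tree1, None, "first")

# Second commit: only main.py changes, so README and LICENSE reuse their blobs.
main_v2 = put_blob(b"print('v2')\n")
tree2 = put_tree({"README": readme, "LICENSE": licence, "main.py": main_v2})
c2 = put_commit(tree2, c1, "second")

# 8 objects in total: 4 blobs, 2 trees, 2 commits.
print(len(store))
```

Both trees point at the same README and LICENSE blobs, which is the sharing the picture above illustrates. Only the changed file, its tree and the commit are new.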

And so, this is how git works internally. This appears to be a very good model at first - and in fact, it is. Git’s architecture is designed to provide speed, and the way git stores things makes both storage and retrieval blazingly fast. However, as your project grows large - especially if there are a lot of commits and a lot of authors - more and more git objects are created and stored as files. Often, several objects are created for a single commit. That’s when the problems begin - your project gets too large and git becomes slowww. How slow? Well, how about getting the result of a git status in 30 seconds instead of the usual near-instant response? Yeah, I know - that would suck pretty bad.
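
As a rough back-of-the-envelope check on "several objects for a single commit", here is a small Python sketch. The function name new_objects_for_edit and the example path are made up, and it only counts the objects implied by the model described above (one new blob, one new tree per directory on the path including the root, one commit). It ignores anything git may do later to repack those objects.

```python
def new_objects_for_edit(path: str) -> int:
    """Objects written when a commit changes exactly one file at `path`."""
    depth = path.count("/")  # directories between the root and the file
    blobs = 1                # the new content of the file
    trees = depth + 1        # each directory on the path, plus the root tree
    commits = 1
    return blobs + trees + commits


# A file in the root directory: 1 blob + 1 tree + 1 commit.
print(new_objects_for_edit("README"))

# A file three directories deep: 1 blob + 4 trees + 1 commit.
print(new_objects_for_edit("a/b/c/file.c"))
```

So even a one-line change deep in the tree writes a handful of new objects, and that adds up over many commits and many authors.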

But that poses a question - how large does your codebase need to be? 100,000 lines? A million lines? 10 million lines? Well, the linux kernel is somewhere around 5 million LOC and git works just fine with it. So again, how big?

Facebook big.

Facebook’s codebase is 62 million lines of code, and git had become slow for the engineers at Facebook. So, they decided to make mercurial faster than git. And they succeeded, as the Facebook blog post shows. In fact, Facebook has assigned a few people to continue bettering mercurial, improving speed, reliability and usability, which helps both the FOSS community and the devs at Facebook. It’s a cliched win-win situation, really.

Mercurial

Anyway, now let’s come to mercurial’s architecture. Mercurial is somewhat odd in its choice of storing data - in that it does not use either the delta or the snapshot approach exclusively. It tries to strike a balance between the two. I’ll explain this in a bit.

As I’ve said before, git uses git objects - blobs, trees, commits - as backend storage. Also, git creates a new snapshot for every changed file. Mercurial, however, uses a singularly curious storage system - the Revlog. This is quite an ingenious way of storing data actually. Here’s how:

Every tracked file corresponds to two different files in the storage backend - an index file and a data file.
The data file (a file with a .d extension) contains zlib-compressed binary deltas for every revision of the file - it’s an append-only data structure, which means that every new delta gets appended at the end of the .d file. There is a catch to this though.
The .d file doesn’t always store binary deltas. Sometimes, a complete snapshot is stored instead. So, how does mercurial decide between the two? Well, if the compressed delta - or rather, the chain of deltas that would need to be applied to recover the revision - is smaller than a certain calculated threshold for the file, then it is appended as a binary delta. If it is larger than this threshold, then the entire compressed file snapshot is appended to the .d file instead. From what I’ve been able to gather, this threshold is based on the size of the source file (or maybe even the .d file, for all I know), though I’m not certain of the exact rule.
For every commit (changeset in mercurial terminology), the index file stores the offset - the byte from which to begin reading in the .d file - and the length - the amount of data to read - from the .d file. This gives us the binary delta for a given changeset. Applying the chain of such deltas, in order, on top of the most recent full snapshot gives us the file at that changeset.
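
The points above can be sketched as a toy revlog in Python. Treat this as an illustration under loud assumptions, not mercurial’s implementation: ToyRevlog, make_delta and apply_delta are made-up names, the delta is a simple line-based edit list encoded as JSON rather than mercurial’s binary delta format, the "files" are an in-memory bytearray and list, and the snapshot rule (store a full snapshot once the compressed delta chain outgrows the full text) is just my guess from above turned into code.

```python
import json
import zlib
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes) -> bytes:
    """Encode `new` as line-range edits against `old` (toy format, not mercurial's)."""
    a, b = old.splitlines(keepends=True), new.splitlines(keepends=True)
    ops = SequenceMatcher(None, a, b, autojunk=False).get_opcodes()
    edits = [(i1, i2, b"".join(b[j1:j2]).decode("latin-1"))
             for tag, i1, i2, j1, j2 in ops if tag != "equal"]
    return json.dumps(edits).encode("ascii")


def apply_delta(old: bytes, delta: bytes) -> bytes:
    a, out, pos = old.splitlines(keepends=True), [], 0
    for start, end, replacement in json.loads(delta):
        out.extend(a[pos:start])                    # unchanged lines before the edit
        out.append(replacement.encode("latin-1"))   # new lines replacing a[start:end]
        pos = end
    out.extend(a[pos:])
    return b"".join(out)


class ToyRevlog:
    def __init__(self):
        self.data = bytearray()  # stands in for the append-only .d file
        self.index = []          # one (offset, length, base_rev) entry per revision

    def _chunk(self, rev: int) -> bytes:
        offset, length, _ = self.index[rev]
        return bytes(self.data[offset:offset + length])

    def add(self, text: bytes) -> int:
        rev = len(self.index)
        chunk, base = zlib.compress(text), rev  # default: full snapshot, its own base
        if rev > 0:
            prev_base = self.index[rev - 1][2]
            delta = zlib.compress(make_delta(self.read(rev - 1), text))
            # Bytes of all deltas since the last snapshot, including the new one.
            chain = sum(n for _, n, _ in self.index[prev_base + 1:rev]) + len(delta)
            if chain <= len(text):  # assumed threshold: size of the full text
                chunk, base = delta, prev_base
        self.index.append((len(self.data), len(chunk), base))
        self.data += chunk  # append-only: existing bytes are never rewritten
        return rev

    def read(self, rev: int) -> bytes:
        base = self.index[rev][2]
        text = zlib.decompress(self._chunk(base))  # start from the snapshot
        for r in range(base + 1, rev + 1):         # then replay the delta chain
            text = apply_delta(text, zlib.decompress(self._chunk(r)))
        return text
```

Reading a revision never needs more than the index entry, one snapshot and a bounded run of deltas, and everything for one tracked file lives in one index and one data store. The threshold is what keeps the chain short: once replaying deltas would cost more than just reading a snapshot, a snapshot is written instead.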

So, that’s the revlog storage format for you. Clearly, this means that there aren’t a gazillion files in the storage backend for your repository. Couple this with fast read operations (because of the index file) and you’ve got a system as fast as git, albeit git still appears faster at write operations. However, the analysis operations - like checking status, checking out a revision, getting diffs and the like - are faster. Also, no matter how large your history grows, there are always just two files for every file in your working repository. This means that mercurial doesn’t need to walk over those gazillion files - but git does. And this is what makes the difference in scaling, all other things having similar performance.

Conclusions; Opinions

Neither mercurial nor git was built to scale to the level of the Facebook repository. However, in recent years, mercurial has adapted better to the scalability problem while git has not. I think that this difference stems from the differences in architecture. In light of this, it will be exciting to see how git tackles this problem of scalability, though I scarce believe that they (git devs) ever will. Oh well.

Anyway, mercurial is fast now. It wasn’t always, though. We have Facebook devs to thank for a lot of recent advances in mercurial (aside from the open-source community, of course). Also, I think that it was an awesome decision to make mercurial in Python - this means that even C noobs like me can contribute to mercurial. This sparks my interest, and that of several others like me, if I can hazard a guess. With the revlog format, mercurial has truly become a viable alternative to git, especially in recent years. It’s good to see the development going strong and I hope that we get more good stuff to see in the coming years.

I’m out.