How Structure Affects Git's UX
It’s always interesting to me to compare different approaches to solving the same problem. Git and Mercurial are two version control systems that came out around the same time, trying to address very similar requirements. Git came from a very low-level systems perspective, whereas Mercurial spent a lot of effort on its user experience. Despite what you might think, their data models are remarkably similar. It was this observation that led me to start my side project, gg. I found myself greatly missing the experience of Mercurial, but I’ve resigned myself to the fact that Git is here to stay.
I came across a rather interesting challenge today while working on gg. I am trying to replicate the behavior of hg pull, and even though I’ve worked on gg for over a year now, I still haven’t reached a behavior that I’m satisfied with. I’ve finally realized why, and it boils down to a very subtle difference in the data models of the two systems.
At its core, Git has two concepts: commits and refs. A repository stores commits as a graph, with each commit pointing at its parents and no inherent ordering beyond that. Git considers any commit not reachable from a ref to be unimportant, and will delete such commits periodically. Such commits are typically the side effect of rebases or deleted local branches. Refs act like garbage collection roots, and in fact, Git calls this cleanup garbage collection. (Not the sort of judgement you want from a system safeguarding your code!)
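To make the reachability rule concrete, here is a minimal Python sketch of the idea. It is a toy model, not how Git is implemented: the dictionaries and function names are my own, and real Git also takes reflogs and expiry times into account before deleting anything.

```python
from typing import Dict, List, Set


def reachable(parents: Dict[str, List[str]], refs: Dict[str, str]) -> Set[str]:
    """Return every commit reachable from some ref by following parent links."""
    seen: Set[str] = set()
    stack = list(refs.values())  # refs are the garbage collection roots
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def collect_garbage(
    parents: Dict[str, List[str]], refs: Dict[str, str]
) -> Dict[str, List[str]]:
    """Keep only the commits that some ref can still reach."""
    live = reachable(parents, refs)
    return {commit: ps for commit, ps in parents.items() if commit in live}


# Toy history: a <- b <- c on main; d is a leftover from a rebase.
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["a"]}
refs = {"refs/heads/main": "c"}

survivors = collect_garbage(parents, refs)
print(sorted(survivors))  # ['a', 'b', 'c']
```

Commit d is still a perfectly valid commit, but because no ref leads to it, this model throws it away.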
Mercurial, on the other hand, makes no such distinction. All commits in the revlog are part of the observable history. Mercurial bestows a fairly arbitrary default ordering on the commits, which is the order in which they were appended to the revlog. Mercurial does have a concept similar to refs called bookmarks, but this concept was introduced later and thus isn’t fundamental. In Mercurial, one must explicitly remove commits from the revlog using a command like hg strip (which is provided by an extension) or mark them as obsolete with the new changeset evolution features. This is conceptually very simple: once you see a commit, it is now part of the history.
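The same kind of toy model for Mercurial looks quite different. This is a rough Python sketch under my own assumptions: ToyRevlog and its methods are hypothetical names, and the real revlog is an on-disk format with far more to it. The point is only that nothing here needs a ref to stay alive.

```python
from typing import Dict, List, Tuple


class ToyRevlog:
    """Append-only commit store, loosely modeled on the idea of a revlog."""

    def __init__(self) -> None:
        self._nodes: List[str] = []  # node ids in arrival order
        self._parents: Dict[str, Tuple[str, ...]] = {}

    def append(self, node: str, parents: Tuple[str, ...] = ()) -> int:
        """Store a commit and return its local revision number."""
        if node in self._parents:
            return self._nodes.index(node)  # already known, nothing to do
        self._nodes.append(node)
        self._parents[node] = parents
        return len(self._nodes) - 1

    def log(self) -> List[Tuple[int, str]]:
        """Every stored commit is visible, in append order."""
        return list(enumerate(self._nodes))


revlog = ToyRevlog()
revlog.append("a")
revlog.append("b", ("a",))
revlog.append("d", ("a",))  # a second head; no bookmark points at it

history = revlog.log()
print(history)  # [(0, 'a'), (1, 'b'), (2, 'd')]
```

Commit d shows up in the history simply because it was appended. Removing it would take an explicit operation, which this sketch deliberately does not offer.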
This difference in data model causes complexity higher up in Git. When fetching commits from another repository, it is very difficult to do so without configuring what Git calls “remotes”.
Fundamentally, the problem is that when git fetch adds commits to the repository, it needs to attach refs to those new graph leaves lest they be garbage collected. Remotes specify patterns for Git to use to create refs, thus preserving the new commits. Mercurial does not have this problem! In Mercurial, it simply appends the commits to the revlog, and there’s nothing more to do.
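Here is a hedged Python sketch of that contrast. The function names (map_ref, toy_git_fetch, toy_hg_pull) are hypothetical and the logic is heavily simplified. The refspec string is the pattern Git conventionally writes for a remote named origin; everything else is a toy.

```python
from typing import Dict, List, Optional


def map_ref(refspec: str, remote_ref: str) -> Optional[str]:
    """Translate a remote ref name using a '<src>:<dst>' pattern.

    Supports at most one '*' wildcard per side, which is all this toy needs.
    """
    src, dst = refspec.lstrip("+").split(":")
    if "*" not in src:
        return dst if remote_ref == src else None
    prefix, _, suffix = src.partition("*")
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    middle = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", middle, 1)


def toy_git_fetch(
    local_refs: Dict[str, str], remote_refs: Dict[str, str], refspec: str
) -> Dict[str, str]:
    """Fetched commits only survive if a local ref ends up pointing at them."""
    for name, commit in remote_refs.items():
        local_name = map_ref(refspec, name)
        if local_name is not None:
            local_refs[local_name] = commit
    return local_refs


def toy_hg_pull(local_log: List[str], remote_log: List[str]) -> List[str]:
    """Append whatever is new; no naming decision is required."""
    known = set(local_log)
    for node in remote_log:
        if node not in known:
            local_log.append(node)
            known.add(node)
    return local_log


remote_refs = {
    "refs/heads/main": "c3",
    "refs/heads/feature": "f1",
    "refs/tags/v1": "c1",
}
fetched = toy_git_fetch({}, remote_refs, "+refs/heads/*:refs/remotes/origin/*")
pulled = toy_hg_pull(["c1", "c2"], ["c1", "c2", "c3", "f1"])

print(fetched)
print(pulled)
```

The Git side cannot even be written without deciding where the new refs should live, which is exactly the configuration a remote carries. The Mercurial side takes no such parameter.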
Git’s choice to distinguish between reachable and unreachable commits resulted in more complexity for the end-user in the form of configuration. Mercurial made no such distinction and thus no configuration is required. The lesson I take away from this is that it is important to constantly reevaluate your data model as you build software. This is certainly not a new observation (Rob Pike famously wrote “Data structures, not algorithms, are central to programming.”), but it is one that cannot be stated often enough.