But try to understand
Try to understand
Try try try to understand
Git’s a magic command.
– Heart 💕
I knew git stored diffs somewhere. I mean, it’s obvious—right?
All git ever shows the casual user is a diff! My pull requests were diffs.
git show: diff.
git diff? Duh! It’s a diff, too.
But later, I learned the truth—git’s interface belies its internals. There’s a mismatch between what git shows you vs. how git works.
It’s challenging to wield git’s interface when your mental model of the internals is broken. And after I corrected my mental model of git’s internals, I was able to stop relying so heavily on git’s truly terrible interface.
In this post, I’ll attempt to explain all the deep details of git diff to my past self.
Git add makes blobs
We can add files to repos using git add. But behind the porcelain, git’s busy compressing and storing this file deep in its bowels. Git terms the results of this process a “blob.”
Git stores blobs (among other things) inside the .git/objects directory:
    $ git init
    Initialized empty Git repository in /tmp/bar/.git/
    $ echo "Hi, I'm blob" > foo
    $ git add foo
    $ tree .git/objects/
    .git/objects/
    └── 26
        └── 45aab142ef6b135a700d037e75cd9f1f1c94dc
But what’s in a blob? And why is this blob stored as 26/45aab142ef6b135a700d037e75cd9f1f1c94dc?
🗄️ Git stores things by their hash
git add foo stores the contents of foo under a name derived from those contents.
Git mapped our file to a number via a hash function.
A hash function maps data to a unique number (mostly)—whenever the data changes, the hash function’s output changes dramatically.
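To convince my past self that the output really does change dramatically, here is a small sketch using Python's standard hashlib module. It is only an illustration of SHA1 itself, not of anything git ships, and the helper names are my own:

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """Return the SHA1 digest of data as 40 hex characters."""
    return hashlib.sha1(data).hexdigest()

def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

first = sha1_hex(b"Hi, I'm blob\n")
second = sha1_hex(b"Hi, I'm blob!\n")  # one extra character

print(first)
print(second)
print(differing_hex_digits(first, second), "of 40 hex digits differ")
```

A one-character edit should scramble most of the digits, which is what makes a hash a handy fingerprint for content.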
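SHA1 is the hash function git uses by default. When we git add foo, git applies SHA1 to the contents of foo (Hi, I'm blob\n), prefixed with a short header that records the object’s type and size, and that spits out 2645aab142ef6b135a700d037e75cd9f1f1c94dc. Git uses the first two characters of that hash as a directory name and the remaining 38 as the filename.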
Blobs are all about content. The filename “foo” doesn’t matter at all! We could have named the file “🌈”—git still would have stored it in the same place. If the file contents are EXACTLY the same, then the hash will be exactly the same.
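Here is a rough sketch of that naming scheme in Python. It assumes the usual loose-object convention (SHA1 over a "blob <size>" header, a NUL byte, then the file's bytes), and the function names are hypothetical, not git's:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Hash content the way git names a blob: SHA1 over a header plus the bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def loose_object_path(object_id: str) -> str:
    """First two hex digits become the directory, the remaining 38 the filename."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"

# The filename never enters the calculation, only the bytes do.
foo = git_blob_id(b"Hi, I'm blob\n")
rainbow = git_blob_id(b"Hi, I'm blob\n")  # same bytes, imagine a different filename

print(loose_object_path(foo))  # compare with the tree output earlier in the post
print(foo == rainbow)
```

Notice that neither function takes a filename. That is the whole point: two files with the same bytes collapse into one blob.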
🌱 Git commit creates commits and trees
You already know git commit creates a commit, but what is a commit?
A commit is a type of object. Git uses the word “object” to mean a commit, a folder or directory (tree), a file (blob), or a tag. Git stores objects in its object database: everything inside the .git/objects directory.
    $ git commit -m 'Initial Commit'
    [main (root-commit) 0644991] Initial Commit
     1 file changed, 1 insertion(+)
     create mode 100644 foo
    $ tree .git/objects/
    .git/objects/
    ├── 06
    │   └── 449913ac0e43b73bfbd3141f5643a4db6d47f8
    ├── 26
    │   └── 45aab142ef6b135a700d037e75cd9f1f1c94dc
    └── 41
        └── 81320a57137264d436b2ef861c31f430256bf4
After our commit, the object database has three objects: 06449913, 2645aab1, and 4181320a.
We’ve already established that one of them is our blob (2645aab1), so let’s see if we can suss out the others.
✨ The magic command
The magic command to learn about any object is git cat-file -p. We can use that command to find out more about our mystery objects:
    $ git cat-file -p 06449913ac0e43b73bfbd3141f5643a4db6d47f8
    tree 4181320a57137264d436b2ef861c31f430256bf4
    author Tyler Cipriani <email@example.com> 1652310544 -0600
    committer Tyler Cipriani <firstname.lastname@example.org> 1652310544 -0600

    Initial Commit
This object (06449913) appears to be our commit. A commit is metadata compressed and stored inside git’s object database.
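To make "compressed and stored" a little less magical, here is a sketch that writes and reads a loose object by hand. It assumes the loose-object layout (zlib-compressed bytes of the form "type size", a NUL, then the body), ignores packfiles entirely, and uses a temporary directory as a hypothetical stand-in for .git/objects:

```python
import hashlib
import os
import tempfile
import zlib
from typing import Tuple

def write_loose_object(objects_dir: str, kind: bytes, body: bytes) -> str:
    """Store body as a loose object and return its hex id."""
    raw = kind + b" " + str(len(body)).encode("ascii") + b"\0" + body
    object_id = hashlib.sha1(raw).hexdigest()
    folder = os.path.join(objects_dir, object_id[:2])
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, object_id[2:]), "wb") as handle:
        handle.write(zlib.compress(raw))
    return object_id

def read_loose_object(objects_dir: str, object_id: str) -> Tuple[str, bytes]:
    """Return (type, body) for a loose object, roughly what cat-file inspects."""
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    with open(path, "rb") as handle:
        raw = zlib.decompress(handle.read())
    header, _, body = raw.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode("ascii"), body

with tempfile.TemporaryDirectory() as scratch:  # stand-in for .git/objects
    blob_id = write_loose_object(scratch, b"blob", b"Hi, I'm blob\n")
    kind, body = read_loose_object(scratch, blob_id)
    print(kind, body)
```

Pointing read_loose_object at a real .git/objects directory should work for objects that have not been packed yet, but treat that as an experiment rather than a guarantee.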
Some of the metadata is obvious, but then there’s a tree. And that tree points to our other mystery object, 418132. Let’s see what we can learn about our last remaining mystery object using our magic command:
    $ git cat-file -p 4181320a57137264d436b2ef861c31f430256bf4
    100644 blob 2645aab142ef6b135a700d037e75cd9f1f1c94dc    foo
So a tree is an object that stores a directory listing of objects by their SHA1s. And a commit is an object that points at a tree by recording the tree’s SHA1!
Commits point to trees, and trees point to blobs and other trees. Neat!
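The pointing can be sketched in a few lines of Python. This assumes git's usual serialization (a tree entry is "mode name", a NUL, then the 20 raw bytes of the child's hash; a commit is plain text that starts with a tree line). The author string and timestamp are invented for illustration, so the commit id will not match the one in this post:

```python
import hashlib
from typing import List, Tuple

def object_id(kind: bytes, body: bytes) -> str:
    """SHA1 over 'type size' + NUL + body, the scheme git uses to name objects."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries: List[Tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, hex id) entries into a tree object's body."""
    out = b""
    # Sorting by plain name is a simplification; git orders subdirectories slightly differently.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        out += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    return out

def commit_body(tree_id: str, message: str, who: str, when: str) -> bytes:
    """Serialize a parentless commit: a pointer to a tree plus metadata."""
    lines = [f"tree {tree_id}", f"author {who} {when}", f"committer {who} {when}", "", message]
    return ("\n".join(lines) + "\n").encode()

blob = object_id(b"blob", b"Hi, I'm blob\n")
tree = object_id(b"tree", tree_body([("100644", "foo", blob)]))
# Made-up identity and timestamp, purely for illustration.
commit = object_id(b"commit", commit_body(tree, "Initial Commit", "A U Thor <author@example.com>", "0 +0000"))
print(blob, tree, commit, sep="\n")
```

The tree's bytes contain the blob's hash, and the commit's bytes contain the tree's hash. That nesting is all "points to" means here.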
📈 Git’s dependency graph
So if we graphed the state of dependencies in our object database, we’d get something like this:

    commit 06449913 -> tree 4181320a -> blob 2645aab1 (foo)
The commit incorporates our tree, which includes our blob—everything depends on our blob!
So if we change even a single bit inside a single file, git will notice: everything is traceable from the commit down to the bit level. We get this for free by hashing objects and including those hashes in other objects.
This is the whole concept of a Merkle Directed Acyclic Graph (Merkle DAG)!
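Here is a toy Merkle chain in Python to show the ripple effect. It deliberately does not use git's real object format: node_id and snapshot are hypothetical helpers that only hash a node's data together with its children's ids:

```python
import hashlib
from typing import Dict, Sequence

def node_id(label: str, payload: bytes, children: Sequence[str] = ()) -> str:
    """Hash a node's own data together with the ids of the nodes it points to."""
    digest = hashlib.sha1()
    digest.update(label.encode() + b"\0" + payload + b"\0")
    for child in children:
        digest.update(child.encode())
    return digest.hexdigest()

def snapshot(file_bytes: bytes) -> Dict[str, str]:
    """Build a three-node chain (commit -> tree -> blob) and return each id."""
    blob = node_id("blob", file_bytes)
    tree = node_id("tree", b"foo", [blob])
    commit = node_id("commit", b"Initial Commit", [tree])
    return {"blob": blob, "tree": tree, "commit": commit}

before = snapshot(b"Hi, I'm blob\n")
after = snapshot(b"Hi, I'm blob?\n")  # a tiny edit to the file

for kind in ("blob", "tree", "commit"):
    print(kind, before[kind][:8], "->", after[kind][:8])
```

Change the file and the blob id changes, which changes the tree id, which changes the commit id. Nothing upstream can stay the same.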
🍔 So, where’s the diff?
When we type git diff, git presents us a diff. We know there are blobs and trees and commits, so where’s the diff!?
Git doesn’t store diffs anywhere at all! It derives diffs from what’s stored in the object database.
    $ echo "I'm ALSO blob" > baz
    $ git add baz
    $ git commit -m 'Add baz'
    $ tree .git/objects/
    .git/objects/
    ├── 06
    │   └── 449913ac0e43b73bfbd3141f5643a4db6d47f8
    ├── 26
    │   └── 45aab142ef6b135a700d037e75cd9f1f1c94dc
    ├── 41
    │   └── 81320a57137264d436b2ef861c31f430256bf4
    ├── 95
    │   └── 42599fac463c434456c0a16b13e346787f25da
    ├── 9b
    │   └── 2716e4540c11e8d590e906dd8fa5a75904810a
    └── e6
        └── 5a7344c46cebe61d052de6e30d33636e1cd0b4
We made a new commit, and now we have three new objects. We added a new file (blob), which made our directory different (tree), and we committed it (commit).
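Our graph now looks like this:

    commit e65a7344 -> parent commit 06449913
    commit e65a7344 -> tree 9542599f
    tree 9542599f   -> blob 2645aab1 (foo)
    tree 9542599f   -> blob 9b2716e4 (baz)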
You might be surprised by a few things in the graph:
- Our new commit stores its parent commit as metadata
- Our new tree points to our old blob, and our NEW blob
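Both surprises fall out of content addressing, which a toy object database can show. This sketch is not git's storage format: store, put, put_tree, and put_commit are made-up names, and the "tree" is just a text listing:

```python
import hashlib
from typing import Dict, Optional

store: Dict[str, bytes] = {}  # toy object database: id -> stored bytes

def put(kind: str, body: bytes) -> str:
    """Store body under the hash of its kind and bytes; repeats overwrite themselves."""
    object_id = hashlib.sha1(kind.encode() + b"\0" + body).hexdigest()
    store[object_id] = body
    return object_id

def put_tree(files: Dict[str, str]) -> str:
    """A tree here is just a sorted 'name id' listing (not git's binary format)."""
    listing = "\n".join(f"{name} {blob_id}" for name, blob_id in sorted(files.items()))
    return put("tree", listing.encode())

def put_commit(tree_id: str, message: str, parent: Optional[str] = None) -> str:
    """A commit records its tree, an optional parent commit, and a message."""
    parent_line = f"parent {parent}\n" if parent else ""
    return put("commit", f"tree {tree_id}\n{parent_line}\n{message}\n".encode())

foo = put("blob", b"Hi, I'm blob\n")
first_commit = put_commit(put_tree({"foo": foo}), "Initial Commit")
count_after_first = len(store)

baz = put("blob", b"I'm ALSO blob\n")
foo_again = put("blob", b"Hi, I'm blob\n")  # unchanged file: same id, nothing new stored
second_commit = put_commit(put_tree({"foo": foo_again, "baz": baz}), "Add baz", parent=first_commit)

print(count_after_first, len(store))  # prints: 3 6
```

The second commit adds exactly three objects (one blob, one tree, one commit) because the unchanged file hashes to an id the store already holds.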
So now what happens when we try git diff:
    $ git diff 064499..e65a73
    diff --git a/baz b/baz
    new file mode 100644
    index 0000000..9b2716e
    --- /dev/null
    +++ b/baz
    @@ -0,0 +1 @@
    +I'm ALSO blob
Git compares the two commits, finds their trees, sees a new blob in the second commit, and shows you the diff for the new file, baz.
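Deriving a diff on demand can be sketched with Python's standard difflib. This is not git's diff machinery: diff_snapshots is a hypothetical helper that compares two name-to-content snapshots directly, where git would first compare the blob ids recorded in each tree:

```python
import difflib
from typing import Dict, List

def diff_snapshots(old: Dict[str, bytes], new: Dict[str, bytes]) -> List[str]:
    """Compare two name -> content snapshots and build unified diffs on demand."""
    lines: List[str] = []
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before == after:
            continue  # identical content (git would see identical blob ids) is skipped
        lines.extend(difflib.unified_diff(
            (before or b"").decode().splitlines(keepends=True),
            (after or b"").decode().splitlines(keepends=True),
            fromfile=f"a/{name}" if before is not None else "/dev/null",
            tofile=f"b/{name}" if after is not None else "/dev/null",
        ))
    return lines

first = {"foo": b"Hi, I'm blob\n"}
second = {"foo": b"Hi, I'm blob\n", "baz": b"I'm ALSO blob\n"}
print("".join(diff_snapshots(first, second)), end="")
```

Nothing diff-shaped is stored anywhere in this sketch. The diff exists only for as long as it takes to print it.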
No diffs. Just Merkle DAGs. And now you know.