Rewriting Git project history with The BFG | Info | theguardian.com

Fast & simple Git history rewrites with The BFG

Git stores your entire project history, and its commits are immutable - so how can you remove unwanted data stored deep in your repository's past? The answer was once 'git-filter-branch' - the new alternative is The BFG
Raspberry Pi rewrites a Git project
Raspberry Pi running The BFG, cleaning commits about twice as fast as a quad-core Mac running 'git-filter-branch'

Some things you shouldn't share: passwords, private keys, and unwanted gigabytes of random data. Something you do want to share is your code, in a Git repository, which... oh, damn - has a bunch of that awful stuff checked in from ages ago. Maybe some of those things are still in your latest commits, maybe they were deleted soon after they were added, but now you have to clean your Git history - to make it as though those things were never there.

The traditionally recommended tool for doing this is the magnificent chainsaw git-filter-branch. It's a little complicated to use, and on a big repo it can take many hours to run. When I came to think about it, I realised I could create something much faster.

You might not need a chainsaw. You might need The BFG.

A demonstration comparing speed between 'git-filter-branch' and The BFG

The BFG is 10-50 times faster than git-filter-branch at removing unwanted data - it can turn an overnight job into one that takes less than five minutes.

Why is changing history such a tricky thing to do?

There's this infamous video of Linus Torvalds giving a talk at Google about Git. Linus mentions a guarantee that he regarded as a key feature in Git - simply, that you always get out what you put in (apparently few other source-control systems made that guarantee, prompting much incredulity from Linus). To make this guarantee watertight, Linus enforced it in the simplest way possible: the id for every object stored in Git (file, folder, or commit) is a strong hash derived from the contents of that object itself.

You can read about the Git object model in detail on git-scm.com - the important thing to understand here is that each specific version of a file, or folder, or commit is an object in Git, and is assigned an immutable hash-id which exactly identifies its contents. If you were to change the object even slightly, its id would change completely. If you ever have an object that's identical to some object you've had before, their ids will match exactly, and only one copy will ever be stored.
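To make that concrete, here is a minimal Python sketch of how Git derives the id of a file (a 'blob'), assuming a repository that uses the default SHA-1 object format. The function name is mine, not part of any Git tooling; the header format is the one Git itself uses.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    # Git hashes a short header ("blob <size>" plus a NUL byte)
    # followed by the raw file bytes. Assumes the SHA-1 object format.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

original = git_blob_id(b"password = hunter2\n")
tweaked = git_blob_id(b"password = hunter3\n")
duplicate = git_blob_id(b"password = hunter2\n")

print(original == duplicate)  # True: identical content, identical id
print(original == tweaked)    # False: one changed byte, a different id
```

Because the id is a pure function of the content, two identical files anywhere in your history share one id and one stored copy.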

The id of a commit is particularly precise - its id depends not only on its content, but also on the ids (and thus content) of all the commits that came before it. The id of a commit embodies its entire history.

Changing any commit in Git history means creating a new copy of every single commit that comes after it - all of the commits need to reference the updated commit id of their parent. It has to be done sequentially, going from oldest commit to newest - you can't create a clean commit without first having the commit id of its cleaned parent.
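The knock-on effect is easy to see in a toy model. The sketch below is not Git's real commit format: it simply assumes a commit id is a hash over a snapshot label, the parent's id, and the message, which is enough to show why everything after a changed commit has to be rebuilt in order.

```python
import hashlib
from typing import List, Optional, Tuple

def commit_id(tree: str, parent_id: Optional[str], message: str) -> str:
    # Toy stand-in for a commit object: snapshot label, parent id, message
    payload = f"tree {tree}\nparent {parent_id or ''}\n\n{message}"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

def build_history(commits: List[Tuple[str, str]]) -> List[str]:
    # Walk oldest to newest: each id needs the id of its parent first
    ids = []
    parent = None
    for tree, message in commits:
        parent = commit_id(tree, parent, message)
        ids.append(parent)
    return ids

history = [
    ("tree-a", "initial commit"),
    ("tree-b-with-secret", "add config"),
    ("tree-c", "add feature"),
    ("tree-d", "fix bug"),
]
cleaned = list(history)
cleaned[1] = ("tree-b-clean", "add config")  # scrub the second commit only

before = build_history(history)
after = build_history(cleaned)
print([old != new for old, new in zip(before, after)])
# [False, True, True, True]
```

Only one commit was edited, yet every later commit gets a new id, because each one hashes the id of its parent.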

How does The BFG do this faster than git-filter-branch?

git-filter-branch steps through every commit in your history, executing whatever shell scripts you gave it against the contents - the full file tree - of each commit (so you can write a bash script to, for instance, delete a file), and this gives you a crazy amount of power. Too much power.

For each commit you clean, only a small amount of data will have changed - but your bash script is running over the entire file tree of the commit. You're cleaning the same damn files over and over again. That is slow and, speaking broadly, totally freakin' redundant.

Remember that for a given set of file contents, that file will only be stored once in the Git DB. Remember that a folder containing files & sub-folders will only be stored once, if the files & sub-folders have not changed. Why clean those precise files more than once? Git is begging you not to repeat yourself.

This is the idea of The BFG: Clean a given Git object once. Remember the result: Store the 'dirty' id and the 'clean' id in a simple map, and every time you encounter an object (file or folder) while cleaning a commit, check its id to see if you've cleaned it before, and if you have, just use the cleaned object you stored from last time. Frequently, you get a big win and a massive sub-folder does not have to be cleaned, because you already have the Git-id of what it looks like when it's been purged.
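Here is a rough Python sketch of that memoisation idea. It is a toy model rather than The BFG's actual Scala code: the in-memory `store`, the `put` and `make_tree` helpers, and the `Cleaner` class are all hypothetical names, and 'cleaning' here just means dropping files with a banned name.

```python
import hashlib

store = {}  # object id -> object (bytes for a file, tuple of entries for a folder)

def put(obj) -> str:
    # Store an object under an id derived from its contents
    oid = hashlib.sha1(repr(obj).encode("utf-8")).hexdigest()
    store[oid] = obj
    return oid

def make_tree(entries: dict) -> str:
    # A folder is a sorted tuple of (name, child id) pairs
    return put(tuple(sorted(entries.items())))

class Cleaner:
    def __init__(self, banned_names):
        self.banned = set(banned_names)
        self.memo = {}            # 'dirty' id -> 'clean' id
        self.folders_cleaned = 0  # how many folders we really had to process

    def clean(self, oid: str) -> str:
        if oid in self.memo:      # cleaned before: reuse the stored result
            return self.memo[oid]
        obj = store[oid]
        if isinstance(obj, bytes):
            clean_id = oid        # file contents are left untouched in this sketch
        else:
            self.folders_cleaned += 1
            kept = {name: self.clean(child)
                    for name, child in obj if name not in self.banned}
            clean_id = make_tree(kept)
        self.memo[oid] = clean_id
        return clean_id

# Two commits share the same big 'assets' folder; only README differs
assets = make_tree({"logo.png": put(b"png"), "passwords.txt": put(b"hunter2")})
commit1 = make_tree({"assets": assets, "README": put(b"v1")})
commit2 = make_tree({"assets": assets, "README": put(b"v2")})

cleaner = Cleaner({"passwords.txt"})
clean1 = cleaner.clean(commit1)
clean2 = cleaner.clean(commit2)
print(cleaner.folders_cleaned)  # 3: two root folders, but 'assets' only once
```

The second commit never descends into 'assets' at all: the map already holds its cleaned id, so the lookup replaces the whole sub-tree walk.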

This kind of structure is also very amenable to parallelism, so while you have to clean commits in order, you can still actually fire off a ton of parallel workers to clean their file contents, and get good use of all the CPU cores in your computer.
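The shape of that split can be sketched in Python, under some loud assumptions: commits are modelled as plain dictionaries, `scrub` is a hypothetical cleaner that blanks out one known secret, and a thread pool stands in for the workers. In CPython, threads won't actually speed up CPU-bound work like this, so read it as an illustration of the structure, not a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor

def scrub(content: bytes) -> bytes:
    # Hypothetical cleaner: blank out one known secret
    return content.replace(b"hunter2", b"***REMOVED***")

# Each commit is modelled as {path: file content}, listed oldest first
commits = [
    {"config.ini": b"password=hunter2\n", "README": b"hello\n"},
    {"config.ini": b"password=hunter2\n", "README": b"hello again\n"},
]

# Unique file contents across all history: each needs scrubbing only once
unique_blobs = list({content for commit in commits for content in commit.values()})

# File contents are independent of each other, so they can be cleaned in parallel
with ThreadPoolExecutor() as pool:
    cleaned = dict(zip(unique_blobs, pool.map(scrub, unique_blobs)))

# Commits are still rebuilt strictly in order, oldest to newest
clean_commits = [
    {path: cleaned[content] for path, content in commit.items()}
    for commit in commits
]
print(clean_commits[0]["config.ini"])  # b'password=***REMOVED***\n'
```

The ordering constraint only applies to the commits themselves; the file contents they point at have no such dependency on each other.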

Clean your room: Don't give me the details

Given the hash-id for a folder or file, you know everything about the contents of those things - you can't easily see what commits they belong to, or where they sit in the file tree (could be in lots of different places), but hey: If all you want to do is strip bad data out of those folders, does it matter what commits they live in, or how far down the folder tree they sit?

If you're removing passwords or just plain big files, the answer is No. You just want them gone. The repo won't get smaller unless you delete all the copies of that big file. The internet will still laugh at you if it finds a password in your repo, no matter where it is.

We don't really care where the bad stuff is: we just want it burnt, with the fire of a thousand suns.

Just a little simpler

The BFG is focused on doing a subset of the things you could do with git-filter-branch - mainly removing unwanted data, so its command interface is a little simpler to use too. This is how you delete a file with git-filter-branch:

$ git filter-branch \
    --index-filter 'git rm --cached --ignore-unmatch myfile.txt' \
    --tag-name-filter cat -- --all

This is the same thing with The BFG:

$ bfg --delete-files myfile.txt

There are a few other nice touches too - for instance, while cleaning your commits, if it sees you've used a commit id as part of your commit message, e.g.:

This fixes the bug that was introduced with the refactor in 35c48634.

...then it'll update the commit message, substituting '35c48634' with the cleaned id of that commit, so the reference still makes sense.
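As an illustration of the idea (not The BFG's own implementation), the substitution could look something like this in Python. The full 40-character ids in `id_map` are made up for the example, and the rule of leaving unknown or ambiguous abbreviations alone is my assumption.

```python
import re

# Hypothetical map from old ('dirty') commit ids to rewritten ('clean') ids
id_map = {
    "35c48634f1a0c2d7e9b8a6f5d4c3b2a190817263":
        "9f2e1d0cb7a69584736251403f2e1d0cb7a69584",
}

# Anything that looks like a full or abbreviated hex commit id
HEX_ID = re.compile(r"\b[0-9a-f]{7,40}\b")

def rewrite_message(message: str, id_map: dict) -> str:
    def substitute(match):
        short = match.group(0)
        candidates = [old for old in id_map if old.startswith(short)]
        if len(candidates) != 1:   # unknown or ambiguous: leave the text alone
            return short
        # Keep the abbreviation the same length as the one the author wrote
        return id_map[candidates[0]][:len(short)]
    return HEX_ID.sub(substitute, message)

print(rewrite_message(
    "This fixes the bug that was introduced with the refactor in 35c48634.",
    id_map,
))
# This fixes the bug that was introduced with the refactor in 9f2e1d0c.
```

Words that merely happen to be spelled with hex letters pass through unchanged, because they don't match the start of any id in the map.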

Anything else?

The BFG's open-source, written in Scala (which means you can create your own custom cleaners in any JVM-based language) and uses the wonderful JGit library for all Git operations (so it doesn't need to do fork-and-execs which would slow it down). It's been used on four different big projects within the Guardian, and on many projects out there in the big wide world. I'd love to hear your feedback if you use it, and hope you've enjoyed hearing about it.

The BFG is also featured in an interview with Roberto Tyley on the GitMinutes podcast. You may also be interested in gu:who, an automated bot for auditing membership of GitHub organisations.