Git: Removing sensitive data and rewriting history

April 27, 2012

This post was originally published in the Rambling Labs Blog on April 27, 2012.

One thing that tends to happen with Git is carelessness about what you add to your repository. I’ve seen this happen since I began using Git a couple of years ago, and it keeps happening over and over again, for various reasons.

On one of the projects I worked on a long time ago, we added lots of files with sensitive information, including emails and passwords for accounts on certain services. We also added a lot of images that weren’t really needed, and some of the ones that were needed weren’t the optimal size either. We were just starting out with Git, and we didn’t know what to include or exclude.

To make a long story short, that repository was already 800MB in size the last time I checked… 800MB!!! Not cool. Not cool at all.

The main cause of this repository growing so big was the accidental addition of big files in one commit, followed by their removal in a subsequent commit. This is a novice mistake. As a beginner, one tends to forget (or simply ignore) that Git will still keep the file in its object database, for history purposes. Now, repeat that process several times with big enough files, and you’ll end up with a repository of 500MB, even when the project itself only totals, say, 10MB.
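The effect above is easy to reproduce. Here is a minimal sketch (a throwaway repository with a hypothetical `big.bin` standing in for a real asset) showing that a file deleted in a later commit still lives in the repository’s object database:

```shell
set -e
cd "$(mktemp -d)"
git init -q . && git config user.email demo@example.com && git config user.name demo

# Commit a "big" file (1 MB of zeros standing in for a real asset)...
dd if=/dev/zero of=big.bin bs=1024 count=1024 2>/dev/null
git add big.bin && git commit -qm "add big file"

# ...then delete it in the next commit, as a beginner would.
git rm -q big.bin && git commit -qm "remove big file"

# The working tree is clean, but the blob is still reachable from history:
git rev-list --objects --all | grep big.bin
```

The final command still lists `big.bin`, which is exactly why the repository stays big no matter how many deletion commits you pile on top.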

A size like this makes the repository really hard to clone and to handle in general, which is not ideal if you want to maintain a good pace, especially for people new to the project.

So, what do you do? Rewrite history. How do you do it? I’ll cover two cases: removing files from the repository’s history, and changing a file’s contents.

Disclaimer: This is not recommended for projects with many contributors, because some processing has to be done in each local environment to ensure that no one re-adds the removed files or changes.

Removing files from the history with ‘git filter-branch --index-filter’

This is the way to go when you have added files accidentally, be it files with sensitive data such as passwords or just big files, and you want to remove them completely from the repository’s history.

Basically, what we want to do is simply remove the file from all commits and, also, prune the empty ones after they have been rewritten.

The GitHub guys have a great tutorial for this, which I completely recommend.

Now, let’s say, for example, that I have hundreds of .mp3 files added to my repository, located in a resources directory, and I would like to remove them from the repository’s history. You would do something like this:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch -- resources/*.mp3' --prune-empty -- --all

Note that we are specifying the --index-filter option, and that it receives a command to execute against each of the repository’s commits. The trailing -- --all tells filter-branch to rewrite every branch; without it, only the current branch is rewritten. You can also restrict the rewrite to a commit range, as you would with any other git command.
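For instance, here is a sketch of a range-limited rewrite in a throwaway repository (the tag name `v1.0` and the file names are made up for the demo): only the commits after the tag are rewritten, so everything at or before it keeps its original hashes.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip modern git's filter-branch warning delay
cd "$(mktemp -d)"
git init -q . && git config user.email demo@example.com && git config user.name demo

echo code > keep.txt && git add keep.txt && git commit -qm "before the tag"
git tag v1.0
mkdir resources && echo not-really-audio > resources/song.mp3
git add resources && git commit -qm "oops, committed an mp3"

# Rewrite only v1.0..HEAD; the commit at v1.0 is left untouched.
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch -- "resources/*.mp3"' \
  --prune-empty -- v1.0..HEAD
```

After the rewrite, the mp3 commit (which touched nothing else) is pruned, so HEAD ends up pointing at the same commit as v1.0.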

So, what this will do is evaluate the passed command (removing the .mp3 files from the index, in this case) for each commit, and then generate a new commit. Finally, note that we specify the --prune-empty option so that commits which end up empty are pruned. This means any commit that touched only the deleted files will be removed as well.

Changing a file’s content with ‘git filter-branch --tree-filter’

Now, if you want to change a file’s content instead of removing it, then the --tree-filter option is the way to go. Like the --index-filter option, this one receives a command, but it is executed against a checkout of each commit’s full tree rather than against the index.

So, if, for example, you added a password accidentally to the config/database.yml of your app and you would like to just change it to a dummy password, you can do something like this:

git filter-branch --tree-filter 'test -f config/database.yml && sed -i "s/the_password/the_dummy_password/g" config/database.yml || true' -- --all

Here, I’m using sed -i, which edits the files in place instead of writing the changes to standard output. As you can see, I don’t need to re-add the file to the Git index after making the changes, since the --tree-filter option automatically commits any modifications left in the tree.
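One portability caveat: on BSD/macOS sed, -i requires a (possibly empty) backup-suffix argument, so the bare GNU-style invocation fails there. Passing an explicit suffix like -i.bak works with both flavors, as long as you delete the backup before the tree-filter commits it. Here is a self-contained sketch of the whole flow, using hypothetical password strings and a guard for commits where the file doesn’t exist yet:

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
cd "$(mktemp -d)"
git init -q . && git config user.email demo@example.com && git config user.name demo

mkdir config
echo "password: the_password" > config/database.yml
git add config && git commit -qm "add config"
echo "# comment" >> config/database.yml
git add config && git commit -qm "tweak config"

# -i.bak (explicit backup suffix) is accepted by both GNU and BSD/macOS sed;
# the .bak file is removed so the tree-filter does not commit it.
git filter-branch --tree-filter '
  if [ -f config/database.yml ]; then
    sed -i.bak "s/the_password/the_dummy_password/g" config/database.yml
    rm -f config/database.yml.bak
  fi
' -- --all

# Verify: no surviving commit still contains the old password.
git grep the_password $(git rev-list HEAD) -- config/database.yml || echo "all clean"
```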

Making these changes available to others

Disclaimer: A forced push is not recommended either, but the other workarounds I’ve found aren’t very pretty.

In order to make the changes available to others (or to your future self), you can force a push. This ensures that everyone who clones the repository from then on gets the new version, be it with the files deleted or with the sensitive data changed. It is important to note, though, that this only works for people who make a fresh clone of the repository. If you have other contributors and you would like them to have the new version, they will have to either make a fresh clone of the repository or run the same commands before performing a pull.
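The force push itself looks like `git push --force origin --all` (plus `--tags` if tags were rewritten); --force is required because the rewritten commits do not descend from the old ones. A sketch of the round trip, using a local bare repository as a stand-in remote and made-up file names:

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/repo"
cd "$work/repo"
git config user.email demo@example.com && git config user.name demo

echo secret > passwords.txt && git add . && git commit -qm "oops"
echo code > app.txt && git add . && git commit -qm "real work"
git push -q origin HEAD

# Rewrite the history, then force-push every branch (and tags, if any):
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch passwords.txt' --prune-empty -- --all
git push --force origin --all
git push --force origin --tags
```

After the push, the remote’s branches point only at the rewritten commits, so new clones never see passwords.txt.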

I want to emphasize that your other contributors should either download a fresh clone of the repository or run the commands. The problem with a person on the older version of the repository performing a pull is that it will merge the old commits (with the files still added, or the sensitive data intact) with the new ones. This would result in some “duplicated” commits, and would re-add the deleted files or override the changes we’ve made. So be very careful with that.
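A contributor who doesn’t want to re-clone can instead fetch and hard-reset onto the rewritten branch, discarding the old local history rather than merging it. A sketch with two hypothetical contributors, “alice” (who rewrites) and “bob” (who has a stale clone):

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/alice"
cd "$work/alice"
git config user.email alice@example.com && git config user.name alice
echo secret > passwords.txt && git add . && git commit -qm "oops"
git push -q origin HEAD
branch=$(git symbolic-ref --short HEAD)

# Bob clones before the rewrite, so his clone has the old history.
git clone -q "$work/remote.git" "$work/bob"

# Alice rewrites history and force-pushes it.
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch passwords.txt' -- --all
git push -q --force origin "$branch"

# Bob must NOT pull; he fetches and hard-resets onto the rewritten branch.
cd "$work/bob"
git fetch -q origin
git reset -q --hard "origin/$branch"
```

Note that reset --hard throws away any local commits Bob had on top of the old history; those would have to be cherry-picked onto the new history by hand.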

One last thing I want to point out is that the old commits will still remain on the remote repository, just without any references pointing to them.
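The same is true locally: filter-branch keeps backup refs under refs/original/, and the reflog still points at the old commits, so the repository doesn’t actually shrink until those are gone and garbage collection runs. A sketch of the cleanup, wrapped in a throwaway repository (hypothetical file names) so the before/after is visible:

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
cd "$(mktemp -d)"
git init -q . && git config user.email demo@example.com && git config user.name demo
echo topsecret > passwords.txt && git add . && git commit -qm "oops"
echo code > app.txt && git add . && git commit -qm "real work"
blob=$(git rev-parse HEAD~1:passwords.txt)   # the blob we want gone

git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch passwords.txt' --prune-empty -- --all

git cat-file -e "$blob" && echo "blob still present before cleanup"

# The actual cleanup: drop the backup refs, expire the reflog, then let
# gc delete the now-unreachable objects.
git for-each-ref --format='delete %(refname)' refs/original |
  git update-ref --stdin
git reflog expire --expire=now --all
git gc --prune=now --aggressive

git cat-file -e "$blob" || echo "blob really gone"
```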

To finish this one off, I just want to clarify that I’ve used this mostly on personal projects or ones with a small number of contributors. I’ve also used these techniques when a repository grew too big, past 700MB, but always in a controlled environment.

Well that’s it for now. Remember to use this wisely and carefully! You don’t want to piss off your coworkers with funky git log outputs.