Git - Large File Storage

Git rules the development world like no other tool. Sooner or later, photos, diagrams, or other documents end up next to your code so that you can keep track of them. But is it really OK to store large files or even binaries in a code repository?



Git

Git is a powerful source code management and versioning system that belongs in every developer's toolbox. Thanks to its distributed design, every developer works on their own copy of the code, and no file locking is required. It tracks changes to your source code and makes it easy to work with others.

In the past, I wrote a couple of articles about Git. For example, you can find beginner articles like "Git - Getting Started", but also more advanced articles like "Git - Strategies".

Large Files

Every so often, you will need to store larger files in your repository. This might be a PDF file, some pictures, binary blobs, or something else. In general, Git can handle these quite well, but there is one big issue.

Let's assume you have a binary file that changes with every commit, and that it is only 1 MB in size. Git cannot produce a meaningful diff for it, and binary data often compresses and deltas poorly, so in the worst case every version is stored in full in the history. With 100 commits, you will already have accumulated around 100 MB. This is not an issue if you already have the repository, but cloning it will take noticeably longer.
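If you want to see this growth for yourself, a throwaway repository is enough. The following is a minimal sketch, assuming git, head, and du are available. The file name asset.bin, the commit count, and the demo identity are made up for illustration. It commits a fresh 1 MB random file five times and prints how much space the .git directory uses afterwards.

```shell
#!/bin/sh
# Sketch: watch .git grow when a binary file changes in every commit.
# asset.bin and the demo identity are hypothetical values for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q .

commits=5
i=1
while [ "$i" -le "$commits" ]; do
  # Random data stands in for a binary that compresses and deltas poorly.
  head -c 1048576 /dev/urandom > asset.bin
  git add asset.bin
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "binary version $i"
  i=$((i + 1))
done

# Disk usage of the .git directory in KiB after all commits.
size_kb=$(du -sk .git | cut -f1)
echo "$commits commits of a 1 MB file: .git uses ${size_kb} KiB"
```

Random data is the worst case. Real assets may compress somewhat better, but the trend is the same.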

In bigger projects with thousands of commits, pictures, screenshots, PDF documents and much more, this might become a real pain. I have seen repositories with over 4 GB of accumulated assets, but only ~90 MB in the most recent branch.

This can be improved with Git LFS.
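Before reaching for LFS, it can help to know which files in a working copy are actually large. The following is a rough sketch, assuming a find implementation that understands the k size suffix (GNU and BSD find do). The 1 MiB threshold, the function name list_large_files, and the demo files are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch: list files larger than 1 MiB in a working copy, skipping .git.
# The threshold and the function name are assumptions; adjust to taste.
list_large_files() {
  dir=$1
  find "$dir" -type f -size +1024k ! -path '*/.git/*' | sort
}

# Hypothetical demo tree with one large and one small file.
demo=$(mktemp -d)
mkdir -p "$demo/docs"
head -c 2097152 /dev/zero > "$demo/docs/manual.pdf"
printf 'small\n' > "$demo/README.md"

large=$(list_large_files "$demo")
echo "$large"
```

Pointing list_large_files at your own repository gives a first idea of which extensions are worth considering.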

Git LFS

Let's have a look at Git LFS and how it helps to store large files, so that you don't need to download everything at once, save some bandwidth, and handle large files more conveniently.

Requirements

First things first. You need to ensure that your installed Git client supports Git LFS. In Fedora, this is included in the git-lfs package, but not in the default git-core package. Therefore, you need to install it, if you haven't already.

# Fedora/AlmaLinux/CentOS
$ sudo dnf install git-lfs

# Fedora IoT/Silverblue
$ rpm-ostree install git-lfs
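To double-check the result on any distribution, a quick test like the following can help. It is only a sketch, assuming a POSIX shell and that the extension is installed as a binary named git-lfs on your PATH. The variable name lfs_status is made up for illustration.

```shell
#!/bin/sh
# Sketch: check whether the git-lfs binary is on PATH before relying on it.
if command -v git-lfs >/dev/null 2>&1; then
  lfs_status="installed"
  # Print the installed version for reference.
  git-lfs version
else
  lfs_status="missing"
fi
echo "git-lfs: $lfs_status"
```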

Furthermore, your Git hosting provider must support Git LFS. GitHub, Gitea, Bitbucket, and GitLab do, but may impose some limitations. Please check the documentation of your provider to ensure that your use case is supported.

How it works

So, how does Git LFS work? At a high level, this is pretty easy. Large files are stored in a separate data store, which is linked to your repository. This means that all marked files live on dedicated storage, and only a reference to them is maintained in the repository itself. Git LFS does this by committing small pointer files in place of the actual content.

Git LFS does this seamlessly: in your local working copy, you only see the actual file content, not the pointers. You can use all the familiar commands like git add, git commit, or git pull to interact with it.
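Behind the scenes, such a pointer is a small text file with three fields: a spec version, the SHA-256 of the real content, and its size in bytes. The following sketch prints text in that shape for a sample file. It is for illustration only, since git lfs generates the real pointers itself. The function name make_pointer and the sample file are assumptions, and the sketch relies on sha256sum being available (as on most Linux systems).

```shell
#!/bin/sh
# Sketch: print text in the shape of a Git LFS pointer for a given file.
# Illustration only; git lfs creates real pointers on its own.
make_pointer() {
  file=$1
  # Object ID: SHA-256 of the file content.
  oid=$(sha256sum "$file" | cut -d ' ' -f1)
  # Size of the real content in bytes.
  size=$(wc -c < "$file" | tr -d ' ')
  printf 'version https://git-lfs.github.com/spec/v1\n'
  printf 'oid sha256:%s\n' "$oid"
  printf 'size %s\n' "$size"
}

# Hypothetical sample file standing in for a large PDF.
sample=$(mktemp)
printf 'hello\n' > "$sample"
pointer=$(make_pointer "$sample")
echo "$pointer"
```

No matter how large the real file is, the pointer stays at a few lines, which is why the repository itself remains small.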

How to do it

After getting the requirements done and deciding that you want to use Git LFS, you might wonder: "How can I make it work?" Actually, this is super easy.

Let's assume you have a bunch of PDF files that you want to push to LFS instead of the code repository. First, you need to set up the extension for your repository. This registers the LFS filters in your Git configuration and updates your Git hooks.

$ git lfs install
Updated Git hooks.
Git LFS initialized.

Next, you need to track the desired files to make Git aware that these should be handled as "large files" and therefore pushed to LFS.

$ git lfs track "*.pdf"
Tracking "*.pdf"

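Because .gitattributes patterns follow the same matching rules as .gitignore, a pattern without a slash such as *.pdf already matches PDF files in every subdirectory, not only in the root of the repository. If you prefer to spell this out, the following form is equivalent: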

$ git lfs track "**/*.pdf"
Tracking "**/*.pdf"

But wait, what does this actually do? Nothing fancy. It just updates your .gitattributes file. Let's take a look.

$ cat .gitattributes 

*.pdf filter=lfs diff=lfs merge=lfs -text
**/*.pdf filter=lfs diff=lfs merge=lfs -text

The syntax is pretty similar to a .gitignore file. If you want to dig deeper and find more options for these files, consider checking out the documentation for gitattributes and gitignore.
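If you are unsure whether a given path is covered by these rules, you can ask Git directly with git check-attr. The following is a small sketch in a throwaway repository, assuming only that git is installed. The file names docs/manual.pdf and README.md are hypothetical, and the .gitattributes content is written by hand to mirror the output shown above.

```shell
#!/bin/sh
# Sketch: ask Git which paths the LFS filter applies to.
# The repository and the file names are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Same rules as shown above, written by hand instead of via git lfs track.
cat > .gitattributes <<'EOF'
*.pdf filter=lfs diff=lfs merge=lfs -text
**/*.pdf filter=lfs diff=lfs merge=lfs -text
EOF

# check-attr works on path names; the files do not need to exist.
attrs=$(git check-attr filter -- docs/manual.pdf README.md)
echo "$attrs"
# docs/manual.pdf: filter: lfs
# README.md: filter: unspecified
```

A path reported with filter: lfs is handled by LFS, while unspecified means it is stored as a regular Git object.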

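Now that this is out of the way, you can push and pull your content as usual. Make sure that the .gitattributes file is committed as well, so that everyone working on the repository gets the same tracking rules.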

# Stage changes
$ git add .

# Commit changes
$ git commit -m "using gitlfs"

# Push changes
$ git push

Pulling, pushing, fetching, merging, etc. work exactly like before.

To stop tracking a file or extension, you can either edit the .gitattributes file or untrack the pattern.

$ git lfs untrack "**/*.pdf"
Untracking "**/*.pdf"

And what if you want to fetch all LFS objects from a remote, not only the ones needed for your current checkout?

$ git lfs fetch --all

You might also want to get rid of old local copies to keep your disk tidy.

$ git lfs prune

There is way more to LFS, gitattributes, and gitignore. Below, you can find some documentation and articles on these topics. Please also check out the articles on my blog.

Git - Getting Started
If you just started with development, you may have never heard of Git or Source Code Management at all. This article will provide an overview why you should get in touch with Git and how to use it for your projects.
Git - Strategies
Using Git is very common among developers nowadays. But, how do you work together on Git? What is the best strategy when it comes to discussions about trunks, mono-repos and what-not? Well, let’s dig into this.
Git - gitattributes Documentation
Git - gitignore Documentation
Git LFS - large file storage | Atlassian Git Tutorial
Git LFS is a Git extension that improves handling of large files by lazily downloading the needed versions during checkout, rather than during clone/fetch.

Conclusion

That's already it for tracking large files and making use of Git LFS. It's a nice way to reduce traffic and save some bandwidth without giving up on binary data.

I would love to know if you use LFS already. Which files do you track or consider tracking? Why do you do so?