Git - Large File Storage
Git rules the development world like no other tool. At some point, photos, diagrams or other documents end up next to your code so that they can be tracked, too. But is it really OK to store large files or even binaries in your repository?
Git
Git is a powerful source code management and versioning system that should be in the toolbox of every developer. Because it is distributed, every developer works on their own copy of the repository and no locking of files is required. It tracks changes to your source code and makes it easy to work with others.
In the past, I wrote a couple of articles about Git. For example, you can find beginner articles like "Git - Getting Started", but also more advanced articles like "Git - Strategies".
Large Files
Every so often, you will need to store some larger files in your repository. This might be a PDF file, some pictures, binary blobs or something else. In general, Git can handle these quite well, but there is also a huge issue.
Let's assume you have a binary file that changes with every commit, and that it is only 1 MB in size. Git cannot produce a meaningful diff for it, so it keeps a full copy of every version in the repository history. With 100 commits, you will already have accumulated roughly 100 MB. This is not an issue if you already have the repository, but a fresh clone has to download all of it and will take some time.
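To get a feeling for this growth, the following sketch commits a changing 1 MB file a few times and compares the working tree with the object store. It only needs plain Git. The temporary directory, the file name asset.bin and the number of commits are made up for the demo, and random data stands in for a binary that compresses poorly.

```shell
# Sketch: commit a changing 1 MB binary five times and measure the history.
# Directory and file names are placeholders for this demo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

commits=5
i=1
while [ "$i" -le "$commits" ]; do
    # Random data barely compresses, similar to images or archives
    head -c 1048576 /dev/urandom > asset.bin
    git add asset.bin
    git commit -q -m "asset revision $i"
    i=$((i + 1))
done

# The working tree holds one 1 MB file, the object store holds every version
objects_kb=$(du -sk .git/objects | cut -f1)
echo "working tree: $(du -sk asset.bin | cut -f1) KiB, object store: ${objects_kb} KiB"
```

The object store should end up at roughly five times the size of the file, and it keeps growing with every further commit.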
In bigger projects with thousands of commits, pictures, screenshots, PDF documents and much more, this might become a real pain. I have seen repositories with over 4 GB of accumulated assets, but only ~90 MB in the most recent branch.
This can be improved with Git LFS.
Git LFS
Let's have a look at Git LFS and how it helps to store large files, so you don't need to download everything at once, save some bandwidth and handle large files more conveniently.
Requirements
First things first. You need to ensure that your installed Git client supports Git LFS. In Fedora, this is included in the git-lfs package, but not in the default git-core package. Therefore, you need to install it, if you haven't already.
# Fedora/AlmaLinux/CentOS
$ sudo dnf install git-lfs
# Fedora IoT/Silverblue
$ rpm-ostree install git-lfs
Furthermore, your Git server provider must support Git LFS. GitHub, Gitea, Bitbucket, and GitLab can do this, but may have some limitations. Please check the documentation of your provider, to ensure that your use case is supported.
How it works
So, how does Git LFS work? At a high level, this is pretty easy. Large files will be stored in a separate data store, which is linked to your repository. Meaning, all marked files will be located on a dedicated storage volume, and only a reference to them is maintained in the repository itself. Git LFS does this by committing small pointer files in place of the desired files.
Git LFS does this seamlessly, meaning on your local copy, you will see only the actual content. You can use all the known commands like git add, git commit or git pull to interact with your local copy.
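The pointer mentioned above is a tiny text file with three fields: a spec version, the SHA-256 of the real content and its size in bytes. As a rough sketch, it can be rebuilt by hand with standard tools. The file manual.pdf is a placeholder with dummy content, and sha256sum is assumed to be available, as it is on Fedora.

```shell
# Sketch: rebuild by hand what an LFS pointer for a file looks like.
# "manual.pdf" is a placeholder created just for this demo.
workdir=$(mktemp -d)
file="$workdir/manual.pdf"
printf 'hello\n' > "$file"

oid=$(sha256sum "$file" | cut -d' ' -f1)   # SHA-256 of the real content
size=$(wc -c < "$file" | tr -d ' ')        # size of the real content in bytes

# Three lines: spec version, object id, size
pointer=$(printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size")
echo "$pointer"
```

If Git LFS is installed, git lfs pointer --file=manual.pdf should print the same three lines for the same file. This is also why the repository stays small: only these few bytes are committed, no matter how big the real file is.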
How to do it
After getting the requirements done and deciding that you want to use Git LFS, you might wonder: "How can I make it work?" Actually, this is super easy.
Let's assume you have a bunch of PDF files that you want to push to LFS instead of the code repository. First, you need to set up the extension for your repository. This registers the LFS filters in your Git configuration and updates your Git hooks.
$ git lfs install
Updated Git hooks.
Git LFS initialized.
Next, you need to track the desired files and make Git aware that these should be handled as "large files" and therefore pushed to LFS.
$ git lfs track "*.pdf"
Tracking "*.pdf"
Because this pattern contains no slash, it matches files with the .pdf extension in every directory of the repository, just like a .gitignore pattern would. If you prefer to spell out the subdirectories explicitly, this works, too:
$ git lfs track "**/*.pdf"
Tracking "**/*.pdf"
But wait, what does this actually do? Nothing fancy. It just updates your .gitattributes file. Let's take a look.
$ cat .gitattributes
*.pdf filter=lfs diff=lfs merge=lfs -text
**/*.pdf filter=lfs diff=lfs merge=lfs -text
The syntax is pretty similar to a .gitignore file. If you want to dig deeper and find more options for these files, please check out the documentation for gitattributes and gitignore.
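To see which paths a tracking pattern really covers, you can ask Git directly instead of guessing. The sketch below uses git check-attr, which is part of plain Git and does not need Git LFS. The temporary repository and the file names are placeholders, and the .gitattributes line is written by hand here to mirror what git lfs track produces.

```shell
# Sketch: ask Git which filter applies to a given path.
# The repository and the file names are placeholders for this demo.
repo=$(mktemp -d)
cd "$repo"
git init -q
printf '*.pdf filter=lfs diff=lfs merge=lfs -text\n' > .gitattributes

# check-attr only evaluates the patterns, the files do not need to exist
root_attr=$(git check-attr filter -- manual.pdf)
nested_attr=$(git check-attr filter -- docs/guides/manual.pdf)
other_attr=$(git check-attr filter -- notes.txt)

echo "$root_attr"
echo "$nested_attr"
echo "$other_attr"
```

With the single *.pdf line, both PDF paths should report the lfs filter, while notes.txt reports unspecified.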
Now that this is out of the way, you can push and pull your content as usual.
# Track changes
$ git add .
# Commit changes
$ git commit -m "using gitlfs"
# Push changes
$ git push
Pulling, pushing, fetching, merging etc. work exactly like before.
To stop tracking a file or extension, you can either edit the .gitattributes file or untrack the pattern.
$ git lfs untrack "**/*.pdf"
Untracking "**/*.pdf"
And what if you want to fetch all LFS objects from a remote, including those of older commits?
$ git lfs fetch --all
You might also want to get rid of old local copies to keep your disk tidy.
$ git lfs prune
Docs & Links
There is way more to learn about LFS, gitattributes and gitignore. Below, you can find some documentation and articles on these topics. Please also check out the other articles on my blog.
Conclusion
That's already it about tracking large files and making use of Git LFS. It's a nice way to reduce traffic and save some bandwidth without having to give up binary data in your repository.
I would love to know if you use LFS already. Which files do you track or consider tracking? Why do you do so?