8

For the longest time I thought git commits keep diffs of changed files and not copies. Any information I could find states the contrary. I conducted a little experiment:

$ git init
$ subl wtf

Here I create a file with 99 999 lines, each of which is foo bar baz #line

$ ls -la
total 1760
drwxrwxr-x 3 __user__ __user__    4096 Aug 13 21:02 .
drwxr-xr-x 3 __user__ __user__    4096 Aug 13 19:57 ..
drwxrwxr-x 7 __user__ __user__    4096 Aug 13 21:02 .git
-rw-rw-rw- 1 __user__ __user__ 1788875 Aug 13 21:02 wtf
$ git add --all
$ git commit -m 'Initial commit'
[master (root-commit) 6ef5084] Initial commit
 1 file changed, 99999 insertions(+)
 create mode 100644 wtf
$ subl wtf
$ git diff
diff --git a/wtf b/wtf
index 7ba3acb..bf7a9ed 100644
--- a/wtf
+++ b/wtf
@@ -14156,7 +14156,7 @@ foo bar baz 14155
 foo bar baz 14156
 foo bar baz 14157
 foo bar baz 14158
-foo bar baz 14159
+foo qux baz 14159
 foo bar baz 14160
 foo bar baz 14161
 foo bar baz 14162
$ git add --all
$ git commit -m 'bar -> qux on #14159'
[master 1b5ab4b] bar -> qux on #14159
 1 file changed, 1 insertion(+), 1 deletion(-)
$ subl wtf
$ git diff
diff --git a/wtf b/wtf
index bf7a9ed..1aeeaa3 100644
--- a/wtf
+++ b/wtf
@@ -14156,7 +14156,7 @@ foo bar baz 14155
 foo bar baz 14156
 foo bar baz 14157
 foo bar baz 14158
-foo qux baz 14159
+xyz abc baz 14159
 foo bar baz 14160
 foo bar baz 14161
 foo bar baz 14162
$ git add --all
$ git commit -m 'foo qux -> xyz abc on #14159'
[master 85ccf97] foo qux -> xyz abc on #14159
 1 file changed, 1 insertion(+), 1 deletion(-)
$ ls -la
total 1760
drwxrwxr-x 3 __user__ __user__    4096 Aug 13 21:02 .
drwxr-xr-x 3 __user__ __user__    4096 Aug 13 19:57 ..
drwxrwxr-x 9 __user__ __user__    4096 Aug 13 21:05 .git
-rw-rw-rw- 1 __user__ __user__ 1788875 Aug 13 21:04 wtf

Even commits on different branches with conflicts didn't change the situation.

If git truly keeps copies of all changed files with every commit, how come there was no significant change in space used?

ndnenkov
  • 2
    The book *Pro Git* has an [interesting chapter](http://www.git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain) which explains in great detail how Git stores tracked data. – chepner Aug 13 '15 at 18:48

4 Answers

11

Git has an object database. One type of object, the "blob", is identified by the SHA-1 of its content. This means that if a file with the same content appears anywhere in the repository (any branch, point in history, directory, etc.), it is stored in the database only once.
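A minimal Python sketch of that content addressing, assuming a repository that uses the default SHA-1 object format. The function name `git_blob_id` is made up for illustration; the header layout (`blob <size>` followed by a NUL byte) is what Git prepends before hashing.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Identical content always maps to the same ID, so it is stored only once,
# no matter which branch, commit, or directory it appears in.
same_a = git_blob_id(b"foo bar baz 1\n")
same_b = git_blob_id(b"foo bar baz 1\n")
changed = git_blob_id(b"foo qux baz 1\n")

print(same_a == same_b)   # identical content, identical ID
print(same_a == changed)  # any change in content gives a different ID
```

The result can be compared against `git hash-object <file>` for a file with the same bytes.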

The database has two parts. The first is the objects/??/* files, which are individual ("loose") objects. For example, if you have two versions of a large file that differ in only a single line, they are stored twice, in two different files, each compressed on its own with zlib.
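To get a feel for what two loose objects cost, here is a rough Python sketch that rebuilds the file from the question and compresses each version independently with zlib. It assumes the file has no trailing newline (which is what makes the size come out at the 1788875 bytes shown by `ls`); `make_file` and `loose_object_size` are illustrative names, not Git APIs.

```python
import zlib


def make_file(line_14159: str) -> bytes:
    """Rebuild the 99,999-line test file with a custom line 14159."""
    lines = [f"foo bar baz {n}" for n in range(1, 100_000)]
    lines[14158] = line_14159  # list index 14158 holds line number 14159
    return "\n".join(lines).encode()


def loose_object_size(content: bytes) -> int:
    """Approximate on-disk size of a loose blob: zlib over header + content."""
    header = b"blob %d\x00" % len(content)
    return len(zlib.compress(header + content))


v1 = make_file("foo bar baz 14159")
v2 = make_file("foo qux baz 14159")

# Each version is compressed on its own; nothing is shared between them.
size_v1 = loose_object_size(v1)
size_v2 = loose_object_size(v2)
print(len(v1), size_v1, size_v2, size_v1 + size_v2)
```

The two compressed sizes add up: before any repacking, a one-line change costs roughly a second compressed copy of the whole file.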

Then, if Git decides the objects directory has grown too much, it runs garbage collection. One step of this process is repacking: it creates large pack files in the objects/pack/ folder using a clever delta-compression algorithm. The algorithm does not work on the history of a particular file but across the whole object database, so even completely unrelated files that happen to look similar can be packed as deltas of one another.

So the deltas may be computed differently after each git gc run, taking the latest changes in the history into account.
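The idea of storing one object as a delta against another can be illustrated with a deliberately naive Python sketch. This is not Git's pack delta format or its base-selection heuristics; it is a toy scheme that keeps only the bytes between the common prefix and common suffix of two blobs. `make_delta` and `apply_delta` are hypothetical helpers.

```python
def make_delta(base: bytes, target: bytes):
    """Toy delta: (common prefix length, common suffix length, differing middle)."""
    limit = min(len(base), len(target))
    prefix = 0
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    # The suffix must not overlap the prefix that was already matched.
    while (suffix < limit - prefix
           and base[len(base) - 1 - suffix] == target[len(target) - 1 - suffix]):
        suffix += 1
    return prefix, suffix, target[prefix:len(target) - suffix]


def apply_delta(base: bytes, delta) -> bytes:
    """Rebuild the target from the base object plus the toy delta."""
    prefix, suffix, middle = delta
    return base[:prefix] + middle + base[len(base) - suffix:]


# A smaller stand-in for the file in the question (1,000 lines, not 99,999).
lines = [f"foo bar baz {n}" for n in range(1, 1001)]
base = "\n".join(lines).encode()
lines[499] = "foo qux baz 500"
target = "\n".join(lines).encode()

delta = make_delta(base, target)
print(len(target), len(delta[2]))  # full object size vs. bytes kept in the delta
assert apply_delta(base, delta) == target
```

Only the base has to be stored in full; the second version shrinks to a few bytes plus two offsets, which is the kind of saving a pack file achieves.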

Also, packs versus loose objects are only physical storage details, completely transparent in everyday use of Git. Operations such as log, cherry-pick, and merge work with the full snapshot of a commit. So when you run a diff, Git simply compares two versions of the directories/files on the fly and generates a patch for you.

This approach is quite unusual compared to other VCSs. For example, Mercurial stores immutable delta logs for each file separately, and Subversion stores deltas for the whole repository. That affects how those systems work: physical storage is not abstracted away, which causes some significant limitations, whereas Git allows very flexible workflows and algorithms while keeping the repository very small.

kan
  • So it stores *physical* copies, but might at some point decide to restructure similar objects with *deltas*? Could you expand on that in your answer or provide a link with details? Preferably something simpler than the implementation itself. Also maybe this is not what happens here (at least not how you describe it) because the `objects/pack/` directory is empty right now. – ndnenkov Aug 13 '15 at 18:57
  • @ndn Yes, it restructures during garbage collection. Try running `git gc` and all objects will be packed: the files in `objects/??/` will be deleted and pack files will be created in `objects/pack/`. – kan Aug 13 '15 at 19:04
  • Then shouldn't all commits have *physical* copies of `wtf` at this moment? Why doesn't this take more space? – ndnenkov Aug 13 '15 at 19:06
  • @ndn Commits just refer to the content of wtf by its SHA-1. If the file doesn't change, all commits will hold exactly the same SHA-1 reference. – kan Aug 13 '15 at 19:09
  • That was the point of the question. `wtf` is changed, but there is still no real difference in space taken. How so? – ndnenkov Aug 13 '15 at 19:10
  • @ndn How do you calculate difference in space? – kan Aug 13 '15 at 19:14
  • The result of `ls -la` (seen above). – ndnenkov Aug 13 '15 at 19:16
  • 3
    @ndn `ls -la` doesn't show how much space all the files in a directory take. For a directory it shows the size of the directory entry itself, i.e. the list of file names it holds. 4k is the file system block size and has nothing to do with Git. Use `du -sh .git/objects` to get the total size of all the files. – kan Aug 13 '15 at 19:18
  • 1
    You are completely correct. The whole thing was just a silly mistake on my part. I feel ashamed right now :~ – ndnenkov Aug 13 '15 at 19:23
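The difference between the two measurements can be reproduced with a short Python sketch. It assumes nothing about Git: it creates a throwaway directory holding one file of the same size as `wtf` and compares the size of the directory entry (what `ls -la` prints for a directory) with the sum of the file sizes inside it. Note that this sums apparent file sizes, whereas `du` by default reports allocated disk blocks, so the numbers will be close but not identical. The function names are illustrative.

```python
import os
import tempfile


def directory_entry_bytes(path: str) -> int:
    """Size of the directory entry itself, as shown by `ls -la` for a directory."""
    return os.stat(path).st_size


def total_file_bytes(root: str) -> int:
    """Sum of the sizes of all regular files below root (roughly what du adds up)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "wtf"), "wb") as fh:
        fh.write(b"x" * 1_788_875)  # same size as the file in the question
    entry_size = directory_entry_bytes(repo)
    content_size = total_file_bytes(repo)

print(entry_size, content_size)
```

Pointing `total_file_bytes` at `.git/objects` before and after each commit would show the growth that the `ls -la` output hides.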
2

Every time a file changes, Git stores a new copy of that file in its database. A commit stores a reference to the most recent version of a file tracked by that commit. This means that when a commit is created, it uses the reference stored by its parent for unchanged files, and the reference to the newly added version for changed files.
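A simplified Python model of that sharing, as a sketch only: real commits point to tree objects that in turn list blob IDs, while here each commit is flattened into a plain dict from path to blob ID. `object_db`, `store_blob`, and `commit` are illustrative names.

```python
import hashlib

object_db = {}  # blob ID -> content; stands in for .git/objects


def store_blob(content: bytes) -> str:
    """Store content under its SHA-1 blob ID; identical content is stored once."""
    blob_id = hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()
    object_db[blob_id] = content
    return blob_id


def commit(parent: dict, changes: dict) -> dict:
    """New snapshot: the parent's references plus new blobs for changed files."""
    snapshot = dict(parent)  # unchanged files keep the parent's blob IDs
    for path, content in changes.items():
        snapshot[path] = store_blob(content)
    return snapshot


c1 = commit({}, {"wtf": b"foo bar baz\n", "README": b"hello\n"})
c2 = commit(c1, {"wtf": b"foo qux baz\n"})

print(c1["README"] == c2["README"])  # unchanged file: same blob, no new copy
print(c1["wtf"] == c2["wtf"])        # changed file: a new full blob was stored
print(len(object_db))                # number of distinct blobs stored
```

Both commits are complete snapshots, yet only the changed file adds a new object to the store.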

Periodically (or on demand with, say, git gc), the database is compacted by creating pack files. These typically contain the most recent version of each file in a given set in full, along with "reverse diffs" that can be used to reconstruct older versions as needed.

chepner
  • This is not what happened here (at least the way I understand it) because the `objects/pack` directory is empty right now. – ndnenkov Aug 13 '15 at 19:04
  • Then you haven't had any files packed yet. – chepner Aug 13 '15 at 19:04
  • Then shouldn't all commits have *physical* copies of `wtf` at this moment? Why doesn't this take more space? – ndnenkov Aug 13 '15 at 19:06
  • A new copy of the file is only added if it changes in a particular commit. If a file doesn't change from commit A to commit B, then both commits will reference the same copy of the file. So while a commit is a full snapshot of the entire project, snapshots of individual *files* can be shared between commits. – chepner Aug 13 '15 at 19:11
  • That was the point of the question. `wtf` is changed, but there is still no real difference in space taken. How so? – ndnenkov Aug 13 '15 at 19:12
  • OK, updated to make a distinction between the file stored in the database and the reference to such files stored in the commit objects. – chepner Aug 13 '15 at 19:22
  • @ndn, I assume you are asking why `.git` directory size is still 4096 bytes? Because it's not. `ls -la` doesn't show you the directory size including the sizes of all the files inside. Use `du -hs .git` for it. – Paul Aug 13 '15 at 19:35
  • @Paul, right. *kan* already explained that. The entire question turned out to be just a silly mistake on my part. :~ – ndnenkov Aug 13 '15 at 19:36
1

At least two mechanisms reduce the total storage needed in Git's object database. First, each object is compressed individually. Second, objects are lumped together into object "packs" that relate the objects with deltas, saving even more space for similar objects. There's a chapter on packfiles in Pro Git which is quite illuminating.

Wolf
  • About the first point - I don't believe it is smart enough to reduce the above file to practically nothing. About the second one - so it does *physically* store *diffs* and not *copies*? – ndnenkov Aug 13 '15 at 18:40
  • Yes, it physically stores (some) diffs (where deltas are possible), but they aren't anything like the diffs that, e.g., Mercurial, CVS, or Subversion store. They're not text diffs. They're not related to anything you get out of the `git diff` command. – Wolf Aug 13 '15 at 18:42
  • This is very interesting. I really want to understand more about how git decides if it should store a *delta* or a full *copy*. Could you expand on that in your answer or provide a link with details? Preferably something simpler than the implementation itself. – ndnenkov Aug 13 '15 at 18:47
  • This is not what happened here (at least the way I understand it) because the `objects/pack` directory is empty right now. – ndnenkov Aug 13 '15 at 19:04
0

Git logically stores the distinct set of all file contents in the history. This means that if one character is changed in a 10 MB file, the old and new contents are two separate objects with two different object IDs. However, there is a lot of optimization under the hood to make sure that similar objects are stored as deltas.

Joseph K. Strauss
  • So it does *physically* store *diffs* and not *copies* then? – ndnenkov Aug 13 '15 at 18:35
  • If that saves space. Other VCSs always store diffs, even for a complete rewrite of a file. Git only uses deltas where the content is similar enough to warrant it. – Joseph K. Strauss Aug 13 '15 at 18:42
  • This is very interesting. I really want to understand more about how git decides if it should store a *delta* or a full *copy*. Could you expand on that in your answer or provide a link with details? Preferably something simpler than the implementation itself. – ndnenkov Aug 13 '15 at 18:47
  • I guess you could look at the source code. I never had the desire to know _exactly_ how it determines space-savings, trusting that whoever wrote the optimization knew what he/she was doing. I believe that when the costs of maintaining the delta are more than just starting with a new delta base, it just ignores the delta and starts with a new object. See the documentation for [git cat-file](http://git-scm.com/docs/git-cat-file) to see how you can determine the space savings on an object-by-object basis. – Joseph K. Strauss Aug 17 '15 at 19:57