There are different ways to categorize the parts of a repository.
In particular, Git offers both bare repositories and repositories without any adjective in front. We could call the latter non-bare repositories, or repositories that include a work-tree, or something along these lines, but Git in general just uses the word repository.
In the case of a non-bare repository, your working tree—which I like to call work-tree, hyphenated, as one word, for short—contains a .git
directory.1 That is, you might do:
cd proj
to enter the work-tree for your project (which you've called proj
), and in proj
there is a .git
directory.
In the case of a bare repository, your repository would typically be named proj.git
. This directory would contain exactly the same set of files that proj/.git
contains in the non-bare case.
In both cases, the files inside the .git
directory—whether that's proj/.git
, or proj.git
—make up the repository proper. The files in your work-tree, if there is a work-tree at all, are there for you to work with.
1In some special cases, it might contain a .git
file, rather than a directory. You might encounter this in the future if you work with Git's submodule system. For now, you probably won't run into this case.
What's in a repository?
There is no harm in looking at the contents of a .git
directory. If you do, you will find these items (and a bunch more) in modern Git. None of these are promised to exist; they might change in the future. But I think it helps to know about them:
A file named config
. This holds the local configuration for the repository. (It is a plain-text file, so you can view it however you like.)
A file named HEAD
. This holds the name of your current branch. You generally should not count on this, because there are conditions in which HEAD
contains a raw hash ID—this is what Git calls a detached HEAD—and there are some ways to use Git where this particular file isn't always relevant.
A directory named objects
. This is where Git stores most of its main database.
A directory named refs
. This is sometimes where Git stores most of its secondary database—but not always.
A file named index
(not always present, but almost always). This holds Git's index or staging area.
From the above list, we can see that a Git repository mainly consists of two databases. One—the bigger one by far in most repositories—is a database of Git objects, which wind up somewhere deep under the objects
directory. Every Git object has a unique hash ID, which is a big ugly string of letters and digits like 083378cc35c4dbcc607e4cdd24a5fca440163d17
. You'll see these in git log
output, for instance.
The most important kind of Git object is the commit. There are three other kinds of objects. You will hardly ever deal directly with any of them, but you should know that one of them is called a tree object, one is an annotated tag object, and the last one is a blob object. Tree objects store file names, while blob objects store file content. Annotated tag objects store the extra information that goes into an annotated tag.
The second database in a Git repository consists of names, which Git calls refs or references, that hold hash IDs. Each name holds exactly one hash ID. The names are branch names, tag names, and all the other kinds of names that you can see or that Git uses internally while working.
So, this gives us a proper description of a repository. It is:
- a database of Git objects, primarily featuring commits, that Git looks up by their hash IDs; plus
- a database of names, such as branch and tag names, that provide hash IDs for Git to look up.
There are a few other smaller items that get added to this, such as the special HEAD
, the index / staging-area, and so on, but those two databases are the bulk of any Git repository. They are also deeply involved in how git fetch
and git push
, which transfer commits and other Git objects between two Gits, work.
The index and your work-tree
Your work-tree, if there is a work-tree at all—if this is a non-bare repository—is mainly just something that Git will fill in from commits, when necessary. It's where you will do your work. The reason you want or need a work-tree is simple: everything in a Git commit is frozen for all time. No part of any Git commit can ever be changed: not by you, and not by Git itself. This makes the commits great for archival, but completely useless for doing actual work.
The frozen (archived) files inside each commit2 are in a compressed, read-only, Git-only format. You presumably want them as files that aren't compressed and Git-only, so Git extracts them into ordinary, everyday computer files, in your work-tree.
When you go to make a new commit, though, Git doesn't use what's in your work-tree. Instead, Git uses copies of files that are in the index.3 The index, or staging area, effectively holds a copy of every file, ready to go into the next commit. An initial git checkout
of some commit fills the index / staging-area from the commit, and also copies out (and un-freezes and uncompresses) the file into your work-tree, so that the current commit, proposed next commit, and work-tree all match.
This is why you must run git add
so often. If you've updated the work-tree copy, you have to have Git re-compress the updated file, storing the frozen-format, ready-for-Git copy in the index (see footnote 3 again) so that the updated version is now proposed for the next commit. That's what git add
does.
When you run git commit
, Git packages up everything that is in the index. The proposed next commit becomes the actual commit. This commit stores the index's copies of the files as the commit's frozen snapshot. The commit also stores some metadata, such as your name and email address and the current date-and-time. Each commit also lists a set of previous commit hash IDs, usually just one, so this new commit that git commit
makes lists the hash ID of the commit you were using, just a moment ago.
The new commit you just made then becomes the current commit. The act of writing out the new commit produces the unique hash ID for the new commit, so Git now stores that hash ID into the current branch name, as recorded in the special HEAD
file. You have now updated both databases: the objects database has a new commit (and perhaps new tree and blob objects too), and the name-to-hash-ID database now records the new hash ID for the current branch.
2Technically, the commit simply lists a tree object, which gives the file's names. That tree object lists more tree objects recursively if/when appropriate, and also lists blob objects, which store the files in their frozen and compressed form. But it suffices to think of the files as being stored, in their frozen form, inside the commits.
3Technically, the index doesn't contain the actual files. It only contains their names plus the hash ID of a Git blob object, plus some caching information to make Git work fast. The cache information is the source of a third name for this one thing. So it's called the index when referring to the actual file .git/index
, or the staging area when referring to how you use it, or sometimes (rarely these days) the cache when referring to the cached information. As before, though, it mostly suffices to think of the index as if it holds all the files.
git fetch
and git push
work with the databases
When you use git fetch
or git push
, you have your Git call up a second Git. Each Git has its own databases: its own collection of Git objects, and its own names. The two operations are slightly different.
When you use git fetch
, your Git calls up their Git. Their Git then lists out their names and the hash ID that goes with each name. Your Git checks your own objects database for each of these hash IDs. You either have the object, by that hash ID, or you don't. If you don't have it, and want it—based on the name they told you—your Git asks their Git to give your Git that object. They will do that, but will also, for a commit, tell you what the hash ID is of its parent commit. You either have that commit already, or you don't. If you don't, your Git will ask their Git to give your Git that commit. That commit has some parents too, and this repeats until they either run out of commits—they're giving you everything they have—or they reach a commit that you already have.
So, when you use git fetch
, you get from them all the (interesting) commits that they have, that you don't. The "interesting" part is based on the names they used for those commits: you can either bring over all their branch names, which is the default, or you can selectively only bring over commits identified by one particular branch name.
In the end, your Git has their commits plus your own commits, and also their name or names. Your Git then creates or updates your own remote-tracking names based on their branch names. For instance, if you ran git fetch origin
—origin
being your name for their Git repository—and they said their master
is 083378cc35c4dbcc607e4cdd24a5fca440163d17
, you now have that commit, so your Git sets your origin/master
to the hash ID 083378cc35c4dbcc607e4cdd24a5fca440163d17
.
In other words, git fetch
gets commits from them and updates your remote-tracking names.4 Your remote-tracking names, like origin/master
and origin/develop
, now hold the hash IDs that their branch names, master
and develop
in this case, hold.
The git push
command is a bit different. Your Git still calls up their Git, but this time, your Git offers them your commits, by hash ID. Your Git will, for instance, say I have commit a123456...
, do you have it? If they don't, your Git gives their Git this commit, and offers its parent commit(s). If they don't have those, your Git gives their Git those commits too. This is basically the same as fetch
, except that you're giving them your commits (that they don't have) instead of them giving you their commits (that you don't have).
At the end of all of this, though, your Git sends their Git a polite request of the form: Now, if it's OK, please set your master
to a123456...
. That is, after you give them your commits that you have that they don't, you ask their Git to set their branch names. You don't ask them to set some other name. This is very different from git fetch
. When you ran git fetch
, you got their commits, but then your Git updated your remote-tracking names. This leaves your branch names alone! But you're asking them to set their branch names.
The fact that you ask them to set branch names makes push trickier than fetch. On the other hand, the fact that they send you commits, and you update your remote-tracking names, rather than your branch names, leaves you with a problem: after git fetch
, you need a second Git command to actually use their commits.
Many people like to hide this second Git command by using git pull
. The git pull
command was originally a simple shell script that just ran git fetch
and then ran git merge
. Git's merge
is a big, complicated command. You won't necessarily always want to use git merge
after git fetch
. You can tell git pull
to use a different second command, but you have to decide in advance, at the time you type in git pull
, which second command to use.
I don't like this. Some of this dislike is just because, in the bad old days of Git version 1.5, git pull
would sometimes wreck everything, and I had that happen to me at least once. But some is because I like to use git fetch
first, then look at what fetch brought in, and only then decide whether I'm going to merge, or rebase, or do something else entirely. I can't do that with git pull
: I have to commit to merge-or-rebase before I see what came in.
In some usage patterns, this early commitment, way before I know what will actually come in, is OK. So I do sometimes (rarely) use git pull
anyway. Mostly, though, I use, and advise others to use, a separate git fetch
followed by whatever second command is appropriate.
4Git calls these remote-tracking branch names. I think the word branch in this phrase is a bad idea—these names are just different enough from real branch names that using that word is misleading—so I leave it out now.