Is Git forking a recipe for unrecoverable disaster?

Question

Using GIT seems to be like playing a new form of chess where nobody really knows the rules. Here is what I did, please let me know why this went wrong so quickly.

I have a main fork/repo and my personal fork.

git clone https://github.com/myfork/project.git
cd project
git submodule update
git checkout -b newbranch
git remote add upstream https://github.com/CompanyX/project.git
git fetch upstream  # <- hoping to update my fork
git pull upstream mainbranch # want the latest stuff from upstream
# Result: 100's of modified files, deleted files, and even merge conflicts... unrecoverable.

I've seen this a couple times now. If I create a branch in my fork before updating my fork from the main fork, the difference between my fork and the main fork will result in merge conflicts galore that in the end cannot be resolved.

Am I doing something wrong or is forking just hazardous?

What exactly is a fork, if it is not a duplicate of something? Is a fork yet another misnamed term in git?

I feel like I'm following the rules and learning the nitpicky details but things still go wrong. — , Aug 02 '19 at 00:36
There are some contextual things that are not clear yet, like whether "mainbranch" is the name of an actual branch in your fork or a reference to the main branch, usually named "master". You probably meant to do `git pull upstream` without "mainbranch". The question of creating the branch *before* vs. *after* updating from the "upstream" remote is big: because you did `checkout -b`, you switched to that new branch, which means you're pulling into different branches in those two cases. — madreflection, Aug 02 '19 at 01:15

score 6 · Answer 1 · answered Aug 02 '19 at 03:39

TL;DR

I think what you want, as a canned recipe here, is:

git clone https://github.com/myfork/project.git
cd project
git remote add upstream https://github.com/CompanyX/project.git
git fetch upstream
git push --force origin upstream/master:master

but that last command is very suspect. Even if it's right, you need to do more. To know for sure which command you really should use, you will need to know a lot more. So ... here goes.
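Before running a forced push like the one above, it may help to see what it would throw away. The following is a minimal sketch, not part of the original recipe: it builds throwaway local repositories (the directory names company, fork, and project are made up for illustration) that stand in for CompanyX/project, the GitHub fork, and your clone, then counts the commits that exist on only one side.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Local stand-ins for CompanyX/project ("company") and the GitHub fork ("fork").
git init -q company
git -C company symbolic-ref HEAD refs/heads/master
git -C company commit -q --allow-empty -m "H"
git clone -q company fork
git -C fork commit -q --allow-empty -m "J (exists only in the fork)"
git -C company commit -q --allow-empty -m "L (exists only upstream)"

# The clone on your machine, with both remotes configured.
git clone -q fork project
cd project
git remote add upstream "$work/company"
git fetch -q upstream

# Count commits reachable from only one side: <fork-only> <upstream-only>.
counts=$(git rev-list --left-right --count origin/master...upstream/master)
set -- $counts
fork_only=$1
upstream_only=$2
echo "fork-only commits: $fork_only, upstream-only commits: $upstream_only"

# List what a forced push would discard from the fork's master.
would_lose=$(git log --format=%s upstream/master..origin/master)
echo "would be discarded: $would_lose"
```

If the fork-only count is zero, the forced push discards nothing, and an ordinary push would probably have worked too.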

Long

Am I doing something wrong ...

Probably. :-) Well certainly, since you're not getting the results you want. See madreflection's comment to get started. There is a whole lot of stuff you need to know / learn, to use this effectively. There are tons of Git tutorials out there, plus several books, but many are not very good. One free one, Pro Git, is pretty good and tends to be not too far out of date. But Git is constantly evolving: good documentation from 2010, five years after Git's official debut, is bad documentation today.

One of the biggest things to learn—or maybe unlearn—is what exactly a branch is, in Git, and what it isn't. If you have used other version control systems (or source management systems or whatever phrase you like here), you probably have a lot of expectations about how branches should work. Git won't meet most of those expectations. Other VCSes are often sure that branches, or more precisely, branch names, are the be-all and end-all of version control. Git thinks branch names are mostly just decorative fluff.

The other really big thing to deal with is that Git is truly distributed. Many VCSes are centralized: there's the One True repository, and what it says, goes. If you have a copy, well, your copy is a mere copy. They determine reality and you just have to submit to them. Git disagrees. Each repository is its own king of its own domain. Your master is as good as, or really better than, his master and her master and every other master. (Of course, her Git thinks her master is better than your master.) The central master? There is no such thing, that's just another Git repository with another master.

In any case, the term branch is actually ambiguous in Git. Sometimes we mean one thing, and sometimes we mean another thing. I like to be precise and say branch name when I mean, specifically, a text string like master. In general, though, when someone says a branch, you have to ferret out somehow whether they mean a branch name, or some sequence of commits, ending at one particular commit whose hash ID is _____ (fill in the blank), or even one of the other things that people loosely call a branch, such as a remote-tracking name like origin/master (see below).

What exactly is a fork, if it is not a duplicate of something? Is a fork yet another misnamed term in git?

Git—by which I mean the Git suite of programs—doesn't have forks. Forking is not a Git concept. It's something that web hosting services, such as GitHub and Bitbucket, add on to what Git does have. Still, the question what exactly is a fork is a good one. The problem is that to get an exact answer, you'll have to say which web hosting site you are using. In this case, it's github.com, so we can define fork using GitHub's definition. (Bitbucket's is remarkably similar, though.)

On GitHub, a fork is a clone with a couple of added properties:

In a typical clone, you start with no branches, and then Git immediately adds one branch. In a GitHub fork-clone, you start with N branches, where N is the number of branches that exist in the repository you are cloning.
In a typical clone, your Git creates some remote-tracking names. For instance, if you clone http://github.com/git/git/ today, you'll get five remote-tracking names.¹ If you use the defaults, these five remote-tracking names will be named origin/maint, origin/master, origin/next, origin/pu, and origin/todo. In your GitHub fork, these do not exist—they're replaced with the branch names. For instance, if you fork that repository, you'll get branch names maint, master, next, pu, and todo: the same five names, but without the origin/. The five names would be branch names rather than remote-tracking names.
Finally, and most important, GitHub will remember that your fork was cloned from whatever repository it was cloned from. That enables GitHub to offer you a bunch of GitHub-provided features: GitHub's thing that they call pull requests, GitHub's ability to create issues on an "issues" page, and so on. None of these are actual Git things. They're there as a sort of added value, to make you want to use GitHub and (GitHub's owners hope at least) eventually pay good money to GitHub. :-)

¹I'm omitting a sixth name you'll see—origin/HEAD—as it's a symbolic reference, which makes it very slightly different from a regular remote-tracking name. This origin/HEAD thing is for a Git feature that, in my opinion, has no actual usefulness, and can therefore be ignored.

Some specific Git commands

git clone https://github.com/myfork/project.git
cd project

At this point, you have made a clone—a Git repository—on your machine, filled in by copying, from a Git repository over on GitHub. Based on the name myfork/project.git, presumably that Git repository over on GitHub is itself a copy of yet another Git repository. You're now in this clone.

There's a particular bit of weirdness about git clone, though. It consists of, in essence, the following sequence of commands (but with error checking, and without affecting where your shell is, so that you have to cd project yourself too):

mkdir project
cd project
git init
git remote add origin https://github.com/myfork/project.git
git fetch
git checkout <some-branch-name>

The init step makes a totally empty repository: no branch names, no commits, nothing. The git remote add step sets up the name origin to refer to the URL, and the git fetch step calls up the other Git at the URL and fills in your repository. This obtains all of their commits, by hash IDs. It gets from them a list of all of their branch names, but then instead of creating actual branch names in your repository, creates the origin/* remote-tracking names.² So at this point, you have no branch names, and there is no name for git checkout to check out.
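The state just described can be reproduced in a sandbox. This is a sketch under assumptions: a local directory named hosted (a made-up name) plays the role of the repository on GitHub, and the first three steps of git clone are run by hand so that the result can be inspected before any checkout happens.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the repository hosted on GitHub, with two branch names.
git init -q hosted
git -C hosted symbolic-ref HEAD refs/heads/master
git -C hosted commit -q --allow-empty -m "first"
git -C hosted branch dev

# The first steps of "git clone", done by hand.
mkdir project
cd project
git init -q
git remote add origin "$work/hosted"
git fetch -q origin

# No local branch names exist yet; only remote-tracking names do.
local_names=$(git for-each-ref --format='%(refname:short)' refs/heads)
remote_names=$(git for-each-ref --format='%(refname:short)' refs/remotes | tr '\n' ' ')
echo "local branch names: [$local_names]"
echo "remote-tracking names: $remote_names"
```

Depending on the Git version, the second list may also include an origin/HEAD entry.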

Nonetheless, your Git then combs through the branch names they sent. They typically recommend one particular one, usually master. If you don't tell git clone otherwise, your Git takes their Git's recommendation here. I'm going to assume the recommended name was master. So your Git then runs:

git checkout master

even though you don't have a master.

This git checkout creates your name master. It does so by finding your origin/master—the name your Git created during git fetch, that remembers for your Git, where their master was. So now you do have a master name, identifying one particular commit. Moreover, your implied git checkout does three more things:

It makes your Git's name HEAD refer to master.
It copies the contents of the tip commit (see below) to the index (which we won't define here).
It copies the contents of the index, just filled in, to the work-tree: the place where you can see and work with your files.
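The effect of that implied checkout can be checked after an ordinary git clone. This is a small sketch using a throwaway local repository (hosted is a made-up directory name) in place of the GitHub one; it only inspects what the clone left behind.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the repository being cloned.
git init -q hosted
git -C hosted symbolic-ref HEAD refs/heads/master
echo "hello" > hosted/README
git -C hosted add README
git -C hosted commit -q -m "first"

git clone -q hosted project
cd project

head_ref=$(git symbolic-ref HEAD)            # the name HEAD refers to
master_tip=$(git rev-parse master)           # commit that the new name master identifies
tracked_tip=$(git rev-parse origin/master)   # where their master was at clone time
status=$(git status --porcelain)             # empty when index and work-tree match the commit
echo "HEAD -> $head_ref"
echo "master:        $master_tip"
echo "origin/master: $tracked_tip"
```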

Every branch name³ always identifies one commit, by the commit's raw hash ID. Git calls this commit the tip commit of the branch. To see how these work we'll have to look closely at commits, but before we do that, let's go on to the rest of these commands:

git submodule update

This is a complicated one. If at all possible, let's just skip it. (It muddies the waters because a submodule is just another Git repository. If you thought three repositories were bad, counting the original, your GitHub fork, and your clone, you've now brought at least two more into the picture. If we are lucky, those extra repositories play no part in the next few steps, so we can ignore them.)

git checkout -b newbranch

This asks your Git to create another name, newbranch, that identifies the same commit that your Git checked out because of the last step of git clone. (You could have picked some other commit here, but you didn't, so it used the current commit—the one your Git checked out earlier. Your Git didn't have to switch commits, so it left everything else alone: your index and work-tree still match the tip commit of master. There's no obvious right commit to use so that's OK, but there's also no obvious reason to create a new name here.)

What this all means is that all the commits that are on master are now also on newbranch. The tip commit of master is also the tip commit of newbranch.
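That two names can identify one commit is easy to confirm. A minimal sketch, again with a made-up local repository named hosted standing in for the fork on GitHub:

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q hosted
git -C hosted symbolic-ref HEAD refs/heads/master
git -C hosted commit -q --allow-empty -m "first"
git clone -q hosted project
cd project

git checkout -q -b newbranch

current=$(git symbolic-ref --short HEAD)   # the name HEAD is attached to now
master_tip=$(git rev-parse master)
newbranch_tip=$(git rev-parse newbranch)
echo "on: $current"
echo "master:    $master_tip"
echo "newbranch: $newbranch_tip"
```

Both names print the same hash ID. Only the name that HEAD is attached to has changed.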

git remote add upstream https://github.com/CompanyX/project.git

This one is pretty simple: it creates another name, upstream, to hold the URL, and sets up a standard refspec for git fetch upstream. Usually we don't have to care about the innards of the refspec: we can just think of it as take all of their branch names and rename them to make our upstream/* remote-tracking names.

git fetch upstream  # <- hoping to update my fork

This has your Git call up the Git at the URL you just set, and converse with that Git. Your Git asks their Git: What commits do you have (by their hash IDs)? What branch names do you have? Your Git compares their commit hash IDs with the ones that your Git already has, thanks to the earlier git fetch origin. For any commit hash ID they have that you don't, your Git gets those commits from this other Git.

So now, after getting all the commits they had that you didn't—and keeping the ones you already have, that they had, plus any that you have, that they don't—your Git is, if anything, better than theirs (just as your Git always believes :-) ). Meanwhile it takes their branch names—and tip commit IDs—and stuffs those into your own remote-tracking names, under upstream/*.

So far, everything is actually OK. There's only one real oddity at this point. You created a (local) branch name newbranch that identifies the same commit as your (local) branch name master, and then switched to that name. But now things go very wrong...

²Git actually calls these remote-tracking branch names. I decided that this makes it too easy to leave out the word names: you end up with remote-tracking branches. That makes them sound like they work the same as (local) branches, which is just false enough to be a problem. Dropping the word branch, and calling them remote-tracking names, fixes the problem.

In some ways, what you call them doesn't matter. You could call them Freds or Barneys (or Wilmas, etc.). What they do is remember what your Git saw on the other Git—the one over on origin, for the origin/* names—the last time your Git called up their Git and talked with it about commits and branch names. That's the important part: your remote-tracking names keep track of some other Git's branch names. But they get out of date, because your Git does not assume that you are on line to the net 100% of the time. Your Git only updates your remote-tracking names during a git fetch or git push, while your Git is actively talking to their Git.
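The way remote-tracking names go stale, and get refreshed only by talking to the other Git, can be seen in a sandbox. This is a sketch with a made-up local repository named hosted in place of origin:

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q hosted
git -C hosted symbolic-ref HEAD refs/heads/master
git -C hosted commit -q --allow-empty -m "first"
git clone -q hosted project

before=$(git -C project rev-parse origin/master)

# Someone adds a commit over at origin; our clone has not been told.
git -C hosted commit -q --allow-empty -m "second"
stale=$(git -C project rev-parse origin/master)

# Only a fetch (or push) refreshes our memory of their branch names.
git -C project fetch -q origin
fresh=$(git -C project rev-parse origin/master)
hosted_tip=$(git -C hosted rev-parse master)

echo "before: $before"
echo "stale:  $stale"
echo "fresh:  $fresh"
```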

³Again, I'm ignoring the special case of symbolic references. Someday Git might have symbolic references that work right. For now, only HEAD really works right. In what's probably an outright bug, if you create a regular branch name that is a symbolic reference, then ask Git to delete that name, Git instead deletes the target of the symbolic reference. Yikes!

pull = fetch + merge, and I don't think you want to merge

I always recommend that Git newbies avoid git pull, because its syntax is weird (it leads people down the garden path) and because when something does go wrong, it leaves you at sea about what to do. Fundamentally, though, git pull just means run two Git commands. The first of these two Git commands is git fetch:

git pull upstream mainbranch

starts by running git fetch upstream. You already did that, so you don't need to do it again, unless the Git over at upstream is so active that it's changed since you last ran git fetch.⁴

Having run that first Git command, git pull then runs a second Git command. By default—and apparently in your case too—this is git merge. What git merge does is fairly complicated, when you get into all the weird special corner cases, but in general the idea is simple: Merge is about combining changes.

The problem here is that Git doesn't store changes. Git stores snapshots. It's time now to dive into commits.
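To illustrate the two-step view of git pull, here is a sketch that runs it both ways in a sandbox. The assumptions are mine: the branch is called master rather than mainbranch, the directory names company, via_pull, and via_fetch_merge are made up, and the histories have not diverged, so the merge step is a simple fast-forward.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the upstream repository, cloned twice with the remote named upstream.
git init -q company
git -C company symbolic-ref HEAD refs/heads/master
git -C company commit -q --allow-empty -m "first"
git clone -q -o upstream company via_pull
git clone -q -o upstream company via_fetch_merge
git -C company commit -q --allow-empty -m "second"

# One command...
git -C via_pull pull -q upstream master

# ...or the two steps run separately, with a look at what is coming in between.
git -C via_fetch_merge fetch -q upstream
incoming=$(git -C via_fetch_merge log --format=%s master..upstream/master)
echo "about to merge: $incoming"
git -C via_fetch_merge merge -q upstream/master

pull_tip=$(git -C via_pull rev-parse master)
two_step_tip=$(git -C via_fetch_merge rev-parse master)
echo "pull:          $pull_tip"
echo "fetch + merge: $two_step_tip"
```

The advantage of the two-step form is the pause in the middle: if the incoming list looks alarming, you can stop before anything is merged.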

⁴This does happen sometimes! If you did a very long slow git fetch to an active repository, another git fetch might pick up something new.

Commits are snapshots plus metadata

Each commit stores a full, complete snapshot of all of your files (actually copied from the index, but I promised not to go into that much detail here). These are not changes! They're just copies.⁵ Each commit is, as we already noted in passing, identified by a unique hash ID. The git log command, for instance, prints out these hash IDs. They're not very useful to humans, but they are literally the key for Git: Git stores most of its internal data as objects, which go into a key-value database. The hash ID is the key; the value is some content, such as the stuff Git needs to know to reconstruct a commit in your work-tree later.

One of the items in each commit is a parent hash ID. Technically, it's zero-or-more parent hash IDs, but most commits have exactly one. This hash ID is the ID of the commit that comes before the commit Git is looking at. Git calls that the parent of the commit.

If we use single uppercase letters to stand in for actual hash IDs, we can draw this. Suppose we have a simple string of commits, with earlier ones on the left and later ones on the right:

... <-F <-G <-H ...

Commit H has G listed as its parent. So if Git can somehow find the hash ID for H, Git can extract H from its all-objects database. In there, Git will find the hash ID for commit G. Git can use that to extract G, and in there, Git will find the hash ID for F.

What this means is that Git only needs to know the hash ID of the last commit in the chain. Let's say H really is the last one:

...--F--G--H   <-- master

The name master holds the raw hash ID of commit H, which lets Git find H in its database. From there, Git can work backwards, to G, then F, and so on. Eventually, Git will reach a commit with no parent, which is where the chain ends—or starts, depending on how you look at it.
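The backwards walk can be watched directly. In this sketch the commit messages F, G, and H are just labels chosen to match the drawing; the real hash IDs will differ on every run.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master
for name in F G H; do
  git commit -q --allow-empty -m "$name"
done

# The name master yields H's hash ID; H's stored metadata yields G's.
tip=$(git rev-parse master)
parent_line=$(git cat-file -p "$tip" | sed -n 's/^parent //p')
echo "H is $tip"
echo "H's parent (G) is $parent_line"

# Walk the whole chain backwards from the tip, as Git does.
chain=$(git log --format=%s master | tr '\n' ' ')
echo "chain from the tip: $chain"

# The walk stops at the commit that has no parent.
root=$(git rev-list --max-parents=0 master)
root_subject=$(git log -1 --format=%s "$root")
echo "chain ends at: $root_subject"
```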

These backwards chains of commits, found by some branch name, are Git's branches—that is, the other meaning of branch: a series of commits, ending at one particular commit that we select. Usually we select it by a branch name, but we can make up unnamed branches by picking a hash ID (perhaps from git log output) and using that. Whatever hash ID we pick, that's the last one in a chain. The chain itself is formed by the parent hash IDs, all stored, frozen forever, inside the commits themselves.

To make a new commit, we have Git check out the currently-last one—commit H for instance—into our index and work-tree. Then we do stuff, and eventually, git commit to make a new commit. Git assigns us a new, random-looking hash ID for a new commit that stores H's hash ID, plus a snapshot of our source, then writes the new hash ID into the name master:

...--F--G--H--I   <-- master

and we've grown our branch.

⁵Underneath, in pack files which hold multiple objects all at once, Git does use delta-compression, in a sneaky and clever way. But at the level at which Git deals with files, they're all full snapshots.

Merges

What git merge does is to take our own backwards-looking chain:

          I--J   <-- master
         /
...--G--H
         \

and ours or someone else's work, on another backwards-looking chain:

          I--J   <-- master
         /
...--G--H
         \
          K--L   <-- whatever

It then finds the best shared / common commit, which in this case is H because that's where the two branches join up in the past. This common commit is the merge base of the two branches. Git then uses git diff --find-renames twice: once to compare H to our latest work in J, and then again to compare H to their latest work in L.
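The merge base and the two diffs can be inspected by hand before any merge is attempted. This sketch builds a shortened version of the drawing (H, then J on master and L on whatever); the file names are made up for illustration.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master

commit_file() {  # $1 = file name, $2 = commit message
  echo "content of $1" > "$1"
  git add "$1"
  git commit -q -m "$2"
}

commit_file base.txt H
H=$(git rev-parse HEAD)
git branch whatever          # both names now identify H
commit_file ours.txt J       # master moves on
git checkout -q whatever
commit_file theirs.txt L     # whatever moves on separately

# The shared starting point, and what each side changed since then.
base=$(git merge-base master whatever)
ours_changed=$(git diff --find-renames --name-only "$base" master)
theirs_changed=$(git diff --find-renames --name-only "$base" whatever)
echo "merge base: $base"
echo "we changed:   $ours_changed"
echo "they changed: $theirs_changed"
```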

The merge process then combines—or fails to combine—these two sets of changes, applying them to the snapshot in H. If Git can combine everything on its own, git merge goes on to make a merge commit, which is special in only one way: it has two parents. The snapshot for the merge is H-plus-the-combined-changes. This keeps our changes while also adding their changes:

          I--J
         /    \
...--G--H      M   <-- master
         \    /
          K--L   <-- whatever

so that the difference from J to M is basically what merge added from them. By the same token, the difference from L to M is basically what merge added from us. In any case, having made this two-parent commit—the first parent is J because that was our previous one, and the second is L because that's the one we merged—Git updates the name master, because that's the branch we ran git checkout on earlier.
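The two-parent structure of a merge commit can be verified in the same kind of sandbox. This is a sketch with made-up file names and a shortened history (H, then J on master and L on whatever):

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master

commit_file() {  # $1 = file name, $2 = commit message
  echo "content of $1" > "$1"
  git add "$1"
  git commit -q -m "$2"
}

commit_file base.txt H
git branch whatever
commit_file ours.txt J
git checkout -q whatever
commit_file theirs.txt L
J=$(git rev-parse master)
L=$(git rev-parse whatever)

# Merge their chain into ours; the result is commit M on master.
git checkout -q master
git merge -q -m "M" whatever

first_parent=$(git rev-parse HEAD^1)
second_parent=$(git rev-parse HEAD^2)
from_them=$(git diff --name-only "$J" HEAD)   # what the merge added from them
from_us=$(git diff --name-only "$L" HEAD)     # what the merge added from us
echo "parents of M: $first_parent $second_parent"
echo "J to M adds: $from_them"
echo "L to M adds: $from_us"
```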

This process works regardless of whether we have Git locate commit L through one of our names, like develop or feature, or through one of our remote-tracking names like origin/master or upstream/feature or whatever. The key is not the branch name but rather the commit. We used the name to supply the hash ID. This leads to a perhaps-surprising trick that is the key that makes distributed Git repositories work: Every Git in the universe will compute the same hash ID for a commit that is exactly, totally, 100% the same as the one we are looking at / making right now.⁶

In the end, it's really the commit hash IDs that matter. The branch names, or remote-tracking names, or whatever other names we might use to find the commits, are meant for us humans. They're a good idea, to be sure. But they're not important to Git itself. The exception to this rule is when using push and fetch.

⁶Making all that work is partly relatively easy (Git uses a cryptographic hash so that no one can spoof a commit) and partly hard: the contents of our commit need to be unique, and different if we make another commit using the same snapshot. To that end, the commits already include the history, via the hash ID of the parent, but also a date-and-time stamp. If we make otherwise-identical commits, but it takes us a few seconds, we get a different time-stamp, so that the commits are different, and get different hash IDs. You can use the computer to make identical commits very quickly, and then you really do get the same hash ID, but these two commits must necessarily use the same parent too, and same author and everything, so it all works out OK in the end anyway.
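The footnote's claim can be tested by pinning the time-stamps. This sketch assumes that no commit signing is configured; the repository names a, b, and c and the time-stamp values are arbitrary choices for illustration.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

make_commit() {  # $1 = new repository directory, $2 = Unix time-stamp to record
  git init -q "$1"
  echo "same content" > "$1/file.txt"
  git -C "$1" add file.txt
  GIT_AUTHOR_DATE="$2 +0000" GIT_COMMITTER_DATE="$2 +0000" \
    git -C "$1" commit -q -m "same message"
  git -C "$1" rev-parse HEAD
}

# Same snapshot, message, author, (absent) parent, and time-stamp in two repositories.
a=$(make_commit a 1564717140)
b=$(make_commit b 1564717140)
# Same everything except the time-stamp, which is one second later.
c=$(make_commit c 1564717141)

echo "a: $a"
echo "b: $b"
echo "c: $c"
```

Repositories a and b never communicated, yet they agree on the hash ID; c differs only in its time-stamp and gets a different one.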

`push` and `fetch` really do require names, but aren't quite opposites

When you use git fetch, you have your Git call up some other Git. The other Git lists its branch names (and tag names) and their commit hashes. Your Git gets any commits it needs, and updates your remote-tracking names. Hence their master has no effect on your master: your Git only updates origin/master or upstream/master.

When you use git push, it works pretty similarly: you have your Git dial up another Git. The other Git lists its branch names (not really useful except for matching mode pushes, which were the default long ago but aren't any more), but now instead of getting commits from them, you—or your Git—give commits to them: any that you have, that they will need and don't have yet. Then your Git asks them, politely: If you don't mind, would you please set your branch name _____ to ________? Fill in the blanks: the first one is a branch name and the second is a commit hash ID.

Note that you don't ask them to set a remote-tracking name, or anything like that. You ask them, instead, if they're willing to change their branch name. That's their branch name! If they're doing any work on it, and they made new commits, this could lose their commits. Suppose, for instance, that they have their master pointing to commit L, and you have yours pointing to J, in:

          I--J   <-- master
         /
...--G--H
         \
          K--L   <-- origin/master

You give them your I-J, where the parent of I is H. Then you ask them to set their master to remember J. If they do it, they will lose the ability to find commits L and K, because their Git will start at the new tip, J, and work backwards, finding only J, then I, then H, and so on.

They will in general refuse this request. You should, in this case, now run git merge—the second half of git pull—so that you can make merge commit M:

          I--J
         /    \
...--G--H      M   <-- master
         \    /
          K--L   <-- origin/master

Now you can send them I-J-M, where M reaches back to both J and L, and ask them to set their master—your origin/master—to point to M. If they accept—and they probably will this time—your Git will know that their master now points to the now-shared M and will update your own origin/master:

          I--J
         /    \
...--G--H      M   <-- master, origin/master
         \    /
          K--L

Note that this all still works with three, or more, repositories involved. The only change is which names you have: instead of just master and origin/master, you may also end up with upstream/master.
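The refuse-then-merge-then-accept sequence can be replayed locally. This is a sketch under assumptions: a bare repository named origin.git plays the other Git, the clones mine and theirs are made-up names, and the history is shortened to H, J, and L.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

commit_file() {  # $1 = repository, $2 = file name, $3 = commit message
  echo "content of $2" > "$1/$2"
  git -C "$1" add "$2"
  git -C "$1" commit -q -m "$3"
}

# Bare repository standing in for the one at origin, seeded with commit H.
git init -q --bare origin.git
git -C origin.git symbolic-ref HEAD refs/heads/master
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
commit_file seed base.txt H
git -C seed push -q "$work/origin.git" master
git clone -q origin.git mine
git clone -q origin.git theirs

# They publish L first; we commit J without knowing about it.
commit_file theirs theirs.txt L
git -C theirs push -q origin master
commit_file mine ours.txt J

cd mine
if git push -q origin master 2>/dev/null; then first_push=accepted; else first_push=rejected; fi
echo "push of J alone: $first_push"

# Fetch their work, merge it to make M, and ask again.
git fetch -q origin
git merge -q -m "M" origin/master
if git push -q origin master 2>/dev/null; then second_push=accepted; else second_push=rejected; fi
echo "push after merging: $second_push"

local_tip=$(git rev-parse master)
tracked_tip=$(git rev-parse origin/master)
origin_tip=$(git -C "$work/origin.git" rev-parse master)
```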

But: what if you want them to forget their commits? Suppose you have:

          I--J   <-- master, origin/master
         /
...--G--H
         \
          K--L   <-- upstream/master

Here, you might want to throw out your I-J entirely (which you can do later) and tell origin to set their master to match your upstream/master: commit L. They would normally refuse, so instead of git push, you can use git push --force:

git push --force origin upstream/master:master

This uses upstream/master as the way to find commit L in your repository. Your Git then calls up the Git at origin, sends them any commits they need (probably K-L here), and commands them, because of --force, to set their master to point to L.

Whether this works is up to whoever owns the Git repository at origin: they set the rules, and GitHub provides ways to protect branch names so as to disallow force pushing, or any pushing at all, if you like. Assuming it does work, your Git now updates your origin/master to remember that they said OK, I obey your command. So now you have:

          I--J   <-- master
         /
...--G--H
         \
          K--L   <-- origin/master, upstream/master

This works even if they (upstream/master) are hundreds of commits ahead, or wildly divergent, or whatever. You just use your upstream/master to command your GitHub fork at origin to set its master—your origin/master—to the commit you want it to use.
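Here is the forced push played out in a sandbox, so that its effect on each name is visible. It is a sketch only: the bare repositories upstream.git and fork.git are local stand-ins for the two GitHub repositories, the directory names are made up, and no branch protection is in place.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

commit_file() {  # $1 = repository, $2 = file name, $3 = commit message
  echo "content of $2" > "$1/$2"
  git -C "$1" add "$2"
  git -C "$1" commit -q -m "$3"
}

# Both hosted repositories start out with the same commit H.
for name in upstream fork; do
  git init -q --bare "$name.git"
  git -C "$name.git" symbolic-ref HEAD refs/heads/master
done
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
commit_file seed base.txt H
git -C seed push -q "$work/upstream.git" master
git -C seed push -q "$work/fork.git" master

# The company moves on to L; the fork gains a different commit J.
git clone -q upstream.git company
commit_file company theirs.txt L
git -C company push -q origin master
git clone -q fork.git project
commit_file project ours.txt J
git -C project push -q origin master

cd project
git remote add upstream "$work/upstream.git"
git fetch -q upstream
git push -q --force origin upstream/master:master

fork_tip=$(git -C "$work/fork.git" rev-parse master)
tracked_tip=$(git rev-parse origin/master)
upstream_tip=$(git rev-parse upstream/master)
local_tip=$(git rev-parse master)   # the local branch name is untouched
echo "fork master:     $fork_tip"
echo "origin/master:   $tracked_tip"
echo "upstream/master: $upstream_tip"
echo "local master:    $local_tip"
```

The first three hash IDs match; the local master still identifies J.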

But now you need to update your own repository. Here, git reset --hard may be the right answer. Or maybe not: maybe you want to save any commits you have that neither of the other two Gits have, rebasing them onto your now-updated origin/master. Exactly what you need and want to do here with Git depends on what result you want. But at least, at this point, origin/master and upstream/master (in your own Git) match, and identify the commit whose hash ID every Git in the universe agrees on.
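Both options can be tried safely in a sandbox first. In this sketch the divergence is produced by an ordinary fetch rather than a forced push, and hosted and keep-my-work are made-up names. It lists the local-only commits, saves them under a second name, resets master, and then rebases the saved work.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

commit_file() {  # $1 = repository, $2 = file name, $3 = commit message
  echo "content of $2" > "$1/$2"
  git -C "$1" add "$2"
  git -C "$1" commit -q -m "$3"
}

git init -q hosted
git -C hosted symbolic-ref HEAD refs/heads/master
commit_file hosted base.txt H
git clone -q hosted project
commit_file project ours.txt J     # local-only work
commit_file hosted theirs.txt L    # origin's master now points elsewhere
cd project
git fetch -q origin

# What does master have that origin/master lacks?
local_only=$(git log --format=%s origin/master..master)
echo "local-only commits: $local_only"

# Keep that work under another name, then make master match origin/master.
git branch keep-my-work
git reset -q --hard origin/master

# Optionally replay the saved commits on top of the updated origin/master.
git checkout -q keep-my-work
git rebase -q origin/master
rebased=$(git log --format=%s origin/master..keep-my-work)
echo "replayed on top of origin/master: $rebased"
```

Creating the extra name before the reset is what makes the experiment reversible: the commits stay easy to find whichever way you decide.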