Scenario: Master repository with 100+ developers working off it

Does having 100+ developers each fork a parent repo have a significant impact on GitHub storage space, or is it a valid strategy for each developer to have their own fork and then make PRs to the parent repo?

I looked through several other threads that seemed relevant to this question, but only found that forks share objects to minimize storage usage. I was not able to figure out how large the impact is at scale (hundreds of forks), or whether this would take up a significant share of the available storage.

isherwood
Julian
  • _Forking_, or cloning? A fork isn't usually necessary. That creates a separate repo. – isherwood Sep 23 '21 at 18:15
  • Forking. This strategy was suggested to prevent branch clutter (devs not deleting branches that have already been merged), and there are a few other issues that I won't go into detail on. When a fork is created, devs can make changes locally and push them to their own repo. That way, only the base branches exist on the master repo. But the question that's coming up is whether this method will take up a substantial amount of storage space. – Julian Sep 23 '21 at 18:47
  • Thanks. It wasn't clear to me until VonC's answer that you're talking about storage _at GitHub_. That's not something I've faced before so it wasn't in my mind. I made a small revision to help others. – isherwood Sep 23 '21 at 18:50

1 Answer

A fork on GitHub does not duplicate the full repository on the GitHub server side, as explained in "Counting Objects" by Vicent Martí (2015):

> Very early on we figured out that actually forking people’s repositories was not sustainable.
>
> For instance, there are almost 11,000 forks of Rails hosted on GitHub: if each one of them were its own copy of the repository, that would imply an incredible amount of redundant disk space, requiring several times more fileservers than the ones we have in our infrastructure.
>
> That’s why we decided to use a feature of Git called alternates.
>
> When you fork a repository on GitHub, we create a shallow copy of it.
> This copy has no objects of its own, but it has access to all the objects of an alternate, a root repository we call network.git and which contains the objects for all the forks in the network.
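
To make the storage argument concrete, here is a minimal toy model in Python of the idea described in the quote. It is not GitHub's implementation and does not use Git itself: `ObjectStore`, `Fork`, and `simulate` are hypothetical names invented for this sketch, and a plain SHA-1 of the content stands in for Git's real object IDs. It only illustrates why forks that hold references into one shared store add no storage until someone pushes new content.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store standing in for the shared network.git."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        # Identical content hashes to the same ID, so it is stored only once.
        oid = hashlib.sha1(data).hexdigest()
        self.objects[oid] = data
        return oid

    def size(self) -> int:
        return sum(len(data) for data in self.objects.values())


class Fork:
    """Toy fork: keeps only refs (branch -> object ID), no objects of its own."""

    def __init__(self, network: ObjectStore, refs=None):
        self.network = network
        self.refs = dict(refs or {})

    def commit(self, branch: str, data: bytes) -> None:
        self.refs[branch] = self.network.put(data)

    def read(self, branch: str) -> bytes:
        return self.network.objects[self.refs[branch]]


def simulate(num_forks: int, base: bytes):
    """Return shared-store size before forking, after forking, after pushes."""
    network = ObjectStore()
    parent = Fork(network)
    parent.commit("main", base)
    size_before = network.size()

    # Forking copies refs only; the objects stay in the shared store.
    forks = [Fork(network, parent.refs) for _ in range(num_forks)]
    size_after_forking = network.size()

    # Each developer pushes one small, unique change to their own fork.
    for i, fork in enumerate(forks):
        fork.commit("feature", f"change {i}".encode())

    return size_before, size_after_forking, network.size()
```

Calling `simulate(100, b"x" * 1000)` in this model returns the same store size before and after the 100 forks are created; the size grows only by the new content each fork pushes afterwards.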

VonC
  • Please correct me if I'm wrong: these shallow copies would just have references to the root repository's objects, so there is no real duplication of objects occurring? – Julian Sep 23 '21 at 18:50
  • Exactly, that is the idea. – VonC Sep 23 '21 at 20:38