1

I have a git repository containing files that may have sensitive data hardcoded in them, or that formerly had it hardcoded so that it now resides at some points in the git history.

In the interest of making the project publicly available so that programmers with similar interests can benefit from it and contribute changes back, I want to fork it and sanitize the offending files.

The procedure I considered was as follows:

  1. Shallow/shared clone the repo to a new local location. This folder will become the public variant, and all subsequent steps happen in the new repo.
  2. Branch the master into a branch public-master
  3. Remove all other branch refs.
  4. Sanitize public-master
  5. Squash public-master
  6. git reflog expire --expire-unreachable=now --all && git gc --prune=all --aggressive to remove all unreachable objects, which by now means any object not in the public branch
  7. git push the public-master branch back upstream into the private repository.
  8. Set the origin remote to the public repo URL, rename the branch to master, and push to origin.
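
As a rough illustration, the steps above might look like the following on a throwaway repository. Everything in it is hypothetical: the temp directory, the file config.txt, the demo identity and the fake password. It uses a plain local clone rather than a shared one, on the assumption that a shared clone borrows objects from the private repo through alternates, where gc in the clone cannot prune them. The squash is done with git commit-tree, which is only one of several ways to end up with a single root commit.

```shell
#!/bin/sh
# Sketch only: all paths, file names and the "secret" are made up.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Stand-in private repo with a hardcoded secret in its history.
git init -q "$work/private"
cd "$work/private"
echo 'password = "hunter2"' > config.txt
git add config.txt
git commit -q -m "add config with hardcoded password"
secret_blob=$(git rev-parse HEAD:config.txt)

# 1. Clone to the location that will become the public variant.
git clone -q "$work/private" "$work/public"
cd "$work/public"

# 2-3. Branch, then drop every other ref (a real repo may also have tags).
orig_branch=$(git symbolic-ref --short HEAD)
git checkout -q -b public-master
git branch -q -D "$orig_branch"
git remote remove origin

# 4. Sanitize the working tree and commit the result.
echo 'password = "<redacted>"' > config.txt
git commit -q -am "remove hardcoded password"

# 5. Squash: one parentless commit carrying the sanitized tree.
root=$(git commit-tree -m "Initial public commit" 'HEAD^{tree}')
git update-ref refs/heads/public-master "$root"

# 6. Expire all reflog entries, then prune everything unreachable.
git reflog expire --expire=now --expire-unreachable=now --all
git gc -q --prune=now --aggressive

# 7. Push the sanitized branch back into the private repository.
git push -q "$work/private" public-master

# 8. (Not run here) add the real public URL as origin and push.
```

Checking afterwards that git cat-file -t "$secret_blob" fails inside the public clone is a quick way to confirm the old blob is really gone.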

Is this sufficient to sanitize my repo, or would it still be possible to recover sensitive data afterwards? Is there a more sensible and common way to solve this problem? Are any of the steps extraneous?

For example, can I do this all in one repository, or does the nature of git packs mean I might still push an object that contains sensitive information?
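
One way to probe this before pushing is to scan every object in the local object database, reachable or not, for a known sensitive string. The helper below is a minimal sketch under that assumption; find_leaks is a hypothetical name, and it only catches strings you already know to look for.

```shell
# Hypothetical helper: print the id and type of every object (reachable
# or not, loose or packed) whose content contains the given fixed string.
# Run it from inside the repository to be checked.
find_leaks() {
    pattern=$1
    git cat-file --batch-all-objects \
        --batch-check='%(objectname) %(objecttype)' |
    while read -r oid type; do
        # Dump the object in its own type and search it as plain bytes.
        if git cat-file "$type" "$oid" | grep -q -F -e "$pattern"; then
            echo "$oid $type"
        fi
    done
}
```

Usage would be something like find_leaks 'hunter2' from the repository root; empty output means no stored object contains that string.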

awiebe
  • To make it public, creating a repository from scratch with only the latest (sanitized) commit is obviously better. Also, sensitive data normally goes in a separate file that is listed in .gitignore from the very first commit. – b-fg Oct 25 '18 at 04:14
  • I know that, but it's a hobby project, so I used sloppy opsec. Hence why a retroactive method is necessary here. – awiebe Oct 25 '18 at 04:19
  • Ah I see, yes after having done all these steps that is really equivalent to having just produced the sanitized version and then copying the working tree into a brand new repo. The only problem is I want to be able to pull from the private repo, and then they would have unshared history. – awiebe Oct 25 '18 at 04:21
  • @b-fg . So I guess perhaps a better question is, given a brand new repo, how can I graft the new public branch into my old private repo. Then new features are put in the public repo, and pulled back into the private one. – awiebe Oct 25 '18 at 04:24

2 Answers

2

The only problem is I want to be able to pull from the private repo, and then they would have unshared history.

That seems unavoidable, since you have changed the branch history and squashed it.

Instead of pulling from the new public repo, I would simply review the changes made in the new repo clone and decide which ones I want to add to the local clone of the old private repo:

# update local content of new repo
cd /path/to/public/repo 
git pull

# check what needs to be added
cd /path/to/clone/of/old/repo
git --work-tree=/path/to/public/repo add -p .

You will see the diffs between old and new, hunk by hunk, including any new work done on the public repo.

VonC
  • Oh, you know I didn't get why the work-tree arg was there before. I think the proper solution is nearly here. To be clear I would only pull public changes back into the old repo, not the other way around as that would risk contaminating public objs. – awiebe Oct 25 '18 at 04:57
  • @awiebe OK, then reverse the paths. – VonC Oct 25 '18 at 05:00
  • I'm going to accept this for now and write up a more detailed account of what I finally wound up doing. – awiebe Oct 25 '18 at 05:10
2

Combining @VonC's and @b-fg's suggestions, I think the most sensible solution is as follows. Because it is very easy to contaminate a public repository derived from the private one with objects that may contain sensitive data, build a brand-new repository to serve as the public one instead.

  1. Branch the private repository into public
  2. Sanitize public
  3. Init a new repo for public.
  4. In the new repo, run git --work-tree=/path/to/private add -A . so that git uses the public index with the private, sanitized working tree (every file is untracked in a brand-new repository, and add -p only offers hunks from tracked files). The public repo now has all of the sanitized branch's working tree staged, so git commit.
  5. The new repository now has the sanitized tree in its index and in the commit, but its own working tree is still empty, so to git it looks as if every file had been deleted. "Restore" the files to the new repo's working tree with git reset --hard
  6. Switch back to the private repository and add the public repository as a remote: git remote add public file:///path/to/public/repo
  7. The histories of the private repo's public branch and public/master are now disjoint, so we need to graft them together. Fetch the remote with git fetch public, set the upstream of the private public branch with git branch -u public/master public, then pull allowing unrelated histories: git pull --allow-unrelated-histories
  8. Make the public remote fetch-only by giving it an invalid push URL, to prevent accidental contamination of the public repo: git remote set-url --push public "This Branch is Read-Only"
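
Put together, the steps above could be sketched roughly as follows on a throwaway pair of repositories. The temp directory, settings.txt, the fake key and the demo identity are all assumptions made for illustration. It stages with add -A because nothing is tracked yet in the new repo, and passes --no-rebase to git pull on the assumption that recent git versions refuse to pull divergent branches without an explicit strategy.

```shell
#!/bin/sh
# Sketch only: all paths, file names and the "secret" are made up.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Stand-in private repo with a hardcoded key in its history.
git init -q "$work/private"
cd "$work/private"
echo 'api_key = "s3cr3t"' > settings.txt
git add settings.txt
git commit -q -m "initial commit with hardcoded key"
secret_blob=$(git rev-parse HEAD:settings.txt)

# 1-2. Branch the private repo into "public" and sanitize it.
git checkout -q -b public
echo 'api_key = "<set me>"' > settings.txt
git commit -q -am "remove hardcoded key"

# 3. Brand-new repository that will be published.
git init -q "$work/public"
cd "$work/public"

# 4. Public index, private (sanitized) working tree; then commit.
git --work-tree="$work/private" add -A .
git commit -q -m "Initial public commit"

# 5. Materialize the committed files in the public working tree.
git reset -q --hard

# 6. Back in the private repo, register the public repo as a remote.
cd "$work/private"
git remote add public "file://$work/public"

# 7. Fetch, track the public branch, and join the unrelated histories.
git fetch -q public
pub_branch=$(git -C "$work/public" symbolic-ref --short HEAD)
git branch -q -u "public/$pub_branch" public
git pull -q --no-rebase --no-edit --allow-unrelated-histories

# 8. Give the remote an unusable push URL so it is fetch-only.
git remote set-url --push public "This Branch is Read-Only"
```

Because the public repository never shares an object store with the private one, the only objects it can contain are those created from the sanitized files in step 4.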

From now on, develop new features only in the public repository, and pull them back into the private one as required.

awiebe