
I originally had an SVN repository with multiple projects in it.

I converted it to a Git repository using SubGit.

The repository:

MyRepository/
    .git/
    Project1/
    Project2/
    Project3/
    ...

This repository has multiple branches (v1, v2, v3, v4, v5, master).

I'm trying to split my repository into multiple repositories, keeping the full history and branches in each repository.

I was able to do it using this script https://stackoverflow.com/a/26033230/2558653

#!/bin/bash

repoDir="C:\Sources\MyRepository"
folder="Project1"
newOrigin="https://gitlab.com/myUser/project1.git"

cd $repoDir

git checkout --detach
git branch | grep --invert-match "*" | xargs git branch -D

for remote in `git branch --remotes | grep --invert-match "\->"`
do
        git checkout --track $remote
        git add -vA *
        git commit -vam "Changes from $remote" || true
done

git remote remove origin
git filter-branch --prune-empty --subdirectory-filter $folder -- --all

#prune old objects
rm -rf .git/refs/original/*
git reflog expire --all --expire-unreachable=0
git repack -A -d
git prune

#upload to new remote
git remote add origin $newOrigin
git push origin master

for branch in `git branch | grep -v '\*'`
do
        git push origin $branch
done

It worked, but now the repository for my subfolder is 2.7 GB, while the current files are only 20 MB.

I tried this command to list the large files in the repository, and I noticed that some of them are not in my subproject and should have been removed:

git rev-list --objects --all \
  | grep "$(git verify-pack -v .git/objects/pack/*.idx \
           | sort -k 3 -n \
           | tail -10 \
           | awk '{print$1}')"

... 
00a2e4e398bd1805ad2524d86276ee72216c1f67 OtherFolder/Distribution/NsiScripts/file.exe
...
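To see why a stray object like this one survives, it can help to check which refs still reach it. The following is a minimal sketch, not part of the original script: `refs_reaching_blob` is a hypothetical helper name, and it assumes it is run inside the rewritten repository. It walks every ref and reports those whose history still contains the given blob id.

```shell
#!/bin/bash
# refs_reaching_blob is a hypothetical helper, not a Git command.
# Prints every ref whose history still contains the given blob id.
refs_reaching_blob() {
    local blob="$1" ref
    git for-each-ref --format='%(refname)' | while read -r ref; do
        # rev-list --objects prints "<id> <path>" for each reachable object
        if git rev-list --objects "$ref" | grep -q "^$blob"; then
            printf '%s\n' "$ref"
        fi
    done
}
```

For example, `refs_reaching_blob 00a2e4e398bd1805ad2524d86276ee72216c1f67` would list the branches, tags, or `refs/original/` backups that keep that file alive. If nothing is printed, the object is probably only waiting to be pruned.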

Is there a way to adapt the script so that it removes all files that are not in my subfolder and reduces the size? Is that what the script was supposed to do?

Gab
  • The `git filter-branch` command you listed fails to use `--tag-name-filter cat` so it will leave old tags behind. If your repository has tags, these may retain all the old commits. I suspect that's the source of the repository bloat here. – torek Feb 12 '20 at 22:19
  • I don't have any tag – Gab Feb 13 '20 at 16:36
  • I have a lot of merge commits that touch multiple projects. Maybe they are not handled correctly by the script. – Gab Feb 14 '20 at 16:11
  • "This repository has multiple branches (v1, v2, v3, v4, v5, master). I'm trying to split my repository into multiple repositories, keeping the full history and branches in each repository." . . . huh? Do you mean you want one repo per ProjectX subdirectory, with each repo having all (but only) the commits and branches whose histories touch that subdirectory? – jthill Feb 14 '20 at 16:46
  • If you want to create separate commit histories including only the commits that touch a subdirectory, [see here](https://stackoverflow.com/a/54797306/1290731). – jthill Feb 18 '20 at 02:37
  • Thanks. Yes, this is exactly what I want to do. I will try the script and compare the results. – Gab Feb 19 '20 at 06:43

2 Answers


I have a project with a utils library that has started to be useful in other projects, and I wanted to split its history off into a submodule. I didn't think to look on SO first, so I wrote my own script. It builds the history locally, so it's a good bit faster. Afterwards, if you want, you can set up the `git submodule` helper command's .gitmodules file and such, and push the submodule histories themselves anywhere you want.

The stripped command comes first; the documentation is in the comments of the unstripped version that follows. Run it as its own command with subdir set, e.g. subdir=utils git split-submodule if you're splitting the utils directory. It's hacky because it's a one-off, but I tested it on the Documentation subdirectory in the Git history.

#!/bin/bash
# put this or the commented version below in e.g. ~/bin/git-split-submodule
${GIT_COMMIT-exec git filter-branch --index-filter "subdir=$subdir; ${debug+debug=$debug;} $(sed 1,/SNIP/d "$0")" "$@"}
${debug+set -x}
fam=(`git rev-list --no-walk --parents $GIT_COMMIT`)
pathcheck=(`printf "%s:$subdir\\n" ${fam[@]} \
    | git cat-file --batch-check='%(objectname)' | uniq`)
[[ $pathcheck = *:* ]] || {
    subfam=($( set -- ${fam[@]}; shift;
        for par; do tpar=`map $par`; [[ $tpar != $par ]] &&
            git rev-parse -q --verify $tpar:"$subdir"
        done
    ))
    git rm -rq --cached --ignore-unmatch  "$subdir"
    if (( ${#pathcheck[@]} == 1 && ${#fam[@]} > 1 && ${#subfam[@]} > 0)); then
        git update-index --add --cacheinfo 160000,$subfam,"$subdir"
    else
        subnew=`git cat-file -p $GIT_COMMIT | sed 1,/^$/d \
            | git commit-tree $GIT_COMMIT:"$subdir" $(
                ${subfam:+printf ' -p %s' ${subfam[@]}}) 2>&-
            ` &&
        git update-index --add --cacheinfo 160000,$subnew,"$subdir"
    fi
}
${debug+set +x}

#!/bin/bash
# Git filter-branch to split a subdirectory into a submodule history.

# In each commit, the subdirectory tree is replaced in the index with an
# appropriate submodule commit.
# * If the subdirectory tree has changed from any parent, or there are
#   no parents, a new submodule commit is made for the subdirectory (with
#   the current commit's message, which should presumably say something
#   about the change). The new submodule commit's parents are the
#   submodule commits in any rewrites of the current commit's parents.
# * Otherwise, the submodule commit is copied from a parent.

# Since the new history includes references to the new submodule
# history, the new submodule history isn't dangling, it's incorporated.
# Branches for any part of it can be made casually and pushed into any
# other repo as desired, so hooking up the `git submodule` helper
# command's conveniences is easy, e.g.
#     subdir=utils git split-submodule master
#     git branch utils $(git rev-parse master:utils)
#     git clone -sb utils . ../utilsrepo
# and you can then submodule add from there in other repos, but really,
# for small utility libraries and such, just fetching the submodule
# histories into your own repo is easiest. Setup on cloning a
# project using "incorporated" submodules like this is:
#   setup:  utils/.git
#
#   utils/.git:
#       @if _=`git rev-parse -q --verify utils`; then \
#           git config submodule.utils.active true \
#           && git config submodule.utils.url "`pwd -P`" \
#           && git clone -s . utils -nb utils \
#           && git submodule absorbgitdirs utils \
#           && git -C utils checkout $$(git rev-parse :utils); \
#       fi
# with `git config -f .gitmodules submodule.utils.path utils` and
# `git config -f .gitmodules submodule.utils.url ./`; cloners don't
# have to do anything but `make setup`, and `setup` should be a prereq
# on most things anyway.

# You can test that a commit and its rewrite put the same tree in the
# same place with this function:
# testit ()
# {
#     tree=($(git rev-parse `git rev-parse $1`: refs/original/refs/heads/$1));
#     echo $tree `test $tree != ${tree[1]} && echo ${tree[1]}`
# }
# so e.g. `testit make~95^2:t` will print the `t` tree there and if
# the `t` tree at ~95^2 from the original differs it'll print that too.

# To run it, say `subdir=path/to/it git split-submodule` with whatever
# filter-branch args you want.

# $GIT_COMMIT is set if we're already in filter-branch, if not, get there:
${GIT_COMMIT-exec git filter-branch --index-filter "subdir=$subdir; ${debug+debug=$debug;} $(sed 1,/SNIP/d "$0")" "$@"}

${debug+set -x}
fam=(`git rev-list --no-walk --parents $GIT_COMMIT`)
pathcheck=(`printf "%s:$subdir\\n" ${fam[@]} \
    | git cat-file --batch-check='%(objectname)' | uniq`)

[[ $pathcheck = *:* ]] || {
    subfam=($( set -- ${fam[@]}; shift;
        for par; do tpar=`map $par`; [[ $tpar != $par ]] &&
            git rev-parse -q --verify $tpar:"$subdir"
        done
    ))

    git rm -rq --cached --ignore-unmatch  "$subdir"
    if (( ${#pathcheck[@]} == 1 && ${#fam[@]} > 1 && ${#subfam[@]} > 0)); then
        # one id same for all entries, copy mapped mom's submod commit
        git update-index --add --cacheinfo 160000,$subfam,"$subdir"
    else
        # no mapped parents or something changed somewhere, make new
        # submod commit for current subdir content.  The new submod
        # commit has all mapped parents' submodule commits as parents:
        subnew=`git cat-file -p $GIT_COMMIT | sed 1,/^$/d \
            | git commit-tree $GIT_COMMIT:"$subdir" $(
                ${subfam:+printf ' -p %s' ${subfam[@]}}) 2>&-
            ` &&
        git update-index --add --cacheinfo 160000,$subnew,"$subdir"
    fi
}
${debug+set +x}
Rajesh kumar R

In the end, the script I used was working, but I was processing all branches. My "Project2" folder did not exist in some of the old branches, so it kept all the folders for those branches...

So I had to take only the branches where "Project2" exists:

for remote in `git branch --remotes | grep --invert-match "\->\|origin/v1\|origin/v2\|origin/v3"`
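Rather than hard-coding the branches to exclude, the list could be derived from the history. This is only a sketch under assumptions: `branches_with_folder` is a hypothetical helper name, and it treats a branch as relevant when at least one commit reachable from it touches the folder, which may be broader than "the folder exists at the branch tip".

```shell
#!/bin/bash
# branches_with_folder is a hypothetical helper, not a Git command.
# Prints remote-tracking branches (e.g. origin/v4) where at least one
# reachable commit touches the given folder.
branches_with_folder() {
    local folder="$1" ref
    git for-each-ref --format='%(refname)' refs/remotes | while read -r ref; do
        case "$ref" in */HEAD) continue ;; esac   # skip the origin/HEAD alias
        # rev-list -1 prints a commit only if the path appears in the history
        if [ -n "$(git rev-list -1 "$ref" -- "$folder")" ]; then
            printf '%s\n' "${ref#refs/remotes/}"
        fi
    done
}
```

The loop header could then read `for remote in $(branches_with_folder Project2)`, assuming branch names contain no whitespace.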

And if someone wants to keep multiple folders when splitting, it can be done using:

git filter-branch --index-filter 'git rm --ignore-unmatch --cached -qr -- . && git reset -q $GIT_COMMIT -- Project1/ Project2/' --prune-empty -- --all
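After a rewrite like this, one way to confirm that only the intended folders remain is to list every top-level entry that appears in any commit on any ref. A minimal sketch, with `top_level_entries` as a hypothetical helper name; it calls `git ls-tree` once per commit, so it may be slow on a large history.

```shell
#!/bin/bash
# top_level_entries is a hypothetical helper, not a Git command.
# Prints the distinct top-level names found in any commit reachable
# from any ref (branches, tags, refs/original backups, ...).
top_level_entries() {
    git rev-list --all | while read -r commit; do
        git ls-tree --name-only "$commit"
    done | sort -u
}
```

If the output contains anything other than Project1 and Project2, some ref still points at unfiltered history.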
Gab