How to deal with a large number of nested CVS projects

Question

Never done this before, so I'm probably just being a noob... I'm trying to migrate our aged CVS repository to GitLab and I'm not sure how to handle the nested CVS projects. We have a LOT of them (about 1600 .project files dotted through the CVS repo). There's about 10 years' worth of commits, totalling about 21GB, over two CVS repository directories.

The general structure is $client/$product, but most of these contain a bunch of subprojects - often very many.

What I've tried so far:

1. Monolithic: tried to import the smaller CVS repo. The first run ran out of memory on pass 1 (solved by adding memory) and the second ran out of disk space on pass 5 (can't really add disk as the vmware datastores are nearly full - don't ask!).
2. By client: cvs2git completed on one client, and I then ran git fast-import, but only afterwards noticed all the sub-projects. Git doesn't care about the merged history, but our coders will. Read up on git submodules, but I'm not sure this is what I need, as the entire project is normally within the same CVS repo, and I see it complicates the process of cloning the project.
3. By project within client: using the productions (blob and dump files) from (2), recursed the CVS repo depth-first with find, looking for .project files; created a subdirectory for each and did a git init --bare in each, before importing the sub-projects with git fast-import. This took ages, as I believe it has to munge the entire cvs2git blob and dump files every time, and I'm not sure I ended up with a proper git hierarchy.
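Before choosing between approaches like these, it may help to measure how the .project files actually nest. The following is a minimal Python sketch, not a tested migration tool. It assumes the repository is readable on disk, that a project root is any directory holding a `.project,v` (or plain `.project`) file, and that `CVSROOT` and `Attic` directories can be skipped. The function names are hypothetical.

```python
import os
from pathlib import Path


def find_project_dirs(cvs_root):
    """Return the directories under cvs_root that hold a .project file."""
    found = []
    for dirpath, dirnames, filenames in os.walk(cvs_root):
        # CVS stores each versioned file as "<name>,v"; accept the plain name too.
        if ".project,v" in filenames or ".project" in filenames:
            found.append(Path(dirpath))
        # Assumption: skip CVS admin files (CVSROOT) and files removed
        # from the trunk (Attic) by pruning them from the walk.
        dirnames[:] = [d for d in dirnames if d not in ("CVSROOT", "Attic")]
    return sorted(found)


def split_nested(project_dirs):
    """Split project dirs into (outermost, nested).

    A directory counts as nested when one of its ancestors is also a
    project directory.
    """
    all_dirs = set(project_dirs)
    outermost, nested = [], []
    for d in project_dirs:
        if any(parent in all_dirs for parent in d.parents):
            nested.append(d)
        else:
            outermost.append(d)
    return outermost, nested
```

Calling `split_nested(find_project_dirs("/path/to/cvsrepo"))` (the path is a placeholder) gives two lists. If most of the 1600 projects land in the nested list, converting only the outermost directories would keep each sub-project's history inside its parent's git repository.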

So... rather than floundering round any more, I thought I'd ask here as I'm sure someone else must have needed to do this kind of thing. Any pointers greatly appreciated.

[edit]: Thanks for all the suggestions and help, people. It's out of my hands now - they (the devs) have decided to migrate the CVS projects piecemeal as they work them, so the majority will probably never be moved. The old cvs will be kept round as a read-only reference, for that purpose, and projects will be checked-in to git "pristine" so for any "BG" (before git) history, they will refer to cvs, but for "AG" history, they will consult git.

As for the issue of the deeply nested projects, the explanation I was given is that it relates to Java class hierarchies, and each project equates to one class. There's something in their build process that automatically changes cvs projects into java .jar files or something like that. There's a LOT of java in there.

I don't know if this advice holds true anymore, but in the past it was advisable to first convert to SVN and then to Git. This was because cvs2svn and svn2git were better developed than cvs2git. — Schwern, Dec 23 '16 at 12:19
The official cvs2git docu says to just go straight with cvs2git. I'm not sure if I've got room for 3 repositories on the server - I'm having enough space issues with just the 2 ;-) — andydj, Dec 23 '16 at 13:42
The tigris cvs2git is a kind of fork of the cvs2svn project and is roughly equivalent to an almost-cvs2svn conversion followed by a fast-import into git. — Mort, Dec 26 '16 at 07:13
Just out of curiosity, what is the size of your current checkout? — max630, Dec 27 '16 at 15:35
Sorry for the late reply. The entire *repo* is about 8GB, but the productions (blob and dump files) from cvs2git are much larger. For instance, for one of them (there are two top-level repos, side-by-side) they total about 15GB. — andydj, Jan 20 '17 at 07:47

score 0 · Answer 1 · answered Dec 26 '16 at 07:26

I'm not quite sure what you're asking, but here are some comments, hopefully one or more of which will answer your question.

Did you want to convert each individual project separately to git? I can't really tell from your question. But if you do, you can just copy each project's directory tree and run cvs2git on it. (Or perhaps even just create symlinks to save space, so long as the nesting allows it.) Loop over them one at a time. The simplicity of CVS's server-side back-end file storage is a blessing in this case.

For example, with a layout like this. Note that you could do some sort of recursive copy rather than a symlink.

/opt/cvsrepos/CVSROOT
             /path/to/project1
                     /project2

/opt/convertrepos/CVSROOT #dummy empty directory to fool cvs2git
                 /project1 -> /opt/cvsrepos/path/to/project1
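The layout above could be scripted roughly as follows. This is a sketch under stated assumptions, not a verified recipe: it assumes a POSIX filesystem, one staging directory per project, and that your cvs2git version follows symlinks (if it does not, swap the symlink for a recursive copy). The `--blobfile`, `--dumpfile` and `--username` options and the concatenation of blob and dump files into `git fast-import` follow the cvs2git documentation; the function names, the `repo.git` name and the directory names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def stage_project(project_dir, staging_root):
    """Build staging_root/<name>/ with an empty CVSROOT and a symlink to the project."""
    project_dir = Path(project_dir).resolve()
    stage = Path(staging_root) / project_dir.name
    # Dummy empty CVSROOT so the staging dir looks like a CVS repository.
    (stage / "CVSROOT").mkdir(parents=True, exist_ok=True)
    link = stage / project_dir.name
    if not link.is_symlink():
        link.symlink_to(project_dir, target_is_directory=True)
    return stage


def conversion_commands(stage, out_dir):
    """Return the command lines for one project; nothing is executed here."""
    out_dir = Path(out_dir)
    blob = out_dir / "git-blob.dat"
    dump = out_dir / "git-dump.dat"
    return {
        "cvs2git": ["cvs2git", f"--blobfile={blob}", f"--dumpfile={dump}",
                    "--username=cvs2git", str(stage)],
        "git_init": ["git", "init", "--bare", str(out_dir / "repo.git")],
        "blob": blob,
        "dump": dump,
    }


def fast_import(repo_dir, blob, dump):
    """Equivalent of: cat blob dump | git fast-import (run inside repo_dir)."""
    proc = subprocess.Popen(["git", "fast-import"], cwd=repo_dir,
                            stdin=subprocess.PIPE)
    for part in (blob, dump):
        with open(part, "rb") as fh:
            shutil.copyfileobj(fh, proc.stdin)
    proc.stdin.close()
    if proc.wait() != 0:
        raise RuntimeError(f"git fast-import failed in {repo_dir}")
```

Because each run sees only one project's files, the blob and dump files stay small and can be deleted before the next project, which may ease the disk pressure. Two projects with the same directory name under different clients would collide in this naming scheme, so a real script would need a more unique staging name.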

Can you just copy the whole cvs repository somewhere else temporarily to do the conversion, where you have more disk space and memory?
Whether you want to create one monolithic repository or lots of separate repositories is a whole opinion-based thing that is beyond the purpose of stackoverflow. It is also not clear to me if these projects require each other or not. If not, then you have more flexibility in that choice.

Thanks - that's really useful. My biggest problem, I think, is that I've only ever made minor use of CVS myself - like occasionally checking out FOSS projects to compile or managing the back end of RANCID (or is that RCV?). Anyway, I don't really "get" cvs enough to know if things like nested projects are a big issue or not - so I'll have to ask the devs. Trouble is, a lot of people have come and gone in the lifetime of this CVS repo, and not many of them are left. But we still have active customers who could ask for support on software we have built for them in the past. — andydj, Jan 03 '17 at 09:37
I'm thinking now, that maybe I should just "freeze" the CVS repo, and import each project "head" into the git repo, without history. Start with a clean slate. — andydj, Jan 03 '17 at 09:39
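A clean-slate import of that kind might look like the sketch below. It only builds the command lines so they can be reviewed before anything runs. It assumes `cvs export -r HEAD` is used so that no CVS admin directories end up in the git tree; the function name, the commit message and the example paths are hypothetical.

```python
from pathlib import Path


def head_import_commands(cvsroot, module, workdir,
                         message="Import from CVS HEAD (older history stays in CVS)"):
    """Return (command, working directory) pairs for a history-free import."""
    workdir = Path(workdir)
    target = workdir / Path(module).name
    return [
        # Export the current HEAD of the module into workdir/<name>.
        (["cvs", "-d", str(cvsroot), "export", "-r", "HEAD",
          "-d", target.name, module], workdir),
        # Start a fresh git repository and commit the exported tree.
        (["git", "init"], target),
        (["git", "add", "-A"], target),
        (["git", "commit", "-m", message], target),
    ]
```

Each pair could then be passed to `subprocess.run(cmd, cwd=cwd, check=True)` once the list looks right for a given project.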

score 0 · Answer 2 · answered Dec 27 '16 at 15:42

Usually it is not possible to preserve all the information contained in a centralized repository, especially one as imperfect as CVS, when converting to git. So I think you should not try to. Preserve the original repository for historical reference, and convert to git only the projects which are currently in development. You don't even have to import the whole 10 years of their history; 2-3 years would be enough.

max630


I have used the tigris cvs2git to successfully convert an enormous CVS repo with 10+ years of history by many many devs, some "interesting" tagging history, and a ton of branches and tags. We had to do a few iterations to fix up various issues we found along the way, but it's entirely possible. – Mort Dec 28 '16 at 00:15
I don't think it's common for git repos to be so deeply nested as this cvs one is. I think it's just an organizational habit/policy for separation of concerns - maybe because of CVS's quirks relating to distributed workflow. I'm beginning to gravitate towards starting afresh in git with a clean checkout of each project, and no history, but retain the CVS repo for historic purposes only. – andydj Jan 03 '17 at 09:43
