
BRIEF:

How do you ensure that there is no unsaved work in any DVCS's distributed repository clones?

I am thinking about this specifically for Mercurial, but it also applies to git, bzr, etc.

DETAIL:

Back in the bad old days I used to run cron jobs that did roughly the following - approximate, because I may not remember the CVS commands exactly:

# find all checked-out CVS trees, then in each working directory:
find / -name CVS -type d | while read -r d; do
    ( cd "$(dirname "$d")" &&
      cvs -n update 2>/dev/null | grep '^M' )   # modified files not yet committed to the central repo
done

(These days were bad (1) because we were using CVS, and (2) because from time to time I was the guy in charge of making sure nothing got lost. OK, that last was not so bad, but ulcerating.)

Q: How do I do the equivalent for a modern DVCS like Mercurial? I thought it was easy, but on closer inspection there are pieces missing:

I started off by doing something like

# find all ...path/.hg directories, and then look at ...path
find ~ -name .hg -type d | while read -r d; do
    repo=$(dirname "$d")
    hg -R "$repo" status     # look at the output - this is easy enough
    hg -R "$repo" outgoing   # this is where it gets interesting
done

You might think that doing an hg outgoing is good enough. But it isn't necessarily.

Consider:

cd workspace-area
hg clone master repo1
hg clone repo1 repo2
rm -rf repo1
hg clone repo2 repo1

Now repo1's default path is repo2, and vice versa.
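With that workspace, hg paths in each clone now shows the cycle (sketched output; the actual paths depend on where workspace-area lives):

$ hg paths -R repo1
default = .../workspace-area/repo2
$ hg paths -R repo2
default = .../workspace-area/repo1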

Of course, this won't happen if you have the right workflow - if you only ever clone from something upstream of you, never from a peer. But... lightweight cloning is part of the reason to use a DVCS. Plus, it has already happened to me.

To handle this problem, I usually have an hg path somewhere, set up in my ~/.hgrc, set to some project-master URL. This works fine - for that one project. Not so fine if you have many, many projects. Even if you call them project1-master, project2-master, etc., there just get to be a lot of them. Worse still if subrepos are proliferating because of libraries that want to be shared between projects.

Also, this has to be in the user's .hgrc. Or a site .hgrc. Not so good for somebody who may not have that .hgrc set up - like an admin who doesn't know the ins and outs of each of several dozen (or hundreds) of projects on his systems - but who still wishes to do his users the favor of finding stale work. (They may have come to expect it.) Or if you simply want to give standard instructions as to how to do this.

I have considered putting the name of some standard master repo for the project (or a list) in a text file, checked into the repo. Say repo/.hg_master_repos. This looks like it may work, although it has some issues (you may only see the global project master, not an additional local project master. I don't want to explain more than that.).
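For what it's worth, here is a minimal sketch of how a checker could consume such a file - assuming the name repo/.hg_master_repos and a one-URL-per-line format, which are just my proposal, not any standard:

#!/bin/sh
# Check one repo against every master listed in its .hg_master_repos file.
# Assumption: one URL (or hg path alias) per line; blank lines are skipped.
repo=$1
masters="$repo/.hg_master_repos"
if [ ! -f "$masters" ]; then
    echo "Warning: $repo has no .hg_master_repos" >&2
    exit 1
fi
while read -r url; do
    [ -z "$url" ] && continue
    echo "=======  hg -R $repo outgoing $url"
    hg -R "$repo" outgoing "$url"
done < "$masters"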

But... before I do this, is there any standard way of doing this?


By the way, here is what I have so far:

#!/usr/bin/perl
use strict;
use warnings;

# check to see if there is any unsaved stuff in the hg repo(s) on the command line

# -> hg status, looking for Ms, etc.
#        for now, just send it all to stdout, let the user sort it out

# -> hg outgoing
# issue: whom to check outgoing against?
#   generic
#      a) hg outgoing
#           but note that I often make default-push disabled
#           also, may not point anywhere useful, e.g.
#               hg clone master r1
#               hg clone r1 r2
#               rm -rf r1
#               hg clone r2 r1
#           plus, repos that are not clones, masters...
#      b) hg outgoing default-push
#      c) hg outgoing default
#   various repos specific to me or my company


foreach my $a ( @ARGV ) {
    print "**********  $a\n";
    $a =~ s|/\.hg$||;
    if( ! -e "$a/.hg" ) {
        print STDERR "Warning: $a/.hg does not exist, probably not a Mercurial repository\n";
    }
    else {
        foreach my $cmd (
                 "hg status",
                 # generic
                 "hg outgoing",
                 "hg outgoing default-push",
                 "hg outgoing default",
                 # specific
                 "hg outgoing PROJECT1-MASTER",
                 "hg outgoing MY-LOCAL-PROJECT1-MASTER",
                 "hg outgoing PROJECT2-MASTER",
                 # maybe go through all paths?
                 # maybe have a file that contains some sort of reference master?
                )
          {
              my $cmd_args = "$cmd -R $a";
              print "=======  $cmd_args\n";
              system($cmd_args);
          }
    }
}

As you can see, I haven't adorned it with anything to parse what it gets - just letting the user, me, eyeball it.

But just doing

find ~ -name '*.hg' | xargs ~/bin/hg-any-unsaved-stuff.pl

found a lot of suspiciously unsaved stuff that I did not know about.

Old unsaved changes reported by hg status are highly suspicious. Unpushed work reported by hg outgoing is suspect, but perhaps not so bad for somebody who thinks of a clone as a branch. However, I prefer not to have a diverged clone live forever, but to put things onto branches so that somebody can see all the history by cloning from one place.

BOTTOM LINE:

Is there a standard way of finding unsaved work, un-checked-in and/or unpushed, that is not vulnerable to the sorts of cycles I mention above?

Is there some convention for recording the "true" project master repo in a file somewhere?

Hmm... I suppose if the repos involved in pushes and clones and checkins were recorded somewhere, I could make some guesses as to what the proper project masters might be.
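For example, hooks along these lines in an hgrc could build up such a record. A sketch only: outgoing and changegroup are real Mercurial hooks, but whether $HG_URL is set for each of them depends on your Mercurial version, so treat that as an assumption to verify against hg help config:

[hooks]
# log every push target and pull source to a flat file (~/.hg-peer-log is arbitrary)
outgoing.logpeer = echo "out: $PWD -> $HG_URL" >> ~/.hg-peer-log
changegroup.logpeer = echo "in:  $PWD <- $HG_URL" >> ~/.hg-peer-log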

Krazy Glew
  • Surely if any repo has changes not committed to *its* upstream, then you know it has unsaved work? – tc. Jul 21 '12 at 03:25
  • @tc: I tried to explain this in my question. hg clone master r1; hg clone r1 r2; rm -rf r1; hg clone r2 r1 // Anyway, whether or not cycles in repos can be solved, my question remains: is there a standard way of doing this. Ideally one that handles cycles, but I suspect not. // E.g. if you are looking at a directory full of 100 workspaces, how to determine which ones have unsaved work? Especially if many are clones of clones, and whose default, the repo they were cloned from, may no longer exist. – Krazy Glew Jul 21 '12 at 03:54
  • @tc: and, yes, the answer may be "bad workflow, don't do that". But bad workflow happens. Perhaps my scripts should look for ill-formed repository graphs - cycles, parent repos (default and default-push) not existing, etc. - and warn about that. – Krazy Glew Jul 21 '12 at 03:58
  • I'd count that as "bad hg"; the idea that you need to keep track of copies of repositories to do branching seems a little backwards. That said, you can just follow the link to the upstream repository until you either hit a remote repository or a cycle (e.g. via the tortoise-and-hare algorithm or Brent's algorithm). – tc. Jul 21 '12 at 04:20
  • @tc: I think that keeping copies of repositories as records of experimental but failed branches is backwards. But many Mercurial and other DVCS folk have the idea that named branches are unnecessary, you just need to keep clones. Plus, if changes have been made in a Mercurial repo on an unnamed branch that turns out to be a dead-end, it is annoying to push them. In Mercurial you cannot retroactively change the name of the branch that changesets belong to. (I need to try mq for stuff like this.) – Krazy Glew Jul 21 '12 at 14:38
  • Please edit the question to make it more precise and concise and avoid long discussions in the comments. @tc: You don't need clones for branches in Mercurial - that's just a workflow some developers like. Many people use named branches or bookmarks instead. – Martin Geisler Jul 23 '12 at 02:16

2 Answers


Here's what you can do:

  1. Identify the possible central repositories on your server.

  2. Iterate over repositories on the client to match them up with central repositories.

  3. Run hg outgoing against the central repository you found.

A bit more detail (a sketch of all three steps follows this list):

  1. I assume you have a central place for your repositories, since otherwise your question becomes moot. Now, a repository can be identified by the root changeset. This changeset will be revision zero and you can get its full ID like this:

    $ hg log -r 0 --template "{node}"
    

    Run a script on the server that dumps a list of (node, URL) pairs into a file that is accessible by the clients. The URLs will be the push targets.

  2. Run a script on the clients that first downloads the (node, URL) list from the server and then identifies all local repositories and the corresponding push URL on the server.

  3. Run hg outgoing URL with the URL you found in the previous step. You can (and should!) use a full URL with hg outgoing so that you avoid depending on any local configuration done on the client. That way you avoid dealing with default and default-push paths and since the URL points back to the server you know that it's a good URL to compare with.

    If the server has multiple clones of the same repository, then there will be several different URLs to choose from. You can then either try them all and use the one with fewest outgoing changesets for your report or you can side-step the issue by combining the clones on the server-side (by pulling changesets from all the clones into a single repository) and then compare against this combined repository.
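A sketch of all three steps in shell. Everything except hg log -r 0 --template "{node}" is an assumption: the /srv/hg layout, the name and two-column format of the map file, and how it is published to clients.

# --- on the server: dump (root node, push URL) pairs, one repo per line ---
for repo in /srv/hg/*; do
    node=$(hg -R "$repo" log -r 0 --template '{node}')
    echo "$node ssh://hg@server/$(basename "$repo")"
done > /var/www/html/hg-repo-map.txt

# --- on each client: match local repos by root node, then run hg outgoing ---
curl -s http://server/hg-repo-map.txt > /tmp/hg-repo-map.txt
find ~ -name .hg -type d | while read -r d; do
    repo=$(dirname "$d")
    node=$(hg -R "$repo" log -r 0 --template '{node}')
    url=$(awk -v n="$node" '$1 == n { print $2 }' /tmp/hg-repo-map.txt)
    if [ -n "$url" ]; then
        hg -R "$repo" outgoing "$url"
    else
        echo "$repo: no matching repository on the server" >&2   # local-only repo
    fi
done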

When you run the script on the client you might have some repositories that are local and don't exist on the server. Your script should handle those: it should probably fire off an email to the developer asking him to create the repository on the server.

Finally, a repository might have more than one root changeset. The above will still work pretty well: all clones done the normal way will keep revision zero the same on both server and client. The script will therefore correctly match up the client repo with the server repo, even with multiple roots.

It is only if a developer runs something like hg clone -r the-other-root ... that the above fails since the other root now becomes revision zero. The repository will thus be seen as a local repo. Your script should handle that anyway, so it's no big deal.

Martin Geisler
  • Thank you! The root changeset - that's the key I was looking for. The thing that allows me to pair up repositories. – Krazy Glew Jul 23 '12 at 06:45

If all your concern is data loss and you are using git, then just create a repository, add all your existing repositories as remotes to this one, and run

git fetch --all

This will efficiently back up all the data in all the repositories. It also backs up the current snapshot of all references.
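Here is a sketch of the registration step, since adding dozens of repositories as remotes by hand would be unwieldy. The ~/backup.git and ~/work paths are assumptions, and remote names are derived from directory names, so collisions would need extra handling:

git init --bare ~/backup.git
find ~/work -name .git -type d | while read -r d; do
    repo=$(dirname "$d")
    git -C ~/backup.git remote add "$(basename "$repo")" "$repo"   # one remote per repo
done
git -C ~/backup.git fetch --all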

Oleksandr Pryimak
  • Seems good. I am now looking for the Mercurial equivalent. // However, even in git I am wondering how one does the "add all created repositories as remotes to this one" - it sounds unwieldy, especially since I have worked with some guys who prefer using clones to branches, and who may therefore have dozens of cloned repositories. (Imagine creating a new clone for what amounts to every task branch - and then getting interrupted, switching to a different task-branch/clone, and repeat.) – Krazy Glew Jul 21 '12 at 13:54
  • // I suppose that I will need a hook. // I also suppose that I will need a convention as to what the repo that I will push all of the data will be. Hmm, same problem as before: oftentimes I want more than one such collect-all-the-unsaved-data repos: one for the overall project, and one per user or team. – Krazy Glew Jul 21 '12 at 13:58
  • One thing I should make clear: I work with a team that encourages making many, many, many, ad-hoc clones. One of them was working full-time on DVCS before git and Mercurial existed (heck, I was working with monotone and darcs and bitkeeper before git and mercurial, but not full-time). // Also, this many clones approach is suggested by one of the standard approaches for history editing in Mercurial: make a clone containing only the ancestors of a rev, effectively pruning out unmerged side branches. And then make changes to that. – Krazy Glew Jul 21 '12 at 14:07
  • Furthermore, given the lack of partial checkins and checkouts, I have been moving towards hg subrepos for every library or module that is shared individually. Which basically means a proliferation of repos. Anything where I have to "add all created repos as remotes" by hand is unfeasible. // (CVS and SVN, with their support for partial checkouts and checkins, needed fewer subrepos than any of the DVCSes I have so far used.) – Krazy Glew Jul 21 '12 at 14:10
  • @KrazyGlew can you use a custom clone script? If you can, then there is actually no problem in registering all clones. – Oleksandr Pryimak Jul 21 '12 at 22:56
  • AP (BTW, AFAICT I do not need to @ you - right): *I* can use a custom clone script. But (a) that is not retroactive - some of these repos already exist, and (b) it is harder to ensure that all folks I deal with are using the same custom clone script. So I would prefer a solution that did not require a custom clone script, but I would fall back to it if necessary. – Krazy Glew Jul 22 '12 at 17:46
  • Over the years, I have learned that solutions that involve "just wrapperizing XXX" are unsatisfactory, for the reason mentioned above - how do you ensure that everyone is using the wrapperized version of the tool - unless you can also enforce it by disabling the original XXX except when called by the wrapper, and substituting the wrapper in its place. – Krazy Glew Jul 22 '12 at 17:48