
Today I had this really neat idea for backing up my database: put the dump file in a git repository, then commit on each dump so that I have both the most recent copy and the ability to easily roll back to any previous backup. I can also easily pull a copy of the repository on a regular basis to keep a copy on my own computer as a backup of the backups. It definitely sounds clever.

However, I'm aware that clever solutions sometimes have fundamental flaws. What sort of issues might I hit storing mysqldump diffs in git? Is it worth it? What do most people do in order to have multiple database backups on the server and keep redundant copies elsewhere?
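
For concreteness, here's a minimal sketch of the script I have in mind (the paths, database name, and server address are placeholders):

    #!/bin/sh
    # nightly-backup.sh -- dump the database, then commit the dump
    cd /srv/backups/db-repo                        # hypothetical repo location
    mysqldump --single-transaction mydb > dump.sql
    git add dump.sql
    git commit -m "backup $(date +%F)"

    # on my own computer, periodically:
    #   git pull backup-server:/srv/backups/db-repo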

Matchu

4 Answers


Normally you don't keep every backup (or snapshot) forever, but a git repository keeps every commit you ever make. If you ever decide to prune old revisions (say, thinning month-old revisions to one per week and year-old revisions to one per month), you will have to do it with git filter-branch, which rewrites the entire history, followed by git gc to actually remove the unwanted objects.
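
For example, truncating everything before some cutoff commit might look like this (a rough sketch; the SHA is a placeholder, and note that this rewrites every commit ID):

    # make the cutoff commit the new root via a graft, then rewrite history
    echo "<cutoff-sha>" > .git/info/grafts
    git filter-branch -- --all
    rm .git/info/grafts

    # actually discard the old objects
    git reflog expire --expire=now --all
    git gc --aggressive --prune=now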

Given that git's strengths are distributed version control and complex patch/branch workflows (neither of which applies to snapshots or backups), I'd consider using a different VCS with a more malleable history.

Ben Jackson
  • This is a good point. If you want to keep your database history *forever*, Git will do just that. Our shop, for example, does daily dumps but only keeps the last 7 days, while keeping weekly dumps forever. – erjiang Nov 23 '10 at 22:21
  • I'd be interested in storing a MySQL snapshot at key points in the workflow. For instance, when developing on my local development box, it might make sense to take a MySQL snapshot when I push code to a central repo (which would, by construction, be consistent with the code). Any thoughts on that? – Christian Nally Nov 21 '12 at 18:06
  • @ChristianNally: Snapshots of a configured but empty database as a means of documentation or a shortcut for developers would make perfect sense. I think the original question was about backing up a *full* database. – Ben Jackson Nov 22 '12 at 06:59
  • @BenJackson I do like the idea of empty DBs as a way to store a schema, but I did intend to store data at key points. For the slam dunk in this thread, it would be great if we could get a (hook?) script going that would retrieve i) a specific Git checkin and ii) a corresponding DB snapshot from a different storage system. (see my recent comment above re: the difference when each new DB snapshot has a different filename... that could be a mitigating factor). – Christian Nally Mar 19 '13 at 19:04

This approach sounds fine to me. I use Git for backing up my own important data.

Note that you are not storing diffs -- Git effectively stores a snapshot of the directory state with each commit. You can generate the diff of any two commits, but the actual storage mechanism has nothing to do with diffs.
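
You can see this for yourself; assuming your dump is committed as db-dump.sql (a hypothetical filename):

    # every commit points at a complete blob for the dump, not a patch
    for c in $(git rev-list HEAD -- db-dump.sql); do
        git ls-tree "$c" -- db-dump.sql     # same path, a full blob each time
    done

    # delta compression happens later, inside packfiles:
    git gc
    git verify-pack -v .git/objects/pack/pack-*.idx | head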

cdhowie

In theory this will work, but you will start to have problems when the database dumps get large.

Git doesn't have any hard file size limits, but it will diff the contents of your latest dump against the one previously stored in the repository, which requires at least as much memory as the combined size of the two files -- so I would imagine it will start to get very slow, very quickly with files over 100MB (or even 10MB).

Git wasn't made for dealing with files of this type (i.e. big data files instead of source code), so I think this is fundamentally a bad idea. You could, however, use something like Dropbox to store the dumps - which will still save the version history for you, but is more tailored towards files which can't effectively be diffed.
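
That said, if you do use git for this anyway, one mitigation (a sketch, assuming your dumps are .sql files) is to tell git not to attempt delta compression on them at all, trading disk space for memory and repack speed:

    # .gitattributes -- don't try to delta-compress SQL dumps
    *.sql -delta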

Rob Golding
  • -1 Git does not store commits as diffs. As your argument relies on this premise, it is invalid. – cdhowie Nov 23 '10 at 22:26
  • He didn't claim git stores commits as diffs. He merely stated that git _does_ perform a diff every time you push, for example -- and these operations will be slow and consume large amounts of memory on files like this. – Core Xii Nov 23 '10 at 23:10
  • The text *"it will diff the contents of your latest dump with the one previously stored in the repository"* clearly indicates he means storage, not transfer -- since otherwise he would have mentioned that pushing multiple commits would be more efficient than pushing one at a time. I understand what he is trying to say, but due to delta compression, this really is not very accurate. I have dumps of data here where each commit represents about 1.2-1.7MB of data, with 123 commits, and the repo is 532KB. Remember that commits are delta-compressed against themselves too, not just prior commits. – cdhowie Nov 23 '10 at 23:28
  • I think that there may be a misunderstanding here depending on whether every DB snapshot's name is to include a timestamp. New DB snapshots with unique names wouldn't be diffed. But if the DB snapshot is stored with the same exact filename every time (db-snapshot.sql), then Git would start to get gnarly. – Christian Nally Mar 19 '13 at 19:01

If you're using MySQL (and possibly others) and have binary logging enabled, you might consider setting up a git repo for your binlog directory and developing a strategy to regularly commit updates to the binlog.

In MySQL, the binlog stores the queries that change data in any table in the database. If you sync up your commits with regular dumps of the database, you should have a versioned way to restore data.
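
A hedged sketch of what that could look like (the binlog path, database name, and log file names are assumptions about your setup):

    # rotate the binlog so we only commit closed log files
    mysqladmin --user=root --password flush-logs

    # commit the binlog directory (hypothetical path)
    cd /var/log/mysql
    git add mysql-bin.*
    git commit -m "binlog snapshot $(date +%F)"

    # point-in-time restore: load the last full dump, then replay the binlog
    mysql mydb < full-dump.sql
    mysqlbinlog --stop-datetime="2010-11-23 12:00:00" mysql-bin.000042 | mysql mydb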

Honestly, I think just using MySQL's native tools would probably be a better solution, but what I've outlined here gets you versioning your MySQL data, which is what I think you were after in the first place.

brycemcd