
I have a problem concerning backups. I have a network (built from 150 Mbps wireless and Gigabit Ethernet) consisting of at least 3 computers (plus maybe 2 remote ones).

I plan to build a Linux server {pretty powerful} which will handle (pretty much):

  • Media center (recording / playback)
  • FTP server to serve files on my network
  • Other services for developing applications (MySQL, Apache, ...)
  • BACKUPS

Concerning the BACKUPS aspect, the machines to be backed up are running:

  • 3 x Linux >= 2.6.30 (Gentoo and Arch Linux)
  • 1 x Windows XP 32-bit
  • 3 x Windows 7 64-bit
  • 1 x Windows 7 32-bit

The backup might be performed using an SMB file share {I haven't had much luck with it lately}, rsync, SVN, tar, or anything else (or any combination) you might suggest. The functionalities are (in order of priority):

  • Revisions (SVN-style): a file has to be backed up each time it gets modified (and multiple versions of the same file can exist on the server; in fact they must)
  • Scalability: if I attach a USB drive to a computer, I want its data to be backed up as well (on Linux that might be quite easy: simply back up everything under /media/ except CDs and DVDs, but what about Windows?)
  • Near real-time (~5 minutes at most) file backup: I once lost a LaTeX report and it was hard to reconstruct it from scratch
  • No duplication: for instance, if I back up the USB disk's content from 2 different computers, I do not want the data to be backed up twice (a symlink instead of a full copy in the worst case)
  • Manual restore / automatic restore: it's all the same to me (just not like what is described below)
  • I do not want to look in 1000 folders to find the same directory structure each time with only 10 files in it (I prefer to look in ONE directory holding all the latest files in the file-system structure, like /media/BACKUPS/PC01/home//... )
  • Maybe ability to remove / exclude large files from backups
  • Good logs

Server specs:

  • 2 x 2 TB hard disks used for backups (in fact one is used for backups; the other will be rsynced from the first {I'd rather not use RAID 1}, just in case...) -- see the sketch after this list
  • 4 to 8 GB of DDR3 RAM
  • At least 4 cores (AMD Athlon II X4 640 @ 3.0 GHz) -> upgradeable to Bulldozer later
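
For the disk-to-disk copy I have in mind something as simple as this nightly job (a sketch only; the mount points are placeholders):

    #!/bin/bash
    # Mirror the primary backup disk onto the second one.
    # /media/disk1 and /media/disk2 are placeholder mount points.
    rsync -aHAX --delete /media/disk1/ /media/disk2/ \
        >> /var/log/backup-mirror.log 2>&1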

What I had already considered (and might consider again if you point out some interesting characteristic):

  • BackupPC
  • rsync (problem: no versioning out of the box, though hardlinked snapshots via --link-dest can approximate it -- see the sketch after this list; the Windows client might be buggy)
  • SVN (problem: 2x overhead - files are stored twice, thus 2x disk usage)
  • Amanda / Bacula (I haven't really understood what they can and can't do)
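
For what it's worth, here is roughly what a hardlinked-snapshot rsync run looks like (a sketch only; paths are placeholders and this is untested):

    #!/bin/bash
    # Hardlinked-snapshot sketch: each run creates a dated snapshot;
    # unchanged files become hardlinks into the previous snapshot,
    # so only changed files cost extra space (crude versioning).
    SRC=/home/user/                 # placeholder source
    DEST=/media/disk1/snapshots     # placeholder snapshot root
    NEW="$DEST/$(date +%Y-%m-%d_%H%M)"
    # On the very first run rsync just warns that 'latest' does not exist yet.
    rsync -a --delete --link-dest="$DEST/latest" "$SRC" "$NEW"
    ln -snf "$NEW" "$DEST/latest"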

I know a bit of Bash and Python programming on the server side. I might eventually even build a web interface using Apache / PHP / MySQL. All I need to know is the best components to use to achieve this (i.e. which backup software on the server, which protocol, which client, and which features to implement accordingly).

user76949

2 Answers

1

You can do very well with Bacula/Amanda. Hitting your requirements:

Revisions (SVN-style): a file has to be backed up each time it gets modified (and multiple versions of the same file can exist on the server; in fact they must)
Bacula and Amanda will grab a file each time it changes (more precisely, each time a backup runs and the file has changed since the previous run).

Scalability: if I attach a USB drive to a computer, I want its data to be backed up as well (on Linux that might be quite easy: simply back up everything under /media/ except CDs and DVDs, but what about Windows?)
Not bad on Unix (just back up everything under / and it will grab the media), but probably not possible on Windows -- I believe you need to specify the drives you want to grab, because the filesystem isn't a single tree under one root (there's a root for each drive).
That said, it's probably NOT a good idea (What if you attach a full 1TB drive to a machine being backed up? Your backups just ballooned).

Near real-time (~5 minutes at most) file backup: I once lost a LaTeX report and it was hard to reconstruct it from scratch
Not happening -- you CAN specify a 5-minute backup window, but your logs will fill up with jobs being killed because a duplicate is already running.
You can schedule nightly backups, or even every 12 hours without much trouble.
(Even Apple's Time Machine only does hourly backups... think about the largest file that may change and have to be shoved over the wire...)
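
(If you go the scheduled route, a plain cron entry is enough to drive things; backup-job.sh below is a hypothetical wrapper around whatever client you pick:)

    # Hypothetical crontab entry: run the backup wrapper at 03:00 and 15:00.
    0 3,15 * * * /usr/local/sbin/backup-job.sh >> /var/log/backup-job.log 2>&1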

No duplication: for instance, if I back up the USB disk's content from 2 different computers, I do not want the data to be backed up twice (a symlink instead of a full copy in the worst case)
Bacula doesn't have deduplication at this time. Not sure about Amanda.

Manual restore / automatic restore: it's all the same to me (just not like what is described below)
Restores are (and should be) a manual process. I have no idea what an "automatic restore" would look like (the backup server decides on its own to restore a file? :)

Maybe ability to remove / exclude large files from backups
You can include or exclude specific parts of the filesystem (down to file-level granularity) in Bacula.

Good logs
Database-backed lists of jobs and results, with the ability to write to log files, email, etc. in the event of errors.


BackupPC may also be able to hit these requirements (not certain - haven't used it) - other commercial backup solutions almost certainly can as well.
You may also want to consider tarsnap, though I'm not sure how the Windows support is.

voretaq7
  • Thank you for the quick answer. What about restoring? I do not want to look for each file / folder in a separate subfolder archived by date (e.g. yesterday's files under /media/backups/02_04_2011/, the day before's under /media/backups/01_04_2011, and so on). Do Bacula / Amanda provide that? (In the other case, I just don't want to restore the full backup plus the other 1000 incremental backups.) – user76949 Apr 03 '11 at 06:53
  • See the Bacula/Amanda docs (Bacula.org or Amanda.org) for info on how restores work - Short answer: A full restore will require a full & all incrementals. Restoring a file from a specific incremental backup only requires that backup. – voretaq7 Apr 03 '11 at 15:13
  • I think it might not really suit my needs (at least for now). Thank you for your time. – user76949 Apr 05 '11 at 18:54
0

Revisions (SVN-style): a file has to be backed up each time it gets modified (and multiple versions of the same file can exist on the server; in fact they must)

What are the files in question? Are they users' data files, or system configuration files? For the former, Dropbox. The only other alternative I see is to roll your own Dropbox-like service. For the latter, consider moving to a configuration management system like Puppet, put the system's files into a version-controlled repository of your liking, and back up the repository however you like.

Regular backup systems will only grab files when they run (daily, multiple times daily, etc.), not whenever they change.
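
(If you do roll your own, a rough Linux-only sketch using inotify-tools looks something like this -- the paths are placeholders, and it's nowhere near as robust as Dropbox:)

    #!/bin/bash
    # Rough near-real-time push (Linux only; requires inotify-tools).
    # Blocks until something changes under $WATCH, then rsyncs to the server.
    WATCH=/home/user/docs                          # placeholder
    DEST=backup@server:/media/disk1/data/pc01/     # placeholder
    while inotifywait -r -e close_write,create,delete,move "$WATCH"; do
        rsync -az --delete "$WATCH/" "$DEST"
    done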

Near real-time (~5 minutes at most) file backup: I once lost a LaTeX report and it was hard to reconstruct it from scratch

Dropbox or similar. No other option I can see.

No duplication: for instance, if I back up the USB disk's content from 2 different computers, I do not want the data to be backed up twice (a symlink instead of a full copy in the worst case)

BackupPC can do de-duplication. Amanda can't, as far as I know. But depending on what you're trying to avoid duplicating, there may be another route. If I backed up all my compute nodes at work, for example, I'd have tons of duplication. But I don't back them up at all -- I can rebuild one from scratch within an hour or so with a combination of Debian unattended-installation features and Puppet.

I do not want to look in 1000 folders to find the same directory structure each time with only 10 files in it (I prefer to look in ONE directory holding all the latest files in the file-system structure, like /media/BACKUPS/PC01/home//... )

Amanda, at least, isn't built like rsync. It backs up volumes (either partitions or folders) on a regular basis into backup files. You can browse through the backups with amrecover and restore whatever files you want, but each volume's backups are stored as dump files, tar files, or similar.
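
(A typical interactive restore session looks roughly like this; the config name, host and paths are placeholders:)

    # Run as root on the client; "DailySet1" is a placeholder config name.
    # amrecover DailySet1
    amrecover> sethost pc01       # whose backups to browse
    amrecover> setdisk /home      # which backed-up volume
    amrecover> cd user/reports
    amrecover> add report.tex     # mark file(s) for restore
    amrecover> extract            # restore into the current directory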


There's a lot of room in your question for further clarifications. The main questions I'd ask are:

  1. Is this backup for disaster recovery, or for longer-term archival purposes?
  2. What are you backing up, and why?
  3. How much effort are you willing to go to, and what can you live without? If the minimum standard of success is "back up every change to every file on every drive on every computer on every OS, in near real-time, including if I connect and disconnect removable drives on Windows", you're likely to be disappointed.
Mike Renfro
  • 1. What do you mean exactly? I think it's for disaster recovery, or for about 6 months (at most) of archival purposes. 2. I'm backing up all files inside my /home folder (mainly C++ / Java programming), reports (LaTeX) and some images. All the big stuff (like movies) will reside on the server on a separate drive (not really good for redundancy, but whatever... do you have another suggestion here?). 3. OK, maybe I exaggerated at first in claiming that all USB drives would be backed up. Anyway, I only mount some devices under Linux, so I don't have this problem. – user76949 Apr 03 '11 at 06:58
  • I'd say the minimum is a backup every 24 hours while the computer is up and running (which is NOT always the case, because the computer is in my room and I can't really sleep with it on every night). Let's say a 24-hour backup interval: when the computer comes up after 24 hours, a backup is run. – user76949 Apr 03 '11 at 07:03
  • If you're focused on backing up your home directory (or particular folders in it), I'd be looking at either Dropbox, rolling your own Dropbox-like service, or putting those folders under version control. You're not focused on rebuilding the whole computer if the hard drive fails (that's what I call disaster recovery). You can store a ton of LaTeX and code in Dropbox's free quota, or pay a bit to get up to 100 GB of storage. If you don't want that, and really can live without versioning, then having a remote backup server send a wake-on-LAN packet to your desktop and back it up is fine, too (see the sketch after these comments). – Mike Renfro Apr 03 '11 at 17:40
  • Versioning should be fine. The only problems I see (with SVN at least) might be: 1. SVN creates a bunch of .svn subfolders, making it quite heavy on the disk. 2. *If* I create my Dropbox-like service, is restoring ALL the data from my home directory (except Videos, which will not be backed up) possible and easy to do? 3. It may be difficult to select which folders to back up and which not. Can I actually create a Dropbox-like service that is compatible with Linux and Windows to do that? Mainly I would like to back up all of my home folder except some folders that already exist. Is that possible? – user76949 Apr 05 '11 at 03:02
  • The statement "As it stands the system uses the source system as the preferred environment, so any files that change, or are added or removed, will be processed on the remote system" worries me a bit. Does it mean that if I delete a file on the client computer, it will also be removed by my Dropbox-like service? Thank you very much. – user76949 Apr 05 '11 at 03:12
  • By nature, SVN has to keep some other files around on the client system, but I don't think it's as heavy as you think. I have a set of LaTeX files for a thesis style and examples, and the raw content is about 4 MB with a similar amount in the .svn folders on a working copy. But the *entire* remote repository, with all the changes I've checked in over a period of 3 years, is only around 6 MB. And since I back up the *repository*, not the working copies, 6 MB of backup space (less if compressed) gets me the ability to roll back to any given changeset over a 3 year period. – Mike Renfro Apr 05 '11 at 13:38
  • As for Dropbox and similar services, if you've not tried regular Dropbox, do so. See if it works for you. I don't have direct experience with rolling my own, but the original Dropbox service works fine for keeping multiple systems synced to one copy of my working files, including deletions, and lets me go back to older versions if needed. – Mike Renfro Apr 05 '11 at 13:41
  • I tried Dropbox a little but I think I have to try it a bit more. My main concern is not about syncing anyway, it's backup. The way this is going is: Backup -> Syncing -> Shared folders on the server, which is not really bad (considering I would back up the same data anyway). At this point a shared-folder system using SMB and sshfs over FUSE might be better. It makes no real difference to me whether I keep the data on the server, as long as I don't lose it. Considering I own 3 computers and 6 operating systems (3 of which I use often), this might be an even more interesting solution. – user76949 Apr 05 '11 at 15:50
  • The server will surely be Linux >2.6.30, probably Debian 6.0 or (less probably) Gentoo. I might even reverse the roles: keep the data on the server, which is on 24/7, with a hardware backup (2nd disk) and file versioning (what about doing that locally on the server, without network backup?), then sync a third backup to a third disk on the big desktop once a week. Do you have a suggestion here? (Btrfs is a bit unstable nowadays, and I don't really see any other versioned filesystem for Linux.) I keep my work under /home/data and then SVN it (every 2 hours, for instance?) to /media/backup? – user76949 Apr 05 '11 at 15:56
  • If you keep data on the server under version control (SVN or similar), then your problem just comes down to how to back up the SVN repository on the server. That removes any versioning requirements from the filesystem, and you can use whatever filesystem you want. See [this question](http://stackoverflow.com/questions/33055/what-is-the-best-way-to-backup-subversion-repositories) for how to back up a SVN. I use FSFS for my SVN backend, so it's easy. – Mike Renfro Apr 05 '11 at 17:55
  • I looked a bit at the FSFS webpage and don't really understand if it's what's best for me (I can't really understand what it does). Anyway, to back up the SVN repository on the server, the "ugly" way could be to rsync the repository to a local (and remote) copy. Or otherwise use svnadmin dump or hotcopy, maybe even svnsync (see the sketch after these comments). – user76949 Apr 05 '11 at 18:43
  • The only thing I wanted that *might* not be included in SVN (as far as my wish list goes) is the ability to remove old files, or files that have been moved. Just for clarification, this is the plan: - Have all data on /media/disk1/data, accessible through SMB, sshfs and NFS - Have all data backed up in an SVN repository on /media/disk1/backup, accessible through SVN (this is only for revision support, in case the disk crashes...) - Have on /media/disk2 an exact copy of /media/disk1 (but via rsync, preferably, not software RAID 1) – user76949 Apr 05 '11 at 18:48
  • Should this work (though the amount of work would "only" be developing a small web app to keep track of changes, logs, restores, ...) or do you have a better idea? I forgot to mention that I want to put all my data there (photos, documents, ... [ARCHIVE things]), let's say 50 GB. Could there be any problems? Thank you very much for your advice. I really appreciate it! – user76949 Apr 05 '11 at 18:49
  • If you truly want to purge a file so that there was no record of its existence, that file should have never gone into SVN at all. Other version control systems may be easier in that respect, but I've never used them. You have a few specific questions you can now ask separately, or else look for related questions. They probably deserve their own topics at this point. First, how to set up a SVN repository. Second, how to back up data from one disk to another (rsync, Amanda, Bacula, etc.) Finally, I don't think the webapp is necessary until you've seen what's already included with Bacula, etc. – Mike Renfro Apr 06 '11 at 00:40
  • The problem with Bacula and Amanda seems to be file restore: I'd have to restore the full backup plus all the incremental backups, and I can't really keep track of changes (not in a visible way, anyway). At least with SVN a checkout restores the whole repository (in the worst-case scenario). Anyway, I can use SVN to manage revisions, so in case I lose a recent copy of a file I don't have to start from scratch. On the other hand, for disaster recovery I have the 2nd disk plus the big desktop's backup. But probably at this point software RAID 1 makes more sense. The big downside of this is that – user76949 Apr 06 '11 at 02:26
  • it will use twice the disk space, but all files are editable on a separate computer and shared across the network, which is fine for my purposes. SVN also seems to support the delete command, but that doesn't make much sense, since SVN is meant to keep an archive of changes: http://www.abbeyworkshop.com/howto/misc/svn01/. – user76949 Apr 06 '11 at 02:30
  • Otherwise, keeping a local SVN folder on each computer, syncing it to the server and resyncing back to all clients might also be a possibility (for redundancy), as it also allows recording which user / computer changed a file. Obviously all Linux-specific configurations (.gnome, .whatever) will be excluded from this, maybe backed up to a specific (other) repository. Thank you for your suggestions; they're definitely helping me open my mind. What about dual boot (on the same computer)? Should I sync the repository to Windows, then to Linux, or is that not necessary (maybe I just need to change the username)? – user76949 Apr 06 '11 at 18:46
  • I'd prefer not to check out ~50 GB on Windows and then again on Linux. Once a file has been checked out by SVN on Linux to a specific folder, can I recommit and then check out from Windows (in the same folder) without pulling down the (same) 50 GB? Can I change the username (just to keep track of who made the changes)? – user76949 Apr 06 '11 at 18:48
  • I strongly suggest starting a new question for each particular topic at this point. I'm not sure what your experience is with SVN, backups, or any of this, and it's getting difficult to keep a conversation going a few hundred characters at a time (especially if nobody else is paying attention). – Mike Renfro Apr 06 '11 at 20:09
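
For reference, two quick sketches of the commands mentioned in the comments above.

Waking the desktop for its backup window can be as simple as this (assuming the wakeonlan utility; the MAC address and script path are placeholders):

    # Wake the desktop, give it time to boot, then kick off the backup.
    wakeonlan 00:11:22:33:44:55 && sleep 120 && /usr/local/sbin/backup-job.sh

And backing up the SVN repository on the server could look roughly like this (placeholder paths; svnadmin dump and hotcopy are the standard tools):

    # Full dump to a compressed file on the second disk...
    svnadmin dump /media/disk1/backup/repo | gzip > /media/disk2/repo-$(date +%F).svndump.gz
    # ...or a consistent copy of the repository (the destination should not already exist).
    svnadmin hotcopy /media/disk1/backup/repo /media/disk2/repo-copy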