I have an offsite backup solution, written in C++, that breaks files into blocks and keeps track of the blocks using MD5 hashes in a SQLite3 database. It transfers the blocks, along with the database, to a remote site.

So, when I want to do a restore, it queries the SQLite3 database and restores the blocks accordingly.

When the first backup runs, it creates a big table called base_backup. Every subsequent file change or new file is added as a new record in a new table. If I want to do a restore, I query the base_backup table plus all the differences and restore the files.

When the backup runs, it scans all the files in a given folder and checks each file's archive bit. If the bit is cleared, it checks whether a record for the file already exists in the database and then decides whether to back the file up.

Now to my question: if a file is deleted on the local computer, how do I keep track of that and update the offsite backup accordingly? When I do a restore, I don't want to restore all the garbage files. Is there any way of knowing whether files have been deleted from a folder? I do not want to run a verify check against the database, since that would take too long.

  • Any particular reason you're not using existing backup software? – Andrew Marshall Mar 25 '11 at 14:55
  • Like, s/he is implementing his/her own product? – Alexander Poluektov Mar 25 '11 at 15:07
  • Are you able to quickly check what files you have backed up in the past from a given directory? – Jonathan Mar 25 '11 at 15:37
  • Yes, as Alexander said, we are trying to develop an in-house product to sell to our clients. We currently use Retrospect (EMC2), but it only does local backup, and our product is aimed at offsite backup. So, after the local backup, the program exports the data to a central server. – roymustang86 Mar 28 '11 at 13:26
  • @Jonathan, it is not that quick. There are sometimes 90GB worth of data to be backed up the first time, and that's 100K+ files. We can query the database to see whether a file is backed up, but that takes more time. – roymustang86 Mar 28 '11 at 13:27
  • @roymustang86 Okay. I was hoping you could make a fast query for a single directory, so that you could quickly compare the set of backed-up file entries to the set of actual files in the system. – Jonathan Mar 28 '11 at 13:31

3 Answers

inotify with IN_DELETE?

Alexander Poluektov
  • That only works if his backup program is running while the file's deleted. – Gabe Mar 25 '11 at 15:06
  • Right. Need some monitoring process to be running. – Alexander Poluektov Mar 25 '11 at 15:22
  • Plus, sounds like Windows (there was mention of an "archive bit"). – Jonathan Mar 25 '11 at 15:35
  • I am currently using the archive bit. But you can only check the archive bit if the file exists. If the file has been deleted, there is no way of telling. And yes, I don't want the process to be constantly monitoring the folder. I was reading something about the NTFS journal; does anyone know how to read/decipher it? – roymustang86 Mar 28 '11 at 13:24

Create a service to monitor the directory (use FindFirstChangeNotification or ReadDirectoryChangesW).

João Augusto

You could add a new piece of information to your database which lists which files existed during the last backup. Then, even if a file had not changed, a new (small) entry would be made during the backup, indicating that it still existed.

When restoring a backup from a given date in the past, only select the files which have entries showing that they existed during the most recent backup on or before that date.

For example, a pair of tables like this might work:

Path(text)    BackupIndex(int)
path/to/file1  1
path/to/file2  1
path/to/file1  2

Notice that path/to/file2 does not appear in backup #2, as it was not in the directory during the backup (it must have been deleted).

BackupIndex(int)    Timestamp(timestamp)
1                   2011-03-12 7:42:31 UTC
2                   2011-03-20 8:21:56 UTC

If somebody wants to restore the files as they existed on March 15th, you look at the table of backup indices, see that backup #1 was the most recent one at that date, and look up all paths that existed in backup #1 from the paths table.
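The lookup described above could be sketched roughly as follows. This is only an illustration in Python with the standard sqlite3 module, not the actual backup program: the table names (backups, paths), the column names, and the helper functions make_demo_db and paths_to_restore are all made up for the example, and timestamps are assumed to be stored as zero-padded UTC text so that plain string comparison orders them correctly.

```python
import sqlite3

def make_demo_db():
    """Build an in-memory database holding the two example tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE backups (
            backup_index INTEGER PRIMARY KEY,
            taken_at     TEXT NOT NULL      -- UTC, 'YYYY-MM-DD HH:MM:SS', zero-padded
        );
        CREATE TABLE paths (
            path         TEXT NOT NULL,
            backup_index INTEGER NOT NULL REFERENCES backups(backup_index),
            PRIMARY KEY (path, backup_index)
        );
    """)
    conn.executemany(
        "INSERT INTO backups VALUES (?, ?)",
        [(1, "2011-03-12 07:42:31"), (2, "2011-03-20 08:21:56")],
    )
    # path/to/file2 has no row for backup 2: it was gone when that backup ran.
    conn.executemany(
        "INSERT INTO paths VALUES (?, ?)",
        [("path/to/file1", 1), ("path/to/file2", 1), ("path/to/file1", 2)],
    )
    return conn

def paths_to_restore(conn, as_of):
    """Return the paths recorded in the most recent backup taken on or before as_of."""
    row = conn.execute(
        "SELECT backup_index FROM backups "
        "WHERE taken_at <= ? ORDER BY taken_at DESC LIMIT 1",
        (as_of,),
    ).fetchone()
    if row is None:
        return []  # no backup existed yet at that date
    rows = conn.execute(
        "SELECT path FROM paths WHERE backup_index = ? ORDER BY path",
        (row[0],),
    )
    return [path for (path,) in rows]

conn = make_demo_db()
print(paths_to_restore(conn, "2011-03-15 00:00:00"))  # ['path/to/file1', 'path/to/file2']
print(paths_to_restore(conn, "2011-03-21 00:00:00"))  # ['path/to/file1']
```

The second call shows the effect you are after: path/to/file2 is simply never selected for a restore dated after backup #2, without any deletion having been detected at backup time.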

So basically, you are pushing off deciding whether a file was deleted onto the restore operation, rather than the backup operation.
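If you ever do want an explicit list of deletions (for example, to prune blocks on the remote site), it can be derived from the same per-backup path records by comparing two backups. The following is a rough sketch in Python with the standard sqlite3 module; the paths table layout and the deleted_between helper are assumptions made for this example, not part of any existing product.

```python
import sqlite3

def deleted_between(conn, older, newer):
    """Paths recorded in backup `older` that have no record in backup `newer`."""
    rows = conn.execute(
        "SELECT path FROM paths WHERE backup_index = ? "
        "EXCEPT "
        "SELECT path FROM paths WHERE backup_index = ?",
        (older, newer),
    )
    return sorted(path for (path,) in rows)

# Hypothetical data matching the example tables in this answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paths (path TEXT NOT NULL, backup_index INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO paths VALUES (?, ?)",
    [("path/to/file1", 1), ("path/to/file2", 1), ("path/to/file1", 2)],
)

deleted = deleted_between(conn, 1, 2)
print(deleted)  # ['path/to/file2']
```

This only touches the database, so no file system scan or monitoring process is needed to work out what disappeared between two runs.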

Jonathan