
I recently installed a cheap 2 TB hard drive in a server for backing up files which are also backed up elsewhere; it's basically an overflow drive. The other drives in the server are 1 TB each, configured as a RAID 6 array. This single drive I configured as RAID 0 just for convenience.

In essence I was moving about 700 GB of data from the RAID 6 array to the RAID 0 drive because the RAID 6 array was almost full. So ... 2 TB should be way more than enough, right?

The data is rsynced from a remote server, with 6 days of incremental backups handled in the standard 'hard link' manner, so that I am only storing/transferring changes rather than backing up the entire data set every day.

However, the behaviour I am seeing is that data which occupied around 700 GB on the RAID 6 array quickly balloons to almost fill the 2 TB drive, as if I were not using hard links at all.

Yesterday I deleted about 300GB of data which is no longer needed, and overnight the storage was back to 97% full.

Does anybody know what's going on? Is the drive really 'full', or is it just bad calculation of hard linking?

All drives are formatted as Ext4.

** Edit **

Details of backup process:

Each day a cronjob copies backup0 to backup1 using `cp -al backup0 backup1`. Previous backups are moved first with `mv backup1 backup2`, etc., before the rsync takes place.

`backup5` is deleted each day. After that happens, a remote server rsyncs to `backup0` (thus updating only changed files), giving 5 days of incremental backups. This is basically how software like 'backintime' works too.

** Second Edit **

I just deleted backup3 through backup5 and it freed up about two thirds of the drive. So the problem seems to be how storage is being calculated (I use `df -h` to monitor storage).

The question remains: will the drive be considered 'full' when it reaches 100%, even though there should be ample space?
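One quick way to check whether the increments really share storage: GNU `du` counts each hard-linked file only once within a single invocation, so the combined total for two increments should be barely larger than one increment alone (directory names below are placeholders):

```shell
# If backup0 and backup1 share data via hard links, the second line of
# output shows only directory overhead, not a second copy of the data.
du -sh backup0
du -sh backup0 backup1
# df, by contrast, reports block-level usage for the whole filesystem and
# never double-counts; if df fills up, the blocks really are in use.
```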

fred2
  • Did you check if hardlinks are created? How do you measure your filesystem utilization? You might want to add the output of `df -hTl` to your question. – Thomas Mar 17 '19 at 18:37
  • How are you 'copying' data to the drive? Is your method aware of hardlinks? – Michael Hampton Mar 17 '19 at 18:45
  • Each day a cronjob copies backup0 to backup1 using `cp -al backup0 backup1`. Previous backups are moved by `mv backup1 backup2`, etc, prior to an rsync taking place. `backup5` is deleted each day. After that happens, a remote server rsyncs to backup0 (thus updating only changed files). Thus, 5 days of incremental backups. This is basically how software like 'backintime' works too. – fred2 Mar 17 '19 at 19:06
  • Note my edits, especially second edit re 'space' freed up by deleting hard links. – fred2 Mar 17 '19 at 19:12
  • You really should use rsync for this. – Michael Hampton Mar 17 '19 at 20:17
  • I thought I was! The backup is done by rsync. I use a cron job to create the 5-day increments using hardlinks, not to create the backup. As I said ... this is the same approach used by some mainstream backup applications. – fred2 Mar 17 '19 at 20:31
  • Two ways on Linux systems to "run out" of disk space "early": the number of inodes available on the partition (investigate with `df -i`) and the disk space reserved for root (investigate with `sudo tune2fs -l` | grep Reserved). – Slartibartfast Mar 27 '19 at 16:46
  • Thanks for comment and +1 also for your username. Yours, Dentarthurdent. – fred2 Mar 27 '19 at 18:00

1 Answer


Using `cp -al` isn't necessary; just use `mv` and `rsync` with `--link-dest`.

See Admin Magazine's article: "Incremental backups on Linux":

"Most modern Linux distributions have a fairly recent rsync that includes the very useful option --link-dest=. This option allows rsync to compare the file copy to an existing directory structure and lets you tell rsync to copy only the changed files (an incremental backup) relative to the stated directory and to use hard links for other files."

That article shows a complete run-through of how the script below works and what it's doing; in particular, the inode numbers are the same in each backup (which is what saves the space):

"... notice that the inode number of the first file is the same in both backups, which means the file is really only stored once with a hard link to it, saving time, space, and money. Because of the hard link, no extra data is required. To better understand this, you can run the stat command against the files in the two backup directories."
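Following that suggestion, here is a quick check with `stat` (file and directory names are placeholders):

```shell
# Print inode number, hard-link count, and name for the same file in two
# increments; matching inode numbers mean the data is stored only once.
stat -c '%i %h %n' backup.0/somefile backup.1/somefile
```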

See GNU's cp command:

-a, --archive

Preserve as much as possible of the structure and attributes of the original files in the copy (but do not attempt to preserve internal directory structure; i.e., ‘ls -U’ may list the entries in a copied directory in a different order). Try to preserve SELinux security context and extended attributes (xattr), but ignore any failure to do that and print no corresponding diagnostic. Equivalent to -dR --preserve=all with the reduced diagnostics.

-l, --link

Make hard links instead of copies of non-directories.

and Samba.org's rsync command:

-a, --archive

Archive mode; equals -rlptgoD (no -H,-A,-X)

-H, --hard-links

Preserve hard links

-v, --verbose

Increase verbosity

Also see GNU's du command:

-h, --human-readable

Append a size letter to each size, such as ‘M’ for mebibytes. Powers of 1024 are used, not 1000; ‘M’ stands for 1,048,576 bytes. This option is equivalent to --block-size=human-readable. Use the --si option if you prefer powers of 1000.

-s, --summarize

Display only a total for each argument.

You'll want something like this:

rm -rf backup.3
mv backup.2 backup.3
mv backup.1 backup.2
mv backup.0 backup.1
rsync -avh --delete --link-dest=../backup.1 source_directory/ backup.0/

The first few times you run the script you'll see some errors because the backup.? directories don't exist yet, but once they're populated everything will be error free. Note that a relative --link-dest path is resolved against the destination directory, not your working directory. Use `du -sh` to check real disk usage — `du` counts each hard-linked file only once — and compare that to `ls`'s output, as in `ls -s`.

The question remains ... will the drive be considered 'full' even though there should be ample space, when it reaches "100%".

Assuming that the application you are running and the utility you use to check remaining drive space both make the correct system call, both will report the correct value. Anything close to zero counts as 'full': temporary files vary in size and are constantly being created and deleted on an active system, so never let a filesystem get close to zero free space, or you will likely see crashes and startup errors on reboot. Also note that ext4 reserves a portion of blocks (5% by default) for root, so unprivileged processes will start failing with 'no space left' before df shows 100%.
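Two related checks worth running on ext4 when df reports full earlier than expected, as noted in the comments above (mount point and device name are placeholders): inode exhaustion and the root-reserved blocks:

```shell
# IUse% near 100% means you are out of inodes even if free blocks remain.
df -i /path/to/backup/mount

# ext4 reserves blocks for root (5% by default); non-root writes fail
# before df shows 100%. Device name is a placeholder.
sudo tune2fs -l /dev/sdX1 | grep -i 'Reserved block count'
```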

Rob