
I've got shared hosting with a few thousand WordPress installs, and for ages I've wanted a sensible, secure way of removing all the duplicate files. I'm looking for better disk cache hit ratios and simpler backups.

I'm just using standard ext4, not something like ZFS, which has deduplication built in (at a cost).

A tool like rdfind is almost perfect: it can scan all the files, find the duplicates, and hard link them together. I could run it from a weekly cron job at off-peak times, making the cost virtually zero.
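A weekly pass could be as simple as a crontab entry like the one below (the path, schedule, and log location are assumptions for illustration; `-makehardlinks true` is rdfind's option for replacing duplicates with hard links):

```shell
# m h dom mon dow  command -- run Sundays at 03:30 (example off-peak schedule)
# A pass with "-dryrun true" first will report what would change
# without touching any files.
30 3 * * 0  rdfind -makehardlinks true /var/www > /var/log/rdfind.log 2>&1
```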

The problem is that I want a single account changing a file to break the hard link and get its own copy of the file again. That way one site updating WordPress or a plugin wouldn't affect any other sites. It would also remove a potential security issue, since no account would be able to tamper with another account's files. Essentially copy-on-write for hard links.
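The core problem can be shown with plain coreutils in a throwaway directory: a hard link is just a second name for the same inode, so an in-place write through either name is visible through both.

```shell
# Two names, one inode: an in-place write through one name changes the
# data seen through the other -- exactly what must not happen when one
# hosting account edits "its" copy of a shared file.
cd "$(mktemp -d)"
printf 'v1\n' > site-a.txt
ln site-a.txt site-b.txt      # hard link: same inode, second name
printf 'v2\n' > site-a.txt    # in-place truncate+write through one name...
cat site-b.txt                # ...is visible through the other: prints "v2"
```

Note that applications which save by writing a temp file and `rename()`-ing it over the original *do* break the hard link, but any in-place edit does not, which is why hard links alone aren't safe here.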

Is anything like this possible? I've tried doing some searches but I haven't been able to find anything.

Nick
  • How about OverlayFS – kofemann Jan 05 '21 at 22:55
  • How about Btrfs and reflinks? From the `cp` man page: `When --reflink[=always] is specified, perform a lightweight copy, where the data blocks are copied only when modified. If this is not possible the copy fails, or if --reflink=auto is specified, fall back to a standard copy.` There are also tools to convert duplicate files (or portions) into reflinks on Btrfs. – bitinerant Jan 06 '21 at 02:00
  • @kofemann OverlayFS would be nice if I had absolute control over every site, but unfortunately I don't and I don't think that would easily let me re-deduplicate files if Wordpress upgraded or something like that. – Nick Jan 06 '21 at 22:12
  • @bitinerant reflinks almost sound ideal, except I wouldn't be using cp so a tool like rdfind would have to be run periodically to create the reflinks. I'm not sure such a tool exists currently? – Nick Jan 06 '21 at 22:15
  • @Nick - you certainly don't need to use `cp --reflink ...` to benefit from Btrfs reflinks. Tools like [duperemove](https://github.com/markfasheh/duperemove) and [dduper](https://github.com/Lakshmipathi/dduper) can scan for duplicate data and create reflinks in the background while you work. – bitinerant Jan 06 '21 at 22:52

1 Answer


It looks like the best solution for efficient 'offline' deduplication is Btrfs reflinks.

That keeps the links 'destructible' if something tries to change a file (e.g. a WordPress update), so the security and ease of use of the platform are maintained.

Thanks @bitinerant for pointing that option out. I'll be doing further experiments to see whether migrating is worth it for my particular scenario. The fact that ext4 can be converted to Btrfs in place makes it far more feasible than ZFS or similar.
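For the record, a periodic dedup pass on Btrfs with duperemove might look like this (the target path and hashfile location are assumptions for illustration):

```shell
# -d          actually submit duplicates to the kernel dedupe ioctl
#             (without it, duperemove only reports what it found)
# -r          recurse into subdirectories
# --hashfile  cache block checksums on disk so later runs only
#             rescan files that have changed
duperemove -dr --hashfile=/var/cache/duperemove.hash /var/www
```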

Nick