
I have a couple of identical files stored in more than one place on my hard disk. I figure I can save a lot of disk space by hard-linking them to point to the same file. I am a little worried about possibly disastrous side effects.

I guess it does not affect permissions, as those are stored in the respective directories, just like the file name, right? (Update: Apparently, I guessed wrong, permissions are shared, as Carl demonstrates in his answer)

The biggest concern is that changing one file would inadvertently also change the other files. Read-only files should be safe. Writable files should also be okay, provided that applications write a new file rather than updating the existing one in place. I believe most applications work that way, but probably not all.

Is there anything else to consider?

I am on OS X / HFS+.
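One thing worth checking before and after linking is the inode metadata: two hard-linked names share the same inode number, and the link count goes up. A minimal Python sketch (the file names and temp directory are made up for the demo):

```python
import os
import tempfile

# Throwaway files in a temp directory; names are hypothetical.
d = tempfile.mkdtemp()
original = os.path.join(d, "file")
link = os.path.join(d, "link")

with open(original, "w") as f:
    f.write("same content\n")

os.link(original, link)  # create a hard link to the same inode

# Both names now refer to the same inode: st_ino matches,
# and st_nlink counts both directory entries.
a, b = os.stat(original), os.stat(link)
print(a.st_ino == b.st_ino)  # True
print(a.st_nlink)            # 2
```

`st_nlink` is also a cheap way to find out whether a file you are about to modify has other names pointing at it.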

studgeek
Thilo
    A hard link points two (or more) directory entries at the same physical blocks on disk. As you note, permissions -- which includes read/write permissions -- are stored in the directory, not the blocks on disk. I strongly suggest that you play around with links on files that don't matter, before you start changing important ones. I suspect that you'll realize that disk is cheap. – kdgregory Nov 12 '09 at 00:43
    Good idea mentioning OS and filesystem. Apparently, modern filesystems (zfs, btrfs) avoid your disk space issues by automatically storing identical block content on disk exactly once while maintaining the complete file semantics towards the userspace programs. – ndim Nov 12 '09 at 00:48
    (And neither zfs nor btrfs are available for OSX.) – ndim Nov 12 '09 at 00:49
    I was so looking forward to ZFS :-( Not only for deduplication of identical files, but also for differential storage of slightly changed files. – Thilo Nov 12 '09 at 00:52
  • "As you note, permissions -- which includes read/write permissions -- are stored in the directory, not the blocks on disk." But Carl in his answer demonstrated that permissions are also shared between all copies. – Thilo Nov 12 '09 at 00:58
    how is this programming related? I think it goes on serverfault. – Peter Recore Nov 12 '09 at 02:27
    @Peter: I am thinking to write a program to deduplicate ;-) – Thilo Nov 12 '09 at 02:41
  • Thilo: I use the `fdupes` program on my Linux systems to find, remove, and link together duplicated files all the time. The source code looks pretty clean, I hope it'd port to OS X cleanly. – sarnold Oct 05 '11 at 01:58

4 Answers


Don't use hard links if you want changes to one file not to be reflected in other files. That's the whole point of hard links - multiple directory entries for the same file (same blocks on disk). Changing permissions on one of the names of a hard link changes them on both:

$ touch file
$ ln file link
$ ls -l
total 0
-rw-r--r--  2 owner group  0 Nov 11 16:44 file
-rw-r--r--  2 owner group  0 Nov 11 16:44 link
$ chmod 444 file
$ ls -l
total 0
-r--r--r--  2 owner group  0 Nov 11 16:44 file
-r--r--r--  2 owner group  0 Nov 11 16:44 link

From the ln man page:

A hard link to a file is indistinguishable from the original directory entry; any changes to a file are effectively independent of the name used to reference the file.
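Whether a "change to one file" shows up under the other name depends entirely on how the application writes, as the comments discuss. A minimal sketch (hypothetical names, throwaway temp directory) contrasting an in-place update, which is visible through both names, with a write-temp-then-rename replace, which silently breaks the link:

```python
import os
import tempfile

d = tempfile.mkdtemp()
original = os.path.join(d, "file")
link = os.path.join(d, "link")
with open(original, "w") as f:
    f.write("v1\n")
os.link(original, link)

# 1. In-place update: writes through the shared inode,
#    so the change is visible under both names.
with open(original, "w") as f:
    f.write("v2\n")
print(open(link).read())  # "v2\n" -- the "other" file changed too

# 2. Atomic replace (write a temp file, then rename it over the
#    original): the name now points at a *new* inode, and the
#    hard link quietly keeps the old content.
tmp = os.path.join(d, "tmp")
with open(tmp, "w") as f:
    f.write("v3\n")
os.replace(tmp, original)
print(open(original).read())  # "v3\n"
print(open(link).read())      # still "v2\n" -- the link was left behind
print(os.stat(original).st_ino == os.stat(link).st_ino)  # False
```

So an application that saves by rename-over doesn't corrupt the linked copy; it just stops sharing it.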

Carl Norum
  • That was the main part of my question: Do applications really update files? Or just rewrite them? And are they generally hard-link aware, and will rewrite rather than update them, if there is more than one link to them? – Thilo Nov 12 '09 at 00:47
  • That's completely application dependent, but I expect almost completely opposite to what you think is happening. I would guess 99% or more of all applications modify existing files rather than deleting them and creating new ones. – Carl Norum Nov 12 '09 at 00:50
  • Maybe 99% is too much. But I certainly wouldn't count on a deletion/recreation in the general case. – Carl Norum Nov 12 '09 at 00:53
  • @Carl: Yes, I am mostly thinking about read-only files. Primarily, I want to dedupe Time Machine backups. The fact that the permissions are shared (as you have shown above) worries me a little. Why is that, by the way? Where are the permissions stored? I thought in the directory. – Thilo Nov 12 '09 at 01:00
  • They're stored in the inode. This link has more: http://docstore.mik.ua/orelly/networking/puis/ch05_01.htm – Carl Norum Nov 12 '09 at 01:32
  • Doesn't Time Machine already use hard links where possible to avoid duplicating file blocks? – mipadi Nov 12 '09 at 02:41
  • Is that a legitimate online version of O'Reilly books? – Thilo Nov 12 '09 at 02:42
  • @mipadi: TM uses hard links only if the same file (path) has not changed from the previous version. It does not work if you just happen to store the same content in two different locations. It also does not work across machines (if you back up two machines to the backup disk). – Thilo Nov 12 '09 at 02:46
  • @Thilo, I have no idea. I just googled it. – Carl Norum Nov 12 '09 at 05:50

I wrote a little script to do just this. I'd only be concerned about permissions if your backup spans multiple users or system files.

I had a bunch of old backups on CDs and DVDs, many of which contained a lot of redundant data. Rather than sift through it all and delete the duplicates, I took the Time Machine route and made hard links between all the matching files (truly matching content; I took a SHA-1 checksum of each).

Now all my backup volumes look just like they would otherwise, and most of the redundant files are history. The one hiccup is that a lot of media files store metadata in the file contents, so each version is slightly different. See this article for the Python code. No warranties!

Make sure you run `mdimport your_backup_dir/` afterwards: Spotlight and Finder get a bit flustered by massive data manipulations. I de-duplicated my 240 GB backup folder in this manner and it took about 45 minutes.

Also note that most OS X apps will break your hard links and save to a new inode; most UNIX-y apps will probably preserve the hard links (except Emacs, I hear).
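The hash-then-link approach above can be sketched in a few lines of Python. This is not the linked article's code, just a minimal illustration of the idea; the `dedupe` function name is made up, and you should try it on a copy of your data first:

```python
import hashlib
import os
import tempfile

def dedupe(root):
    """Hard-link files under `root` that have identical content.
    A sketch only -- run it on a copy of your data first."""
    seen = {}  # SHA-1 digest -> first path seen with that content
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            first = seen.setdefault(digest, path)
            if first != path and os.stat(first).st_ino != os.stat(path).st_ino:
                os.unlink(path)       # drop the duplicate name...
                os.link(first, path)  # ...and point it at the original's inode

# Tiny demonstration on throwaway files.
root = tempfile.mkdtemp()
for name, content in [("a.txt", b"same"), ("b.txt", b"same"), ("c.txt", b"other")]:
    with open(os.path.join(root, name), "wb") as f:
        f.write(content)
dedupe(root)
print(os.stat(os.path.join(root, "a.txt")).st_ino ==
      os.stat(os.path.join(root, "b.txt")).st_ino)  # True
```

A real tool would hash in chunks rather than reading whole files, skip cross-device paths (hard links can't span filesystems), and compare bytes on hash collisions.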

andyvanee

If your primary goal is to "dedupe Time Machine backups" as you mention in one of the comments, then another option that avoids some of your concerns would be to eliminate the dupes from Time Machine using the Time Machine preferences. You can exclude at the directory or file level.

studgeek

Hard links are not generally a best practice. Plain old soft/symbolic links (`ln -s`) should serve just as well.

bmargulies
    I figure that for files that can change, soft-links are even worse, because then the change is reflected in all copies. Also, if the target of a soft link gets deleted, the data is lost (does not happen with a hard link) – Thilo Nov 12 '09 at 00:45
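The deletion difference Thilo mentions is easy to demonstrate. A minimal sketch (hypothetical names, throwaway temp directory): deleting the original name leaves a hard link's data intact, but turns a symlink into a dangling pointer:

```python
import os
import tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, "target")
hard = os.path.join(d, "hard")
soft = os.path.join(d, "soft")

with open(target, "w") as f:
    f.write("data\n")
os.link(target, hard)     # hard link: a second name for the same inode
os.symlink(target, soft)  # soft link: a pointer to the *name* "target"

os.unlink(target)  # delete the original name

print(open(hard).read())     # "data\n" -- the inode survives
print(os.path.exists(soft))  # False -- the symlink now dangles
```

The inode's data is only freed once its link count drops to zero, so the hard link keeps the content alive; the symlink never held the data at all.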