I am writing a C program that calculates the total size of the files in a given directory. I know that each file points to an inode, so I am planning to use stat to find the inode number and the file size. Since I want to avoid counting a file more than once when multiple hard links and/or symlinks lead to the same inode, I want to store the inodes I have already seen in an array. The problem is that, to check whether a given file's inode is new, I would have to iterate through the inode array every time, giving a runtime of roughly O(n^2). I want to avoid overly complex structures such as red-black trees. Is there a faster, more clever way to implement this? I know there are system tools that do this, and I would like to know how they implement something like this.
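For reference, this is roughly the linear-scan version I have in mind (a simplified sketch; the fixed array size and the names are just illustrative):

```c
/* Simplified sketch of the linear-scan approach described above:
 * stat() each file, then scan the array of inodes already seen. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

#define MAX_FILES 4096            /* illustrative fixed capacity */

static ino_t  seen[MAX_FILES];
static size_t seen_count;

/* Returns 1 if the inode was already recorded, 0 otherwise (O(n) scan). */
static int inode_seen(ino_t ino)
{
    for (size_t i = 0; i < seen_count; i++)
        if (seen[i] == ino)
            return 1;
    return 0;
}

/* Adds the file's size to *total unless its inode was counted before. */
static int add_file(const char *path, off_t *total)
{
    struct stat sb;

    if (stat(path, &sb) == -1) {
        perror(path);
        return -1;
    }
    if (!inode_seen(sb.st_ino)) {
        if (seen_count < MAX_FILES)   /* sketch: no growth handling */
            seen[seen_count++] = sb.st_ino;
        *total += sb.st_size;
    }
    return 0;
}
```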

2 Answers
Even plain binary trees are a good choice, since under random data they stay reasonably balanced, and they are a very simple structure to implement.
In general, the structure of choice is a hash table, with constant average search time. The challenge is finding a good hash function for your data. Hash tables are not difficult to implement, and there are plenty of good C libraries that provide them.
But if you are willing to wait until all inodes are stored in the array, you can simply sort the array and traverse it once to find duplicates, as in the sketch below.
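A minimal sketch of the sort-then-scan approach, assuming the inodes have already been collected into an array (names are illustrative):

```c
/* Sort the collected inodes, then one pass over neighbours finds
 * duplicates: equal neighbours are the same file reached via
 * different links. O(n log n) overall instead of O(n^2). */
#include <stdlib.h>
#include <sys/types.h>

static int cmp_ino(const void *a, const void *b)
{
    ino_t x = *(const ino_t *)a;
    ino_t y = *(const ino_t *)b;
    return (x > y) - (x < y);     /* avoids overflow from subtraction */
}

/* Returns the number of distinct inodes in inodes[0..n-1]. */
static size_t count_unique(ino_t *inodes, size_t n)
{
    if (n == 0)
        return 0;
    qsort(inodes, n, sizeof *inodes, cmp_ino);

    size_t unique = 1;
    for (size_t i = 1; i < n; i++)
        if (inodes[i] != inodes[i - 1])
            unique++;
    return unique;
}
```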
EDIT:
Inodes carry a reference count, which is the number of hard links pointing to them. So you only need to check for duplicates among the inodes with a reference count greater than 1.
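For example (a sketch only; already_seen/remember stand in for whatever set structure you choose):

```c
/* Only an inode with more than one hard link can show up again, so
 * only those need to be remembered; everything else is counted directly. */
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Placeholders for whatever set structure you end up using
 * (sorted array, hash table, ...): */
bool already_seen(ino_t ino);
void remember(ino_t ino);

off_t add_size(const struct stat *sb, off_t total)
{
    if (sb->st_nlink > 1) {
        if (already_seen(sb->st_ino))
            return total;         /* already counted via another hard link */
        remember(sb->st_ino);
    }
    return total + sb->st_size;
}
```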

- In general, the number of files in a directory (without recursing) probably isn't high enough to warrant a hash table, so a binary tree would likely be faster in practice than a hash table. – mrQWERTY Feb 24 '15 at 01:16
Use a hash table. Lookups are O(1) on average (though somewhat expensive for tiny sets). You may find this "overly complex", as you said about red-black trees, but if you want good worst-case performance you will need something a little more complex than a plain array (which, by the way, would be fastest for small sets despite its worse theoretical time complexity).
If you don't have a hash table implementation available already (this is C, after all), there is an overview of several here: https://stackoverflow.com/a/8470745/4323
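If you don't want a library dependency at all, a tiny fixed-size open-addressing set over ino_t is enough to show the idea (a sketch, not production code: fixed capacity, no resizing, and it assumes the table never fills up):

```c
/* Minimal open-addressing hash set of inodes: fixed capacity, linear
 * probing, no resizing. Enough to show the O(1) average lookup idea. */
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define TABLE_SIZE 8192           /* power of two; must exceed the file count */

static ino_t table[TABLE_SIZE];
static bool  used[TABLE_SIZE];

/* Inserts ino and returns true if it was already present. */
static bool seen_before(ino_t ino)
{
    size_t i = (size_t)(ino * 2654435761u) & (TABLE_SIZE - 1);

    while (used[i]) {
        if (table[i] == ino)
            return true;          /* another link to a file already counted */
        i = (i + 1) & (TABLE_SIZE - 1);   /* linear probing */
    }
    used[i]  = true;
    table[i] = ino;
    return false;
}
```

Note that inode numbers are only unique within a single filesystem, so a tool that crosses mount points would key on the (st_dev, st_ino) pair instead.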

