
I am dealing with hundreds of thousands of files.

I have to process those files one by one. In doing so, I need to remember which files have already been processed.

All I can think of is storing the file path of each file in a long array and then checking it every time for duplicates.

But I think there should be a better way.

Is it possible for me to generate a KEY (which is a number) or something that just remembers all the files that have been processed?

Alphaneo
  • see also http://stackoverflow.com/questions/2962207/constructing-a-hash-table-hash-function – Mark Elliot Nov 10 '10 at 05:21
  • What do you need to remember them for? It depends on what you're going to do with this information. – John Kugelman Nov 10 '10 at 05:22
  • @John Kugelman I mean it's not a hash table, but just a single key that would remember whether a particular string was already encountered or not. – Alphaneo Nov 10 '10 at 05:26
  • Take a look at [Bloom Filters](https://secure.wikimedia.org/wikipedia/en/wiki/Bloom_filter). They are not 100% accurate but they do satisfy your need for a "single key". – Abhinav Sarkar Nov 10 '10 at 05:52
  • @abhin4v I think Bloom Filters is the closest to what I was looking for, thank you. I am now looking into it. – Alphaneo Nov 16 '10 at 07:56

5 Answers


You could use some kind of hash function (MD5, SHA1).

Pseudocode:

    for each F in filelist
        hash = md5(name of F)

        if hash not in storage
            process file F
            add hash to storage

see https://www.rfc-editor.org/rfc/rfc1321 for a C implementation of MD5
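
For example, here is a rough sketch of the same idea in Python (do_processing is a hypothetical placeholder for whatever per-file work you do; note that only the file name is hashed, not its contents):

    import hashlib

    processed = set()                     # MD5 digests of names already handled

    def process_once(path):
        # hash the file name/path, not the file contents
        digest = hashlib.md5(path.encode("utf-8")).hexdigest()
        if digest in processed:
            return                        # already seen this name, skip it
        do_processing(path)               # hypothetical placeholder for your per-file work
        processed.add(digest)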

Community
  • @RC Thanks for the response, can you please give some more detail. – Alphaneo Nov 10 '10 at 05:22
  • @RC Thank you for the details, I think I need to rephrase my question. I am JUST interested in knowing if I have already processed the file, using the filename. And a key that would let me know if I have processed the filename ... – Alphaneo Nov 10 '10 at 05:35
  • @Alphaneo, added clarification –  Nov 10 '10 at 05:41
  • This is no different to storing the path URLs... except it takes longer since you are hashing each file. I think his original question stated that he wants to be able to deduce from a SINGLE key which files have been processed and which have not. – Nico Huysamen Nov 10 '10 at 05:58
  • I suppose he wants to compute the hash for the file name. – starblue Nov 10 '10 at 15:52

There are probabilistic methods that give approximate results, but if you want to know for sure whether a string is one you've seen before or not, you must store all the strings you've seen so far, or equivalent information. It's a pigeonhole principle argument. Of course, you can avoid a linear search of the strings you've seen so far by using all sorts of methods like hash tables, binary trees, etc.

R.. GitHub STOP HELPING ICE

If I understand your question correctly, you want to create a SINGLE key that takes on a specific value, and from that value you should be able to deduce which files have been processed already? I don't know if you are going to be able to do that, simply because your space is quite big and generating unique key representations for such a huge space requires a lot of memory.

As mentioned, what you can do is simply to store each path URL in a HashSet. Putting a hundred thousand entries into the Set is not that bad, and lookup time is amortized constant time O(1), so it will be quite fast.
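
A rough sketch of that in Python, where the built-in set plays the role of the HashSet (all_file_paths and handle_file are assumed names for illustration, not real APIs):

    seen_paths = set()                   # full path strings already processed

    for path in all_file_paths:          # assumed: some iterable of path strings
        if path in seen_paths:           # amortized O(1) membership test
            continue
        handle_file(path)                # assumed placeholder for the real processing step
        seen_paths.add(path)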

Nico Huysamen

A Bloom filter can solve your problem. The idea of a Bloom filter is simple. It begins with an empty array of some length, with all its members set to zero, and K hash functions. Whenever we need to insert an item into the Bloom filter, we hash the item with all K hash functions. These hash functions give K indexes into the Bloom filter, and we set the members at those indexes to 1. To check if an item exists in the Bloom filter, simply hash it with all K hash functions and check the corresponding array indexes. If all of them are 1s, the item is considered present in the Bloom filter.

Kindly note that a Bloom filter can give false positive results, but it will never give a false negative. You need to tune the Bloom filter (e.g. its size and number of hash functions) to keep these false positive cases rare.
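
A minimal sketch of the idea in Python (the filter size M, the number of hash functions K, and the use of salted SHA-256 as the K hashes are arbitrary choices for illustration, not tuned values):

    import hashlib

    M = 1 << 20                  # number of bits in the filter (arbitrary for this sketch)
    K = 4                        # number of hash functions
    bits = bytearray(M // 8)     # bit array, all zeros initially

    def _indexes(item):
        # derive K indexes by salting one hash function K different ways
        for i in range(K):
            h = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % M

    def add(item):
        for idx in _indexes(item):
            bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(item):
        # True may be a false positive; False is always correct
        return all(bits[idx // 8] & (1 << (idx % 8)) for idx in _indexes(item))

Here might_contain may return True for an item that was never added (the false positive case described above), but it will never return False for one that was.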

kjoshi

What you need, IMHO, is some sort of tree- or hash-based set implementation. It is basically a data structure that supports very fast add, remove, and query operations and keeps only one instance of each element (i.e. no duplicates). A few hundred thousand strings (assuming they are not themselves hundreds of thousands of characters long) should not be a problem for such a data structure.

Your programming language of choice probably already has one, so you don't need to write one yourself. C++ has std::set. Java has the Set implementations TreeSet and HashSet. Python has a built-in set. They all allow you to add elements and check for the presence of an element very fast (O(1) for hashtable-based sets, O(log(n)) for tree-based sets). Other than those, there are lots of free implementations of sets, as well as general-purpose binary search trees and hashtables, that you can use.
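
Python's standard library has no tree-based set, but as a rough sketch of the ordered, O(log n)-lookup idea mentioned above, you can keep a sorted list and binary-search it with bisect (the example paths are made up; note that inserting into a plain list is O(n), which a real balanced tree avoids):

    import bisect

    seen = []                                        # kept sorted, standing in for a tree-based set

    def add(sorted_list, item):
        i = bisect.bisect_left(sorted_list, item)    # O(log n) search for the insert position
        if i == len(sorted_list) or sorted_list[i] != item:
            sorted_list.insert(i, item)              # list insert itself is O(n)

    def contains(sorted_list, item):
        i = bisect.bisect_left(sorted_list, item)    # O(log n) binary search
        return i < len(sorted_list) and sorted_list[i] == item

    add(seen, "/data/b.txt")
    add(seen, "/data/a.txt")
    print(contains(seen, "/data/a.txt"))             # True
    print(contains(seen, "/data/c.txt"))             # False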

MAK
  • this is the closest SUGGESTION that I got, though Bloom filter was what I was looking for, so I will go ahead and accept this answer. – Alphaneo Nov 25 '10 at 00:47