
Problem

I have a large (> 500e6 rows) dataset that I've put into a pytables database.

Let's say the first column is an ID and the second column is a counter for each ID. Each ID-counter combination has to be unique. There is one non-unique row amongst the 500e6 rows that I'm trying to find.

As a starter I've done something like this:

# th is the open PyTables table handle
index1 = th.cols.id.create_index()
index2 = th.cols.counts.create_index()
for row in th:
    query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
    result = th.readWhere(query)
    if len(result) > 1:
        print(row)

It's a brute-force method, I'll admit. Any suggestions for improvements?

update

The current brute-force runtime is 8421 minutes.

solution

Thanks for the input, everyone. I managed to get the runtime down to 2364.7 seconds using the following method:

import numpy as np
import tables as tb

expr = '(x * 65536) + y'
ex = tb.Expr(expr, uservars={'x': th.cols.id, 'y': th.cols.counts})
ex.set_output(th.cols.hash)   # write the packed id/counts key into the hash column
ex.eval()

# filters is the tables.Filters instance used elsewhere for this file
indexrows = th.cols.hash.create_csindex(filters=filters)

ref = None
dups = []
for row in th.itersorted(sortby=th.cols.hash):
    if row['hash'] == ref:
        dups.append(row['hash'])
    ref = row['hash']

print("ids: ", np.right_shift(np.array(dups, dtype=np.int64), 16))
print("counts: ", np.array(dups, dtype=np.int64) & (65536 - 1))

I can generate a perfect hash because my maximum values are less than 2^16: I am effectively bit-packing the two columns into a single 32-bit int.
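
As a sanity check, here is a tiny self-contained illustration of that packing and unpacking (the sample values are made up; only the 65536 multiplier, the right shift and the mask come from the code above):

import numpy as np

ids = np.array([7, 7, 12345], dtype=np.int64)
counts = np.array([3, 3, 9], dtype=np.int64)

packed = ids * 65536 + counts    # id in the high 16 bits, counts in the low 16

assert np.array_equal(np.right_shift(packed, 16), ids)    # recover the ids
assert np.array_equal(packed & (65536 - 1), counts)       # recover the counts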

Once the csindex is generated it is fairly trivial to iterate over the sorted values and do a neighbor test for duplicates.

This method can probably be tweaked a bit, but I'm testing a few alternatives that may provide a more natural solution.

scrooge
  • Do you know anything about where the non-duplicate row might be? E.g., has an id higher than X? – Simeon Visser Dec 23 '13 at 12:23
  • not a clue unfortunately. and to make matters worse, the IDs are not explicitly sorted. – scrooge Dec 23 '13 at 12:41
  • What's the range of the inputs? Could they be compactly represented in just a few bytes, or hashed down to just a few bytes? – tripleee Dec 23 '13 at 12:47
  • Take a look at this question: http://stackoverflow.com/questions/1315129/efficient-way-to-verify-that-records-are-unique-in-python-pytables, it seems to be a similar problem. – Gianluca Dec 23 '13 at 12:48
  • both columns fit in unsigned short ints. I guess you could pack them both into a 4 byte int, or a hash. so then it becomes a matter of finding duplicates in a single column. worth some speed tests i guess... – scrooge Dec 23 '13 at 12:55
  • Parallelise this using `threading.Thread` (or anything of the sort) and just live with the fact that having 1 "bad" row among **500e6** (I made this bold because the number is simply ridiculous) others is a *needle in a haystack* problem? – Aleksander Lidtke Dec 23 '13 at 12:55
  • Gianluca: my code is based upon akiladila's answer – scrooge Dec 23 '13 at 12:59
  • You can try `th.readWhere('(id == cid) & (counts == ccounts)', {'cid': row['id'], 'ccounts': row['counts']})`. If I read docs correctly, it could work and be faster. – zch Dec 23 '13 at 13:13
  • Aleksander: it's seismic data. admittedly this is my biggest dataset yet, but i've had 3 datasets this year over 100e6 rows. the scary thing is the actual data is starting to exceed 10k samples per frame. right now i'm just dealing with data headers. i'm looking at some parallelised sorting algorithms atm. I figure if i can guarantee sort, i only need to compare neighbours for duplicates. – scrooge Dec 23 '13 at 13:15
  • How much memory do you have to work with? – DSM Dec 23 '13 at 13:54
  • If you can hash/combine your keys to a 4 byte int, try using a bit vector to indicate found keys. – mpez0 Dec 23 '13 at 13:54 (see the sketch just after these comments)
  • DSM: 16Gb on this workstation – scrooge Dec 26 '13 at 23:03
  • mpez0: got a link describing what you mean by bit vectors? – scrooge Dec 26 '13 at 23:03
  • A bit vector is a bitmap, if bit *n* is set then you have seen key *n*. – tripleee Dec 27 '13 at 16:20
  • one interesting result to come out of some testing, on small scale tests (1e4 rows) using `where()` is around 3 times faster (11.6s) than using `readWhere()` (35s). am repeating tests with 1e6 rows. – scrooge Dec 28 '13 at 13:51
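
A rough sketch of the bit-vector idea from the comments above: because the packed key fits in 32 bits, one bit per possible key is 2^32 bits = 512 MB, which fits comfortably in 16 GB of RAM. The file name, node path, column names and chunk size below are placeholders, not taken from the question:

import numpy as np
import tables as tb

CHUNK = 1_000_000

with tb.open_file('data.h5', mode='r') as h5:    # placeholder file/node names
    table = h5.root.data

    # One bit per possible 32-bit packed key: 2**32 bits == 512 MB of RAM.
    seen = np.zeros(2**32 // 8, dtype=np.uint8)
    dups = []

    for start in range(0, table.nrows, CHUNK):
        rows = table.read(start, start + CHUNK)
        keys = rows['id'].astype(np.uint32) * 65536 + rows['counts']

        # Cross-chunk duplicates: their bit was set by an earlier chunk.
        byte_idx = keys >> 3
        bit = np.uint8(1) << (keys & 7).astype(np.uint8)
        dups.extend(keys[(seen[byte_idx] & bit) != 0].tolist())

        # Duplicates that fall inside the same chunk.
        uniq, per_key = np.unique(keys, return_counts=True)
        dups.extend(uniq[per_key > 1].tolist())

        # Mark every key in this chunk as seen.
        np.bitwise_or.at(seen, byte_idx, bit)

print("duplicate packed keys:", sorted(set(dups)))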

2 Answers


Two obvious techniques come to mind: hashing and sorting.

A) define a hash function to combine ID and Counter into a single, compact value.

B) count how often each hash code occurs

C) select from your data all rows that have hash collisions (this should be a *much* smaller data set)

D) sort this data set to find duplicates.

The hash function in A) needs to be chosen so that its output range fits into main memory while still providing enough selectivity. Maybe use two bitsets of 2^30 size or so for this. You can afford to have 5-10% collisions; this should still reduce the data set size enough to allow fast in-memory sorting afterwards.

This is essentially a Bloom filter.
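
A rough sketch of steps A) through D), assuming the two columns arrive as (ids, counts) array chunks from a caller-supplied iterator; the primes, filter sizes and helper names are illustrative, not part of the answer:

import numpy as np

M = 2**30                                 # bits per filter (~128 MB each as bytes)
P1, P2 = 2_147_483_647, 1_000_000_007     # two large primes, purely illustrative

def candidate_keys(chunks):
    """Steps A) and B): a two-hash Bloom-style pass returning packed keys
    that *might* occur more than once (false positives are expected)."""
    filt1 = np.zeros(M // 8, dtype=np.uint8)
    filt2 = np.zeros(M // 8, dtype=np.uint8)
    maybe = []
    for ids, counts in chunks:
        keys = ids.astype(np.int64) * 65536 + counts    # step A): compact key
        h1, h2 = (keys % P1) % M, (keys % P2) % M
        hit1 = (filt1[h1 >> 3] >> (h1 & 7).astype(np.uint8)) & 1
        hit2 = (filt2[h2 >> 3] >> (h2 & 7).astype(np.uint8)) & 1
        maybe.append(keys[(hit1 & hit2) == 1])
        np.bitwise_or.at(filt1, h1 >> 3, np.uint8(1) << (h1 & 7).astype(np.uint8))
        np.bitwise_or.at(filt2, h2 >> 3, np.uint8(1) << (h2 & 7).astype(np.uint8))
    return np.unique(np.concatenate(maybe))

def true_duplicates(chunks, candidates):
    """Steps C) and D): re-scan the data, keep only rows whose key is a
    candidate, then sort the much smaller set and compare neighbours."""
    kept = []
    for ids, counts in chunks:
        keys = ids.astype(np.int64) * 65536 + counts
        kept.append(keys[np.isin(keys, candidates)])
    kept = np.sort(np.concatenate(kept))
    return np.unique(kept[1:][kept[1:] == kept[:-1]])

The chunk iterator is consumed twice (once per pass), so it would be created fresh for each call, e.g. by a generator that slices the PyTables table. For this particular dataset the packed key already fits in 32 bits, so a single exact bitmap would also do; the two-filter version is the general form for keys that don't pack that compactly.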

Has QUIT--Anony-Mousse
  • what kind of hash function did you have in mind? i'm just a little wary of touching row by row functions. i would probably have to limit myself to simple functions which can be computed via tables.Expr. like multiplying the two columns. – scrooge Dec 28 '13 at 14:05
  • That depends on the distribution of your IDs. You can probably just use `(a + b * p1) % p2` for two well-chosen prime numbers p1 and p2. – Has QUIT--Anony-Mousse Dec 28 '13 at 14:18

The brute force approach that you've taken appears to require you to execute 500e6 queries, one for each row of the table. Although I think that the hashing and sorting approaches suggested in another answer are essentially correct, it's worth noting that pytables is already supposedly built for speed, and should already be expected to have these kinds of techniques effectively included "under the hood", so to speak.

I contend that the simple code you have written most likely does not yet take best advantage of the capabilities that pytables already makes available to you.

In the documentation for create_index(), it says that the default settings are optlevel=6 and kind='medium'. It mentions that you can increase the speed of each of your 500e6 queries by decreasing the entropy of the index, and you can decrease the entropy of your index to its minimum possible value (zero) either by choosing non-default values of optlevel=9 and kind='full', or equivalently, by generating the index with a call to create_csindex() instead. According to the documentation, you have to pay a little more upfront by taking a longer time to create a better optimized index to begin with, but then it pays you back later by saving you time on the series of queries that you have to repeat 500e6 times.
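
A minimal sketch of what that looks like, assuming `th` is the open table from the question with integer columns `id` and `counts` that have not been indexed yet:

# Completely sorted index: the documented shortcut for optlevel=9, kind='full'.
th.cols.id.create_csindex()
th.cols.counts.create_csindex()

# The same brute-force scan, but each query is now served by zero-entropy
# indexes; condvars avoids rebuilding the query string for every row.
for row in th:
    matches = th.read_where('(id == cid) & (counts == ccounts)',
                            condvars={'cid': row['id'], 'ccounts': row['counts']})
    if len(matches) > 1:
        print(row['id'], row['counts'])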

If optimizing your pytables column indices fails to speed up your code sufficiently, and you want to perform a massive sort on all of the rows and then search for duplicates by looking for matches in adjacent sorted rows, it's possible to perform a merge sort in O(N log(N)) time using relatively modest amounts of memory by sorting the data in chunks and then saving the chunks in temporary files on disk. Examples here and here demonstrate in principle how to do it in Python specifically. But you should really try optimizing your pytables index first, as that's likely to provide a much simpler and more natural solution in your particular case.
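
For completeness, a compact sketch of such a chunked external merge sort using only numpy and the standard library; the function names, buffer size and the assumption that the keys arrive as packed 64-bit integers are illustrative:

import heapq
import tempfile
import numpy as np

MERGE_BUFFER = 1 << 20          # int64 values read back per run while merging

def external_sort(key_chunks):
    """Sort more keys than fit in RAM: sort each chunk in memory, spill it to
    a temporary file as a sorted run, then lazily merge the runs with heapq."""
    runs = []
    for keys in key_chunks:                          # each chunk fits in memory
        run = tempfile.TemporaryFile()
        np.sort(keys.astype(np.int64)).tofile(run)   # one sorted run per chunk
        run.seek(0)
        runs.append(run)

    def stream(run):
        while True:
            block = np.fromfile(run, dtype=np.int64, count=MERGE_BUFFER)
            if block.size == 0:
                return
            yield from block.tolist()

    return heapq.merge(*(stream(run) for run in runs))

def adjacent_duplicates(sorted_keys):
    """Neighbour test on the merged, globally sorted stream."""
    prev = None
    for key in sorted_keys:
        if key == prev:
            yield key
        prev = key

# Toy usage with in-memory chunks standing in for slices of the real table:
chunks = [np.array([9, 3, 7]), np.array([1, 3, 8])]
print(list(adjacent_duplicates(external_sort(chunks))))   # -> [3]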

stachyra
  • create_csindex() gives a significant improvement on small scale tests. however on the actual dataset the run time still appears to be prohibitive. probably because it scales close to O(N^2). doing some timing tests now. – scrooge Dec 28 '13 at 13:49