Problem
I have a large (> 500e6 rows) dataset that I've put into a PyTables database.
Let's say the first column is an ID and the second column is a counter for each ID. Each ID-counter combination has to be unique, but somewhere among the 500e6 rows there is one non-unique combination that I'm trying to find.
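For reference, the layout is roughly this (a sketch with made-up file and node names; the hash column only matters for the solution below):

import tables as tb

class Record(tb.IsDescription):
    id = tb.UInt16Col()      # values stay below 2^16
    counts = tb.UInt16Col()  # values stay below 2^16
    hash = tb.Int64Col()     # only used by the solution below

h5 = tb.open_file('data.h5', mode='a')  # made-up filename
th = h5.create_table('/', 'records', Record)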
As a first attempt I've done something like this:

index1 = th.cols.id.create_index()
index2 = th.cols.counts.create_index()

for row in th:
    # Look up each row's (id, counts) pair; more than one hit means a duplicate
    query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
    result = th.read_where(query)
    if len(result) > 1:
        print(row)
It's a brute-force method, I'll admit. Any suggestions for improvements?
Update
The current brute-force runtime is 8421 minutes.
Solution
Thanks for the input, everyone. I managed to get the runtime down to 2364.7 seconds using the following method:
import numpy as np
import tables as tb

# Pack both 16-bit columns into one integer: hash = id * 2^16 + counts
ex = tb.Expr('(x * 65536) + y', uservars={"x": th.cols.id, "y": th.cols.counts})
ex.set_output(th.cols.hash)
ex.eval()

# `filters` is my usual tb.Filters instance, defined elsewhere
indexrows = th.cols.hash.create_csindex(filters=filters)

ref = None
dups = []
for row in th.itersorted(sortby=th.cols.hash):
    # Sorting puts identical hashes next to each other
    if row['hash'] == ref:
        dups.append(row['hash'])
    ref = row['hash']

print("ids: ", np.right_shift(np.array(dups, dtype=np.int64), 16))
print("counts: ", np.array(dups, dtype=np.int64) & (65536 - 1))
I can generate a perfect hash because both columns' maximum values are less than 2^16: I'm effectively bit-packing the two columns into a single 32-bit int.
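For illustration, here's the packing and unpacking arithmetic on its own, with toy values rather than my real data:

import numpy as np

# Both columns fit in 16 bits, so each packed value fits in 32 bits.
ids = np.array([7, 7, 42], dtype=np.int64)
counts = np.array([3, 3, 9], dtype=np.int64)

hashes = ids * 65536 + counts                        # same as (ids << 16) | counts
assert np.array_equal(np.right_shift(hashes, 16), ids)
assert np.array_equal(hashes & (65536 - 1), counts)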
Once the csindex is generated, it's fairly trivial to iterate over the sorted values and do a neighbor test for duplicates.
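The same neighbor test can also be done in one vectorized pass with NumPy, assuming the hash column fits in memory (a sketch, not what I benchmarked):

import numpy as np

hashes = np.sort(th.cols.hash[:])  # read the whole column into RAM and sort it
dup_hashes = hashes[1:][hashes[1:] == hashes[:-1]]  # equal neighbors are duplicates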
This method can probably be tweaked a bit, but I'm testing a few alternatives that may provide a more natural solution.