I currently have a .h5 file containing a table with three columns: a text column of 64 chars, a UInt32 column identifying the source of the text, and a UInt32 column holding the xxhash of the text. The table consists of ~2.5e9 rows.
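For reference, the table description looks roughly like the sketch below (this is just how such a table would be declared with PyTables; the class name `PasswordRecord` is illustrative, while the column names match the ones used in my loop at the bottom):

```
import tables as tb

class PasswordRecord(tb.IsDescription):
    password = tb.StringCol(64)  # raw text entry, 64 bytes
    src      = tb.UInt32Col()    # identifier/bitmask of the text's source
    xhashx   = tb.UInt32Col()    # xxhash of the text
```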
I am trying to find and count the duplicates of each text entry in the table - essentially merging them into one entry while counting the instances. I have tried doing so by indexing on the hash column and then looping through table.itersorted(hash), keeping track of the current hash value and checking for collisions - very similar to finding a duplicate in a hdf5 pytable with 500e6 rows. I did not modify the table while looping through it, but rather wrote the merged entries to a new table - the code is at the bottom.
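For completeness, the indexing step before the loop is essentially the following (a sketch; the file name and node path are illustrative, the column name `xhashx` is the one from my code below):

```
import tables as tb

h5file = tb.open_file("passwords.h5", mode="a")  # file name is illustrative
table = h5file.root.passwords                    # node path is illustrative

# itersorted('xhashx') needs a completely sorted index (CSI) on that column
if not table.cols.xhashx.is_indexed:
    table.cols.xhashx.create_csindex()
```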
Basically, the problem is that the whole process takes far too long - it took me about 20 hours to get to iteration #5 4e5. I am working on an HDD, however, so it is entirely possible the bottleneck is there. Do you see any way to improve my code, or can you suggest another approach? Thank you in advance for any help.
P.S. I promise I am not doing anything illegal - it is simply a large-scale leaked-password analysis for my Bachelor's thesis.
```
ref = 3  # manually checked first occurring hash, to simplify the code below
gen_cnt = 0  # rows since the last flush, so as not to flush after every iteration
locs = {}  # password -> [count, source bitmask] for the current hash value

print("STARTING")
for row in table.itersorted('xhashx'):
    gen_cnt += 1
    ps = row['password'].decode(encoding='utf-8', errors='ignore')

    if row['xhashx'] == ref:
        # same hash as the previous row: count the password and merge its sources
        if ps in locs:
            locs[ps][0] += 1
            locs[ps][1] |= row['src']
        else:
            locs[ps] = [1, row['src']]
    else:
        # hash changed: write out the merged entries collected for the previous hash
        for p in locs:
            fill_password(new_password, locs[p])  # simply fills in the columns, with some fairly cheap statistics procedures
            new_password.append()

        if gen_cnt > 100:
            gen_cnt = 0
            new_table.flush()

        # start collecting for the new hash value
        ref = row['xhashx']
        locs = {ps: [1, row['src']]}

# after the loop: write out the last group and flush whatever remains
for p in locs:
    fill_password(new_password, locs[p])
    new_password.append()
new_table.flush()
```