Pytables duplicates 2.5 giga rows

Question

I currently have a .h5 file, with a table in it consisting of three columns: a text columns of 64 chars, an UInt32 column relating to the source of the text and a UInt32 column which is the xxhash of the text. The table consists of ~ 2.5e9 rows

I am trying to find and count the duplicates of each text entry in the table - essentially merge them into one entry, while counting the instances. I have tried doing so by indexing on the hash column and then looping through table.itersorted(hash), while keeping track of the hash value and checking for collisions - very similar to finding a duplicate in a hdf5 pytable with 500e6 rows. I did not modify the table as I was looping through it but rather wrote the merged entries to a new table - I am putting the code at the bottom.

Basically the problem I have is that the whole process takes significantly too long - it took me about 20 hours to get to iteration #5 4e5. I am working on a HDD however, so it is entirely possible the bottleneck is there. Do you see any way I can improve my code, or can you suggest another approach? Thank you in advance for any help.

P.S. I promise I am not doing anything illegal, it is simply a large scale leaked password analysis for my Bachelor Thesis.

ref = 3 #manually checked first occuring hash, to simplify the below code
gen_cnt = 0
locs = {}


print("STARTING")
for row in table.itersorted('xhashx'):
    gen_cnt += 1 #so as not to flush after every iteration
    ps = row['password'].decode(encoding = 'utf-8', errors = 'ignore')

    if row['xhashx'] == ref:
        if ps in locs:
            locs[ps][0] += 1
            locs[ps][1] |= row['src']

        else:
            locs[ps] = [1, row['src']]


    else:
        for p in locs:
            fill_password(new_password, locs[ps]) #simply fills in the columns, with some fairly cheap statistics procedures
            new_password.append()   

        if (gen_cnt > 100):
            gen_cnt = 0
            new_table.flush()  

        ref = row['xhashx']```

kcw78 · Answer 1 · 2020-03-23T12:56:24.570

Your dataset is 10x larger than the referenced solution (2.5e9 vs 500e6 rows). Have you done any testing to identify where the time is spent? The table.itersorted() method may not be linear - and might be resource intensive. (I don't have any experience with itersorted.)

Here is a process that might be faster:

Extract a NumPy array of the hash field (column xhashx )
Find the unique hash values
Loop thru the unique hash values and extract a NumPy array of rows that match each value
Do your uniqueness tests against the rows in this extracted array
Write the unique rows to your new file

Code for this process below:
Note: This has been not tested, so may have small syntax or logic gaps

# Step 1: Get a Numpy array of the 'xhashx' field/colmu only:
hash_arr = table.read(field='xhashx')
# Step 2: Get new array with unique values only:
hash_arr_u = np.unique(hash_arr)

# Alternately, combine first 2 steps in a single step
hash_arr_u = np.unique(table.read(field='xhashx'))

# Step 3a: Loop on rows unique hash values
for hash_test in hash_arr_u :

# Step 3b: Get an array with all rows that match this unique hash value
     match_row_arr = table.read_where('xhashx==hash_test')

# Step 4: Check for rows with unique values
# Check the hash row count. 
# If there is only 1 row, uniqueness tested not required  
     if match_row_arr.shape[0] == 1 :
     # only one row, so write it to new.table

     else :
     # check for unique rows
     # then write unique rows to new.table

##################################################
# np.unique has an option to save the hash counts
# these can be used as a test in the loop
(hash_arr_u, hash_cnts) = np.unique(table.read(field='xhashx'), return_counts=True)

# Loop on rows in the array of unique hash values
for cnt in range(hash_arr_u.shape[0]) :

# Get an array with all rows that match this unique hash value
     match_row_arr = table.read_where('xhashx==hash_arr_u(cnt)')

# Check the hash row count. 
# If there is only 1 row, uniqueness tested not required  
     if hash_cnts[cnt] == 1 :
     # only one row, so write it to new.table

     else :
     # check for unique rows
     # then write unique rows to new.table

Thank you for your well-structured response! I have not done any quantitative research into where itersorted takes the most time, but I tried doing some eyetests and came to the conclusion that a method similar to yours might be faster. I didnt think of the fact that I can hold the whole column of unique hash values in memory, so I ended up writing the unique hash values to a file and looping through that. This of course slows down the process again because of the extra file I/O and consequent disk reads. I will try your version and report back! — Miha Smaka, Mar 23 '20 at 22:05
If you find table.read() of the entire 'xhashx' column is too large, you can do it incrementally in aloop (loop1: rows 0->1e8; loop2: rows 1e8->2e8, etc). Just add the start= and stop= parameters to rabe.read(), next find the unique values, then add them to the hash_arr_u array. You will need to recheck for unique values after the loop is done. — kcw78, Mar 24 '20 at 15:14

Pytables duplicates 2.5 giga rows

1 Answers1

Linked