
I am trying to find the weighted Jaccard distance for every pair of rows in a ~8000x8000 pandas DataFrame. I've tried the following:

import pandas as pd
import numpy as np

def weighted_j_sim(array1, array2):
    return np.minimum(array1, array2).sum()/np.maximum(array1, array2).sum()

matrix = pd.DataFrame([[1, 2, 3],
                       [2, 1, 1],
                       [3, 1, 1]])


for index, (name, values) in enumerate(matrix.iterrows()):
    for other_index, (other_name, other_values) in enumerate(matrix.iterrows()):
        if other_index > index:  # don't check self or something already compared
            weighted_js = weighted_j_sim(values, 
                                         other_values)

and

import pandas as pd
import numpy as np
            
def weighted_j_sim(array1, array2):
    #https://stackoverflow.com/a/71276180/11357695
    q = np.concatenate((array1.T, array2.T), axis=1)
    return np.sum(np.amin(q,axis=1))/np.sum(np.amax(q,axis=1))

matrix = pd.DataFrame([[1, 2, 3],
                       [2, 1, 1],
                       [3, 1, 1]])
    
for index, (name, values) in enumerate(matrix.iterrows()):
    for other_index, (other_name, other_values) in enumerate(matrix.iterrows()):
        if other_index > index:  # don't check self or something already compared
            weighted_jd = weighted_j_sim(np.array([values.values]), 
                                         np.array([other_values.values]))

This is very slow - can anyone suggest some numpy magic to apply here?

Tim Kirkwood

1 Answer


One big performance issue is that the code runs a quadratic-time algorithm on Pandas Series objects, which are generally much slower to work with than NumPy arrays. Another issue is that converting the same row to a NumPy array several times is not efficient. Moreover, repeatedly creating temporary NumPy arrays is a bit expensive. Additionally, traversing the same array for both the minimum and the maximum can be inefficient if the array is huge, as it needs to be reloaded into the L1 cache.
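
As a minimal illustration of the first two points only, the question's double loop can be kept as-is while the dataframe is converted to a NumPy array once and rows are indexed directly (still quadratic, but without the per-row Pandas overhead); this is a sketch, not the final solution:

import numpy as np
import pandas as pd

def weighted_j_sim(array1, array2):
    # Same formula as in the question, but on plain NumPy arrays.
    return np.minimum(array1, array2).sum() / np.maximum(array1, array2).sum()

matrix = pd.DataFrame([[1, 2, 3],
                       [2, 1, 1],
                       [3, 1, 1]])

rows = matrix.to_numpy()   # convert once instead of on every iteration
n = rows.shape[0]

for index in range(n):
    for other_index in range(index + 1, n):   # skip self and already-compared pairs
        weighted_js = weighted_j_sim(rows[index], rows[other_index])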


Faster implementation

The first thing to do is to convert the Pandas dataframe to a contiguous NumPy array. Then, we can use compiler-based tools like Numba or Cython to avoid creating many temporary arrays and to use the cache more efficiently. Besides, we can also easily remove the condition by starting the second loop at the right index. On top of that, one can use multiple threads (assuming the computation of weighted_js is thread-safe).

Here is a quite fast Numba implementation:

import numba as nb
import numpy as np
import pandas as pd

matrix = pd.DataFrame(np.random.rand(1000, 1000))

# Designed for 64-bit float -- please change the type if you want another kind of matrix
@nb.njit('float64(float64[:,::1])', fastmath=True, parallel=True)
def compute_fast(matrix):
    checksum = 0.0

    for i in nb.prange(matrix.shape[0]):
        for j in range(i+1, matrix.shape[0]):
            sum1 = 0.0
            sum2 = 0.0
            arr1 = matrix[i]
            arr2 = matrix[j]

            for k in range(matrix.shape[1]):
                a = arr1[k]
                b = arr2[k]
                sum1 += min(a, b)
                sum2 += max(a, b)

            weighted_js = sum1 / sum2

            # Required so that Numba does not optimize out the whole loop
            # (otherwise weighted_js would be unused and the whole loop useless).
            checksum += weighted_js

    return checksum

compute_fast(np.ascontiguousarray(matrix.to_numpy()))

Performance results

Here is the resulting performance on my i5-9600KF processor on a 1000x1000 dataframe:

First code of the question:       166.3 seconds
This optimized implementation:      0.1 second

The Numba implementation is thus ~1660 times faster than the (first) implementation provided in the question. It is able to process an 8000x8000 dataframe in only 65 seconds (instead of dozens of hours).
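
Note that compute_fast only returns a checksum (it is meant for benchmarking). If every pairwise value is actually needed, a possible variant is sketched below, assuming a full n x n float64 output is acceptable (about 0.5 GB for 8000 rows); compute_all is just an illustrative name, and the imports and matrix from the snippet above are reused:

@nb.njit('float64[:,::1](float64[:,::1])', fastmath=True, parallel=True)
def compute_all(matrix):
    n = matrix.shape[0]
    out = np.ones((n, n))                 # diagonal is 1: a row compared with itself

    for i in nb.prange(n):
        for j in range(i + 1, n):
            sum1 = 0.0
            sum2 = 0.0
            for k in range(matrix.shape[1]):
                a = matrix[i, k]
                b = matrix[j, k]
                sum1 += min(a, b)
                sum2 += max(a, b)
            out[i, j] = sum1 / sum2       # weighted Jaccard similarity
            out[j, i] = out[i, j]         # the measure is symmetric

    return out

similarities = compute_all(np.ascontiguousarray(matrix.to_numpy()))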


Notes and performance improvement

Note that fastmath should only be used if your input does not contain any special values like NaN, Inf or subnormal numbers. It is also only safe if sum2 is never 0 (because of the division by zero that would otherwise occur).
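
One way to make those assumptions explicit is to check the input once before calling the kernel; a minimal sketch reusing matrix from the snippet above (the exact checks depend on your data, and with non-negative weights sum2 is at least the sum of either row, so it can only be 0 if a whole row is 0):

values = np.ascontiguousarray(matrix.to_numpy())

assert np.isfinite(values).all()           # no NaN or Inf anywhere
assert (values >= 0).all()                 # weights assumed non-negative
assert (values.sum(axis=1) > 0).all()      # no all-zero row, hence sum2 > 0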

Note that this can be further optimized. One way is to use loop tiling so as to make better use of the CPU caches. This can be much faster if the algorithm is memory-bound on the target machine (which is the case on my machine). Another way is to load-balance the work between threads. This should make the computation up to twice as fast. Another optimization consists in using SIMD instructions. This does not seem possible with Numba (at least not easily nor efficiently): you need to use a native language (e.g. C/C++) to perform such an optimization. It should make the computation up to 4 times faster on most machines. Both optimizations can be combined to make the computation up to 8 times faster. The resulting program should be able to compute the Jaccard distances in close to optimal time (assuming all the distances need to be computed).
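
To give an idea of the loop-tiling part only (load balancing and SIMD are not addressed here), a sketch is shown below: pairs of rows are processed block by block so that a small group of rows is reused from cache for many pairs. The block size of 16 rows is an untuned assumption and compute_tiled is just an illustrative name:

BLOCK = 16  # untuned: 2*BLOCK rows should fit comfortably in cache

@nb.njit('float64(float64[:,::1])', fastmath=True, parallel=True)
def compute_tiled(matrix):
    n, m = matrix.shape
    checksum = 0.0

    for bi in nb.prange((n + BLOCK - 1) // BLOCK):
        i_start = bi * BLOCK
        i_end = min(i_start + BLOCK, n)
        for j_start in range(i_start, n, BLOCK):
            j_end = min(j_start + BLOCK, n)
            for i in range(i_start, i_end):
                # Only pairs with j > i are visited, as in compute_fast.
                for j in range(max(j_start, i + 1), j_end):
                    sum1 = 0.0
                    sum2 = 0.0
                    for k in range(m):
                        a = matrix[i, k]
                        b = matrix[j, k]
                        sum1 += min(a, b)
                        sum2 += max(a, b)
                    checksum += sum1 / sum2

    return checksum

compute_tiled(np.ascontiguousarray(matrix.to_numpy()))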

Note that using single precision (and more generally smaller types) can make this computation up to twice as fast (so up to 16 times faster combined with the previous optimizations).
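
For instance, a float32 variant of the same kernel could look like the sketch below. Note the np.float32 accumulators: with plain Python float literals the sums would typically be promoted back to 64-bit. compute_fast32 is just an illustrative name:

@nb.njit('float32(float32[:,::1])', fastmath=True, parallel=True)
def compute_fast32(matrix):
    checksum = np.float32(0.0)

    for i in nb.prange(matrix.shape[0]):
        for j in range(i + 1, matrix.shape[0]):
            sum1 = np.float32(0.0)
            sum2 = np.float32(0.0)

            for k in range(matrix.shape[1]):
                a = matrix[i, k]
                b = matrix[j, k]
                sum1 += min(a, b)
                sum2 += max(a, b)

            checksum += sum1 / sum2

    return checksum

compute_fast32(np.ascontiguousarray(matrix.to_numpy(), dtype=np.float32))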

Jérôme Richard