
I'm working on document clustering where I first build a distance matrix from the tf-idf results. I use the code below to get my tf-idf matrix:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(models)

This results in a matrix of shape (9069, 22210). Now I want to build a (9069, 9069) distance matrix from it. I'm using the following code for that:

import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
from scipy.spatial import distance

arrX = X.toarray()

rowSize = X.shape[0]
distMatrix = np.zeros(shape=(rowSize, rowSize))


#build distance matrix
for i, x in enumerate(arrX):
    for j, y in enumerate(arrX):   
        distMatrix[i][j] = distance.braycurtis(x, y)

np.savetxt("dist.csv", distMatrix, delimiter=",")

The problem with this code is that it's extremely slow for this matrix size. Is there a faster way of doing this?

Mahdi
  • As long as `distance.braycurtis` only works with individual `x,y` rows there isn't much you can do better. Calling it `rowSize**2` times is the main time consumer. – hpaulj Oct 29 '22 at 19:22

2 Answers

1

The biggest issue is that the algorithm runs in O(n^2 * m) time, where n = 9069 is the number of documents and m = 22210 is the number of tf-idf features: each call to `distance.braycurtis` processes two arrays of length 22210, and it is called 9069 * 9069 times. That amounts to roughly 9069^2 * 22210 ≈ 1.8e12 element-wise operations, which is huge. The asymptotic complexity itself probably cannot be improved, but there are several ways to speed this up:

The first thing to do is to avoid computing each distance twice. The Bray-Curtis distance is symmetric, so `distMatrix[i][j] == distMatrix[j][i]`: you can compute only the upper triangular part and then copy it into the lower triangular part, halving the work (see the sketch below).
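For reference, here is a minimal sketch of this first optimization on its own, still built on `distance.braycurtis` (the helper name `braycurtis_upper` is just for illustration; the variable name `arrX` follows the code in the question). It only halves the number of calls, so the Numba version below is what actually brings the large speedup:

import numpy as np
from scipy.spatial import distance

def braycurtis_upper(arrX):
    n = arrX.shape[0]
    distMatrix = np.zeros((n, n))
    # Compute only the upper triangle (i < j) and mirror it; the diagonal stays 0.
    for i in range(n):
        for j in range(i + 1, n):
            d = distance.braycurtis(arrX[i], arrX[j])
            distMatrix[i, j] = d
            distMatrix[j, i] = d
    return distMatrix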

Another optimization is simply not to use `distance.braycurtis`, because it is slow: it takes about 10 µs/call on my machine. This is mainly because it creates several temporary arrays and is mostly memory-bound due to the Numpy operations, and also because np.sum is not very fast (it uses a fairly precise summation algorithm that is hard to optimize). Moreover, it is sequential, while nearly all mainstream processors have multiple cores nowadays. We can use Numba to massively speed up this operation:

import numba as nb
import numpy as np

@nb.njit(['float32(float32[::1], float32[::1])', 'float64(float64[::1], float64[::1])'], fastmath=True)
def fastBrayCurtis(arr1, arr2):
    assert arr1.size == arr2.size
    assert arr1.size > 0
    zero = arr1[0] * 0 # Trick so that `zero` has the same dtype as `arr1`
    df, sm = zero, zero
    for k in range(arr1.size):
        df += np.abs(arr1[k] - arr2[k])
        sm += np.abs(arr1[k] + arr2[k])
    return df / sm

# The signatures are provided so that the function is compiled eagerly
# for both 32-bit and 64-bit floating-point 2D contiguous arrays.
@nb.njit(['float32[:,::1](float32[:,::1])', 'float64[:,::1](float64[:,::1])'], fastmath=True, parallel=True)
def brayCurtisDistMatrix(arr):
    n = arr.shape[0]
    distance = np.empty((n, n), dtype=arr.dtype)

    # Compute the distance matrix in parallel while balancing the work between threads
    for i in nb.prange((n+1)//2):
        # Top of the upper triangular part (many items)
        for j in range(i, n):
            distance[j, i] = distance[i, j] = fastBrayCurtis(arr[i], arr[j])
        # Bottom of the upper triangular part (few items)
        for j in range(n-1-i, n):
            distance[j, n-1-i] = distance[n-1-i, j] = fastBrayCurtis(arr[n-1-i], arr[j])

    return distance

This code is about 440 times faster than the initial one on my 6-core i5-9600KF processor. In fact, a quick theoretical analysis combined with profiling results shows that the algorithm is close to optimal (>75% of the computing power of my processor is used)! If this is not enough, you should consider using the single-precision (float32) implementation. If this is still not enough, you should then also consider writing an optimized GPU code for that (or simply reconsider the need to compute such a huge distance matrix).
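For completeness, here is one way the function could be fed the tf-idf matrix from the question. Converting the sparse matrix to a dense, C-contiguous float32 array (the single-precision path) is my assumption, not part of the answer, and the dense array alone needs roughly 9069 * 22210 * 4 bytes ≈ 0.8 GB of memory:

import numpy as np

# X is the sparse tf-idf matrix from the question (shape (9069, 22210)).
# Densify and cast to float32 so the single-precision Numba signature is used.
arrX = np.ascontiguousarray(X.toarray(), dtype=np.float32)

distMatrix = brayCurtisDistMatrix(arrX)   # shape (9069, 9069)
np.savetxt("dist.csv", distMatrix, delimiter=",")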

Jérôme Richard
  • That's indeed a good explanation! What's the point of the assert in the `brayCurtisDistMatrix` function? I commented it out and tried with a simple matrix, and it seems the code works on my machine. I'm now going to feed it a large one and see how fast it runs. – Mahdi Oct 30 '22 at 14:29
  • These assertions are there just to avoid crashes when the matrix is not square (i.e. to make the code safer). Note that a Numba crash stops the Python process with no exception or fallback code, so it can be dangerous in a large application. Such assertions can also sometimes help the compiler produce faster code. – Jérôme Richard Oct 30 '22 at 15:24
  • Ok, I see the point about assertions, but here I think it is not necessary to check for a square matrix, because the input matrix may have a different number of rows and columns. Or am I wrong about that? – Mahdi Oct 30 '22 at 20:16
  • Ha, you are right, it can also work for non-square matrices. The assertion is not needed in the end. – Jérôme Richard Oct 30 '22 at 23:02
-1

The individual elements of the NumPy multidimensional array you pass in as input can be stored in memory in one of two ways: row-major order or column-major order. Each has its advantages and disadvantages, and you can even control the way it is stored (see the sketch below). I hope you find this helpful.
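As an illustration of that last point only, here is a minimal sketch of how NumPy exposes the choice of layout; as the comment below explains, this is not what makes the code in the question slow:

import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)  # row-major ('C' order) by default
b = np.asfortranarray(a)                          # column-major ('F' order) copy
print(a.flags['C_CONTIGUOUS'], b.flags['F_CONTIGUOUS'])  # True True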

  • No. Calls to `distance.braycurtis` take more than 99% of the time, and the arrays are accessed contiguously here. Thus, the access pattern of the Numpy arrays is definitely not the problem. – Jérôme Richard Oct 30 '22 at 12:53