Problem: I have a vector of shape approximately [350000, 1] and I wish to calculate the pairwise distance matrix. This results in a [350000, 350000] matrix of integer datatype that does not fit into RAM. I eventually want to end up with a boolean matrix (which does fit into RAM), so I am currently doing this one element at a time, but that is not very time efficient.

Edit: Standard sklearn and scipy functions do not work because of the size of the data -- but if I can somehow chunk the computation to use the hard disk, I should be able to use them.

Problem Visualised: [a_1, a_2, a_3]^t -> [[a_1 - a_1, a_1 - a_2, a_1 - a_3], [a_2 - a_1, a_2 - a_2, a_2 - a_3], [a_3 - a_1, a_3 - a_2, a_3 - a_3]]

Note that only the upper triangle needs to be calculated, as the matrix is symmetric once absolute values are taken.
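(For a small example, scipy.spatial.distance.pdist computes exactly this condensed upper triangle -- it still cannot handle 350000 points in RAM, but it shows the symmetry I mean:)

import numpy as np
from scipy.spatial.distance import pdist, squareform

a = np.array([[1], [4], [2]])
# cityblock on a single column is |a_i - a_j|, returned only for i < j
condensed = pdist(a, metric="cityblock")  # [3. 1. 2.]
full = squareform(condensed)              # symmetric [3, 3] matrix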

Vectorised Code that Needs Chunking or an Alternative Solution: I have found a way to compute the distance (subtraction) between all points using broadcasting that works on a small matrix, but I need a way to do this on larger matrices without hitting RAM limits.

Alternatively, could a quicker approach than the MWE below be suggested?

# broadcasting an [n, 1] vector against itself gives an [n, n, 1] difference array;
# [:, :, 0] drops the trailing axis
distMatrix = np.absolute((points[np.newaxis, :, :] - points[:, np.newaxis, :])[:, :, 0])

Other Attempts: I have tried using dask and memmap but still get memory errors, so I must be doing something wrong. I have also tried memmap with manual chunking of the data, but do not obtain a full set of results, so any help would be most appreciated.
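For illustration, a minimal sketch of the kind of disk-backed chunking I mean (the filename and chunk size are placeholders, and the demo size is shrunk so it actually runs):

import numpy as np

points = np.random.randint(1, 5, size=(5000, 1))  # small stand-in for the real vector
n = len(points)
chunk = 1000  # placeholder; tune to available RAM

# disk-backed output so the full [n, n] matrix never sits in RAM
# (int8 is enough here because the demo values are small)
out = np.memmap("distMatrix.dat", dtype=np.int8, mode="w+", shape=(n, n))
col = points[:, 0]
for start in range(0, n, chunk):
    stop = min(start + chunk, n)
    # broadcast a block of rows against the full vector, one chunk at a time
    out[start:stop] = np.abs(col[start:stop, None] - col[None, :])
out.flush()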

MWE of Current Method:


## Data ##
# Note: the datatype and values are illustrative. The first column is time,
# the second and third are x and y coordinates. I want to take the first
# column and build a distance matrix with itself by subtraction, and then
# take the 2nd and 3rd columns and build a Euclidean distance matrix.

import math

import numpy as np
from tqdm import trange

data = np.random.randint(1, 5, size=[350001, 3])
idxT, idxX, idxY = 0, 1, 2  # column indices: time, x, y
minTime = 3
maxTime = 4
minDist = 1
maxDist = 2

### CODE ###
n = len(data)

for i in trange(n):
    for j in range(i + 1, n):
        # Within time threshold?
        if minTime <= (data[j][idxT] - data[i][idxT]) <= maxTime:
            # Within distance threshold?
            xD = math.pow(data[j][idxX] - data[i][idxX], 2)
            yD = math.pow(data[j][idxY] - data[i][idxY], 2)
            d = math.sqrt(xD + yD)
            # If within distance threshold then
            if minDist <= d <= maxDist:
                pass  # DO SOMETHING

Reason: I have time, x_coordinate and y_coordinate vectors for approximately 350000 points. I want to calculate the difference between all time points (simple subtraction) and the Euclidean distance between each (x, y) point. I then want to identify all pairs of points that are within both a time threshold and a distance threshold of each other, producing a boolean result.
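To make the target concrete, here is a rough vectorised sketch of that computation, processed in row chunks so only a [chunk, n] block is ever in memory (the chunk size and small random data are placeholders):

import numpy as np

data = np.random.randint(1, 5, size=(5000, 3)).astype(float)  # small stand-in
idxT, idxX, idxY = 0, 1, 2
minTime, maxTime, minDist, maxDist = 3, 4, 1, 2
chunk = 1000
n = len(data)

for start in range(0, n, chunk):
    stop = min(start + chunk, n)
    # signed time differences data[j] - data[i] for this block of rows i
    dt = data[None, :, idxT] - data[start:stop, None, idxT]
    # Euclidean distances for the same block
    dx = data[None, :, idxX] - data[start:stop, None, idxX]
    dy = data[None, :, idxY] - data[start:stop, None, idxY]
    d = np.hypot(dx, dy)
    mask = (minTime <= dt) & (dt <= maxTime) & (minDist <= d) & (d <= maxDist)
    rows, cols = np.nonzero(mask)
    pairs = np.column_stack([rows + start, cols])  # global (i, j) index pairs
    # (the MWE loop only visits j > i; filter with cols > rows + start if needed)
    # DO SOMETHING with pairs for this chunk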


1 Answer


You can split your array into smaller ones and calculate the distances for each pair of chunks separately.

import numpy as np
import scipy.spatial.distance

splits = np.array_split(data, 10)
for i in range(len(splits)):
    for j in range(i, len(splits)):  # j >= i is enough, since the matrix is symmetric
        m = scipy.spatial.distance.cdist(splits[i], splits[j])
        # do something with m

As most of the computation happens inside scipy, the overhead of the Python loops will be minimal.

If your boolean array fits into memory and you are trying to find values within a certain range, you can do

import numpy as np
import scipy.spatial.distance


boolean = np.zeros((350, 350), dtype=np.bool_)
a = np.random.randn(350, 2)
splits = np.array_split(a, 10)
# 350 divides evenly into 10 chunks, so every split has the same length;
# with unequal splits the offsets below would need to be accumulated instead
shift = splits[0].shape[0]
minDist = -0.5
maxDist = +0.5
for i in range(len(splits)):
    for j in range(i, len(splits)):
        m = scipy.spatial.distance.cdist(splits[i], splits[j])
        masked = (minDist <= m) & (m <= maxDist)
        # write the block and its mirror image, since the matrix is symmetric
        boolean[i * shift : (i + 1) * shift, j * shift : (j + 1) * shift] = masked
        boolean[j * shift : (j + 1) * shift, i * shift : (i + 1) * shift] = masked.T
  • @v-ayrat How can I join the m matrices together so that the result is the same as if the data were not split, please? – Daniel J May 23 '20 at 11:02
  • @DanielJ I updated the answer. Is it what you need? – V. Ayrat May 23 '20 at 11:12
  • Not exactly, but more than enough for me to get what I need to get me on my way. Cheers! – Daniel J May 23 '20 at 11:31
  • @v-ayrat When I run this with a vector of 350,000 it is incredibly slow - I calculate about 90 hours at this rate, and this is just for one boolean, not including my other operations, which in total currently take 20 hours. I have also played with adjusting the size of the chunks. Is there any way to speed it up? It would be quicker to iterate through each individually :( – Daniel J May 23 '20 at 13:41
  • I doubt that this is computation time. Depending on your system it should take at most half an hour. You can try commenting out the lines with `boolean = ...` and `boolean[j * ...`, meaning that you only do the calculations and do not store the values in memory. If that is much faster, then the problem is memory. Even a boolean matrix of that size is about 100 GB, so it is not held in your RAM; it is constantly swapping during your calculations. In this case you have 2 options. First: close all side programs and hope this array fits in your RAM - I doubt this will help. Second: do your further calculations on each – V. Ayrat May 23 '20 at 14:11
  • chunk without storing all the data. – V. Ayrat May 23 '20 at 14:11
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/214480/discussion-between-daniel-j-and-v-ayrat). – Daniel J May 23 '20 at 14:15
  • The problem was a custom distance function that was slowing down the calculation. – Daniel J May 23 '20 at 16:51