
My Task:

I'm trying to calculate the pairwise distance between every two samples in two big tensors (for k-Nearest-Neighbours). That is, given a tensor test with shape (b1,c,h,w) and a tensor train with shape (b2,c,h,w), I need || test[i] - train[j] || for every i, j (where both test[i] and train[j] have shape (c,h,w), as those are samples in the batch).

The Problem

Both train and test are very big, so I can't fit them into RAM.

My current solution

For a start, I did not construct these tensors in one go. As I build them, I split the data tensor and save the chunks to disk separately, so I end up with files {Test\test_1,...,Test\test_n} and {Train\train_1,...,Train\train_m}. Then, in a nested for loop, I load every Test\test_i and Train\train_j, calculate the current distances, and save them.

This semi-pseudo-code might explain it:

import torch

test_files = [f'Test\\test_{i}' for i in range(1, n + 1)]     # '\\' so '\t' is not parsed as a tab
train_files = [f'Train\\train_{j}' for j in range(1, m + 1)]
# pairwise L2 distances between two batches, flattened to (b, c*h*w)
dist = lambda t1, t2: torch.cdist(t1.flatten(1), t2.flatten(1))

all_distances = []
for test_file in test_files:
    test_i = torch.load(test_file)            # shape (b1_i, c, h, w)
    dist_of_i_from_all_j = []
    for train_file in train_files:
        train_j = torch.load(train_file)      # shape (b2_j, c, h, w)
        dist_of_i_from_all_j.append(dist(test_i, train_j))
    all_distances.append(torch.cat(dist_of_i_from_all_j, dim=1))  # shape (b1_i, b2)
# and now I can take the k-smallest from all_distances
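For the last step, torch.topk with largest=False could pick the k smallest distances per test sample - a minimal sketch, assuming the stacked matrix fits in RAM and k is already defined:

dists = torch.cat(all_distances, dim=0)   # shape (b1, b2)
# distances and indices of the k nearest train samples per test sample
knn_dists, knn_idx = torch.topk(dists, k, dim=1, largest=False)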

What I thought might work

I came across the FAISS repository, in which they explain that this process can be sped up (maybe?) using their solutions, though I'm not quite sure how. Regardless, any approach would help!

Hadar

2 Answers


Did you check the FAISS documentation?

If what you need is the L2 norm (torch.cdist uses p=2 as its default parameter), then it is quite straightforward. The code below is an adaptation of the FAISS docs to your example, with the train set as the database and the test set as the queries:

import faiss
import numpy as np

d = 64                           # dimension
nb = 100000                      # database size (train)
nq = 10000                       # number of queries (test)
np.random.seed(1234)             # make reproducible
x_train = np.random.random((nb, d)).astype('float32')
x_train[:, 0] += np.arange(nb) / 1000.
x_test = np.random.random((nq, d)).astype('float32')
x_test[:, 0] += np.arange(nq) / 1000.

index = faiss.IndexFlatL2(d)     # build the index
print(index.is_trained)
index.add(x_train)               # add the train vectors to the index
print(index.ntotal)

k = 100                          # take the 100 closest neighbors
D, I = index.search(x_test, k)   # actual search
print(I[:5])                     # neighbors of the 5 first queries
print(I[-5:])                    # neighbors of the 5 last queries
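Since the whole dataset does not fit in RAM, note that index.add can be called repeatedly, so the chunk files from the question could be indexed one at a time - a sketch, assuming the train_files list from the question and d = c*h*w after flattening:

import torch

index = faiss.IndexFlatL2(d)                  # d = c*h*w after flattening
for train_file in train_files:                # the saved chunks from the question
    chunk = torch.load(train_file)            # shape (b2_j, c, h, w)
    index.add(chunk.flatten(1).numpy().astype('float32'))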
CAPSLOCK
  • Thank you for your answer. Am I obliged to have a 2D tensor, as you've specified for x_test and x_train? Because I'm using a tensor with 4 dimensions – Hadar Jun 29 '22 at 09:17
  • @Hadar what is in the other dimensions? Could you clarify what the shape of the tensor is, or what these dimensions represent? – CAPSLOCK Jun 29 '22 at 09:57
  • The tensor represents activation maps of ResNet, given input from CIFAR10. I'm using multiple activations so the dimensions might vary, but one of them has shape (b,c,h,w)=(60000,512,7,7) - this is the feature map of the last non-fully-connected layer, with the train data propagated through the network – Hadar Jun 29 '22 at 10:02
  • @Hadar Sorry, my question was not really apt. What I am trying to understand is: do you need to use dimensions h and w when computing your distances, or should they be handled separately? In the example I adapted, we have nb observations (your b), each with 64 dimensions (features, attributes, characteristics) (your c), and we use only those 64 dimensions to compute the similarity of each observation with the others. So, how would you use h and w in this context? – CAPSLOCK Jun 29 '22 at 10:08
  • So theoretically, given two tensors `t1` and `t2` with shapes `(c,h,w)`, the distance between them should be a scalar. Initially, I flattened these tensors to have shape `(c * h * w)`, but this might not be the best option (perhaps you have ideas on how to measure the distance between multi-dimensional tensors in a more logical way, other than flattening) – Hadar Jun 29 '22 at 10:13
  • @Hadar that's why I asked. You can flatten in the same way before feeding the tensor to FAISS. But I think it would be more appropriate to use an [Lp-norm](https://en.wikipedia.org/wiki/Lp_space#The_p-norm_in_finite_dimensions) (with p = 4). FAISS has a method to compute the Lp metric (a sketch follows these comments): https://github.com/facebookresearch/faiss/wiki/MetricType-and-distances Unfortunately I don't feel knowledgeable enough to give you a strong suggestion on what the best approach is here. – CAPSLOCK Jun 29 '22 at 10:20
  • I will have a look. thank you for your answer – Hadar Jun 29 '22 at 10:23
  • @Hadar I would appreciate if you let me know how you end up implementing it. Feel free to edit the answer with your findings – CAPSLOCK Jun 29 '22 at 11:17
  • I've added my version of the solution! – Hadar Jul 05 '22 at 14:51
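Regarding the Lp-norm suggestion in the comments: a sketch of what that could look like, assuming a FAISS build that exposes METRIC_Lp and the metric_arg field (both described on the MetricType wiki page linked above):

# hypothetical: a flat index with an Lp metric (p = 4), per the comments
index = faiss.IndexFlat(d, faiss.METRIC_Lp)
index.metric_arg = 4.0                 # the p of the Lp norm
index.add(x_train)                     # same flattened float32 vectors as above
D, I = index.search(x_test, k)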

Eventually, I chose to implement a version of the Earth Mover's Distance (EMD), as was suggested in an ai.StackExchange post. Let me summarize the approach:

Given the task as described in "My Task" above, I defined

def cumsum_3d(test, train):
    # cumulative sums over the last three dims (c, h, w), turning each
    # activation map into its discrete 3-D "CDF" for the EMD approximation
    for i in [-1, -2, -3]:
        test = torch.cumsum(test, i)
        train = torch.cumsum(train, i)
    return test, train

Then, given the tensors test and train:

test, train = cumsum_3d(test, train)
dist = torch.cdist(test.flatten(1), train.flatten(1))   # shape (b1, b2)
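The intuition comes from the 1-D case, where the EMD between two histograms of equal mass equals the L1 distance between their cumulative sums; a toy check of that identity (my version uses L2 over 3-D cumsums, so it is only an approximation):

import torch

# 1-D case: EMD between two equal-mass histograms equals the
# L1 distance between their cumulative sums
p = torch.tensor([0.5, 0.5, 0.0])
q = torch.tensor([0.0, 0.5, 0.5])
emd = torch.sum(torch.abs(torch.cumsum(p, 0) - torch.cumsum(q, 0)))
print(emd)  # tensor(1.): each 0.5 of mass moves one bin to the right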

For future viewers - bear in mind that:

  • I did not use FAISS because it does not currently support Windows, but most importantly because it does not support (as far as I know) this version of the EMD, or any other distance between multidimensional tensors (of shape (c,h,w), as in my example). To account for the RAM problem I used Google Colab and sliced my data into more files.
  • This implementation was only relevant because I was dealing with shallow activation layers. If I were to use the last layer (avgpool) as my activations, it would have been fine not to use the EMD, as the output right after the avgpool has shape (512,).
Hadar