
I have a large data matrix and I want to calculate the similarity matrix of that large matrix, but due to memory limitations I want to split the calculation.

Let's assume I have the following (for the example I have taken a smaller matrix):

data1 = data/np.linalg.norm(data,axis=1)[:,None]

(Pdb) data1
array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.04777415,  0.00091094,  0.01326067, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.01503281,  0.00655707, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.00418038,  0.00308079,  0.01893477, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.06883803,  0.        ,  0.0209448 , ...,  0.        ,
         0.        ,  0.        ]])

Then I try to do the following:

similarity_matrix[n1:n2,m1:m2] = np.einsum('ik,jk->ij', data1[n1:n2,:], data1[m1:m2,:])

n1, n2, m1, m2 were calculated as follows (df is a DataFrame):

data = df.values
m, k = data.shape
n1 = 0; n2 = m // 2; m1 = n2; m2 = m  # integer division; slice ends are exclusive, so m1 = n2 (n2+1 would skip a row)

But the error is:

(Pdb) similarity_matrix[n1:n2,m1:m2] = np.einsum('ik,jk->ij', data1[n1:n2,:], data1[m1:m2,:])
*** NameError: name 'similarity_matrix' is not defined
add-semi-colons

1 Answer


Didn't you do something like

similarity_matrix = np.empty((N,M),dtype=float)

at the start of your calculations?

You can't index an array, on either side of an assignment, before you create it.
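For instance, a minimal sketch (sizes and data are hypothetical, standing in for the question's normalized matrix) showing that the block assignment works once the array exists:

```python
import numpy as np

N = M = 6                      # illustrative sizes
data1 = np.random.rand(N, 4)   # stand-in for the normalized data

# Allocate the output first -- indexing similarity_matrix before
# this line is what raises the NameError.
similarity_matrix = np.empty((N, M), dtype=float)

n1, n2, m1, m2 = 0, 3, 3, 6
similarity_matrix[n1:n2, m1:m2] = np.einsum('ik,jk->ij',
                                            data1[n1:n2, :],
                                            data1[m1:m2, :])
```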

If that full (N,M) matrix is too big for memory, then just assign your einsum value to another variable, and work with that.

partial_matrix = np.einsum...

How you relate that partial_matrix to the virtual similarity_matrix is a different issue.
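One way to relate them, as a sketch: loop over tiles of the virtual matrix and keep only one partial block in memory at a time. The block size and the dict of tiles here are illustrative; in practice each tile could be written to disk (e.g. with np.save) instead of collected:

```python
import numpy as np

data1 = np.random.rand(10, 4)  # stand-in for the normalized data
m = data1.shape[0]
block = 5                      # illustrative block size

# Compute the similarity matrix one (block x block) tile at a time;
# only one tile's einsum result exists in memory per iteration.
tiles = {}
for n1 in range(0, m, block):
    for m1 in range(0, m, block):
        n2, m2 = n1 + block, m1 + block
        tiles[(n1, m1)] = np.einsum('ik,jk->ij',
                                    data1[n1:n2, :], data1[m1:m2, :])
```

Each tile `tiles[(n1, m1)]` corresponds to `similarity_matrix[n1:n2, m1:m2]` of the virtual full matrix.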

hpaulj