Why can't I calculate the distance between two NumPy arrays?

Question

1. I have two NumPy arrays, data_test and data_train:

    data_partial_test = data_test[:2000,:]
    test_lable = label_test
    print(test_lable.shape)
    print(data_partial_test[0].shape)
    print(data_train[0].shape)
    dis = (( data_partial_test- data_train[:21000,])**2).sum(axis=1)

2. The shape of data_test is (21000, 784) and the shape of data_train is (2000, 784). When I run this code, it fails with: operands could not be broadcast together with shapes (2000,784) (21000,784)

What is your expected result here? Numpy will try to subtract the values element-wise, but since the shapes are different that is not possible. Maybe this can help: https://stackoverflow.com/questions/50758165/why-the-following-operands-could-not-be-broadcasted-together/50758844 — Shaido, Oct 06 '20 at 05:52
My expected result would be the distance between these two NumPy arrays — Jay Park, Oct 06 '20 at 06:04
I understand. A simple example to illustrate the problem here: if you have `a = [1,2,3,4,5]` and `b = [1,2]` and then try to take `a-b`, it won't work (which element should be subtracted from which in this case?). If the shapes are the same, then element-wise subtraction works, for example: `a=[1,2]`, `b=[1,2]`, `a-b=[0,0]`. — Shaido, Oct 06 '20 at 06:22

score 0 · Accepted Answer · answered Oct 06 '20 at 06:55

When you subtract arrays, e.g. arr_1 - arr_2, NumPy actually attempts:

to subtract elements of row 0 in arr_2 from corresponding elements of arr_1 (also in row 0),
the same for row 1,
and so on, up to the end of both arrays.

This scheme works as long as both arrays have the same number of rows and columns.

There are 3 exceptions to this rule:

One of the arrays involved can have a single row. Then this row is broadcast (repeated) so that the array conceptually has as many rows as needed.
One of the arrays involved can have a single column. Then the broadcasting mentioned above takes place along columns.
One of the operands is a single value. Then it is "expanded" to an array with the same number of rows and columns as the other operand (array).
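
A minimal sketch of these three cases, using small made-up arrays (the names a, row and col are only for illustration and do not come from the question):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # shape (2, 3): [[0, 1, 2], [3, 4, 5]]

row = np.array([[10, 20, 30]])    # shape (1, 3): a single row
col = np.array([[100], [200]])    # shape (2, 1): a single column

by_row = a - row      # the single row is repeated for every row of a
by_col = a - col      # the single column is repeated for every column of a
by_scalar = a - 5     # the single value is expanded to the shape of a

print(by_row.shape, by_col.shape, by_scalar.shape)
```

In all three cases the result has the shape of a, i.e. (2, 3).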

Read about broadcasting in NumPy for a more detailed view of this.

In your case none of the above situations applies. Both arrays have the same number of columns, but the number of rows is different. As a consequence, the broadcasting described above cannot be performed and the whole operation fails.
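
The failure can be reproduced with much smaller arrays. The sketch below uses random stand-in data (test and train are hypothetical names, with 4 and 6 rows instead of 2000 and 21000). It also shows one alternative that is not part of the approach described in this answer: if what you want is the distance from every row of one array to every row of the other (an assumption, the question does not say so), inserting a length-1 axis into each array makes the shapes broadcastable.

```python
import numpy as np

rng = np.random.default_rng(0)
test = rng.random((4, 3))     # stand-in for the (2000, 784) array
train = rng.random((6, 3))    # stand-in for the (21000, 784) array

# Same number of columns, different number of rows: broadcasting fails.
try:
    test - train
    message = ""
except ValueError as err:
    message = str(err)
print(message)

# Assumed goal: all pairwise Euclidean distances.
# Shapes (4, 1, 3) and (1, 6, 3) broadcast to (4, 6, 3).
diff = test[:, np.newaxis, :] - train[np.newaxis, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))    # shape (4, 6)
print(dist.shape)
```

Here dist[i, j] is the distance between row i of test and row j of train. Note that at the sizes from the question the intermediate array would hold 2000 × 21000 × 784 elements, which will not fit in memory on a typical machine, so the computation would have to be done in chunks.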

Possible solution

Maybe each row in the first array (the one with fewer rows) can be "paired up" with a row in the second array, e.g. based on some key field. Such an operation can be performed in Pandas. See the join method in Pandas.

Then you can:

convert both NumPy arrays to Pandas DataFrames,
perform a join on these DataFrames (based on a common key, usually set as the index in each DataFrame),
compute differences between the corresponding pairs of columns.

Then you can:

square these differences,
sum them up,
and finally take the square root of the sum, which gives the distance you want.
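
A rough sketch of these steps, under assumptions: the data, the key labels ("a" to "d"), the column names (f0, f1, f2) and the suffixes are all made up for illustration, since the question does not say how rows should be paired. The distance is computed per paired row here.

```python
import numpy as np
import pandas as pd

cols = ["f0", "f1", "f2"]    # hypothetical feature names

# Hypothetical data: the index plays the role of the common key.
small = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                     index=["b", "d"], columns=cols)
large = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                     index=["a", "b", "c", "d"], columns=cols)

# Join on the index; suffixes tell the two sets of columns apart.
paired = small.join(large, how="inner", lsuffix="_s", rsuffix="_l")

# Differences between corresponding pairs of columns.
diff = (paired[[c + "_s" for c in cols]].to_numpy()
        - paired[[c + "_l" for c in cols]].to_numpy())

# Square, sum along each row, take the square root.
dist = np.sqrt((diff ** 2).sum(axis=1))
print(pd.Series(dist, index=paired.index))
```

Only the keys present in both DataFrames ("b" and "d" here) survive the inner join, so dist has one entry per paired row.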
