I have a 4 x 4 numpy ndarray called 'sim' that represents the similarity values between 4 items (a, b, c, d).

array([[ 1.        ,  0.        ,  0.5547002 ,  0.73960026],
       [ 0.        ,  1.        ,  0.        ,  0.66666667],
       [ 0.5547002 ,  0.        ,  1.        ,  0.33333333],
       [ 0.73960026,  0.66666667,  0.33333333,  1.        ]])

dataset_u is a list that contains [a,b,c,d]. The following code sorts the array and then identifies the top-3 items (related_count) for each of the items a, b, c, d based on their similarity values.

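import numpy as np
import pandas as pd

related_count = 3
dataidx = np.asarray(dataset_u)  # item labels: a, b, c, d
# Column indices of each row, ordered by descending similarity
indices = np.argsort(-sim, axis=1)
# Prepend each item's own label to its row of ranked labels
result = np.hstack((dataidx[:, None], dataidx[indices]))
m1 = result.shape[0]
# Keep column 0, drop the entry where an item is matched with itself
mask = np.c_[[True] * m1, result[:, 1:] != result[:, 0, None]]
final_mat = result[mask].reshape(m1, -1)
# Note: 1:related_count selects related_count - 1 columns
dfdownload = pd.DataFrame(final_mat[:, 1:related_count], index=final_mat[:, 0])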

How can I modify the above code so that it only considers values >= 0.5 before sorting the array? For example, for item 'a' the expected related items are 'd' and 'c', whereas for item 'b' the only related item is 'd' (0.66666667).

0xc0de
kitchenprinzessin

1 Answer

I'm pretty new to both numpy and pandas, so this is probably not the best way to do this. I just hope it leads you to a better solution.

sim_copy = sim.copy()
sim_copy[sim_copy < 0.5] = 0  # keep values >= 0.5, as the question asks
bool_sim = np.asarray(sim_copy, dtype=bool)
dfdownload.mask(~bool_sim[:, :-1])
# -1 can be replaced with related_count, but its value seems wrong.

Output

     0    1    2
a    d  NaN    b
b  NaN    a  NaN
c    a  NaN    d
d    a    b  NaN

On a side note, related_count should probably be 4 instead of 3, because the slice 1:related_count returns related_count - 1 columns, but again I'm not very sure of that :).
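The mask above is indexed by the original column positions of sim, while dfdownload holds labels that have already been reordered by similarity, so the two may not line up. One possible alternative, as a rough sketch only: sort first, look up the similarity of each ranked item, and blank out anything below the cutoff. The names scores, order, top_scores, threshold and dfrelated are made up for this example, and the 0.5 cutoff is taken from the question.

```python
import numpy as np
import pandas as pd

sim = np.array([
    [1.0,        0.0,        0.5547002,  0.73960026],
    [0.0,        1.0,        0.0,        0.66666667],
    [0.5547002,  0.0,        1.0,        0.33333333],
    [0.73960026, 0.66666667, 0.33333333, 1.0],
])
dataset_u = ["a", "b", "c", "d"]
related_count = 3
threshold = 0.5  # assumed cutoff, taken from the question

# Object dtype so that labels and NaN can share one array.
labels = np.asarray(dataset_u, dtype=object)

# Work on a copy and push each item's similarity to itself to the bottom.
scores = sim.astype(float)  # astype returns a copy by default
np.fill_diagonal(scores, -np.inf)

# Rank columns by descending similarity and keep the first related_count.
order = np.argsort(-scores, axis=1)[:, :related_count]
top_scores = np.take_along_axis(scores, order, axis=1)

# Keep a label only where its similarity reaches the threshold.
related = np.where(top_scores >= threshold, labels[order], np.nan)

dfrelated = pd.DataFrame(related, index=labels)
print(dfrelated)
```

With the matrix from the question this should leave 'd', 'c' for item a and only 'd' for item b, with the remaining cells set to NaN. If a ragged result is preferred over NaN padding, each row could be filtered with dropna() afterwards.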

0xc0de