I have a pandas dataframe that looks like this

Code to reproduce -

import pandas as pd
df = pd.DataFrame([['sample_1', 'sample_2', 0.2],
                   ['sample_1', 'sample_3', 0.5],
                   ['sample_2', 'sample_4', 0.8]],
                  columns=['SampleA', 'SampleB', 'Num_Differences'])


# make unique, sorted, common index
idx = sorted(set(df['SampleA']).union(df['SampleB']))

# reshape
(df.pivot(index='SampleA', columns='SampleB', values='Num_Differences')
   .reindex(index=idx, columns=idx)
   .fillna(0, downcast='infer')
   .pipe(lambda x: x+x.values.T)
 )

I would like to convert it to a 2D array (an array of arrays) like this. That array would be the variable called dis_matrix in the multidimensional scaling code below.

[[0     0.2    0.5      0]
 [0.2    0      0     0.8]
 [0.5    0      0       0]
 [0      0.8    0       0]]

How can I get such a 2D array from the pivoted dataframe above?

My end goal is to apply the MDS code below.

from sklearn import manifold
import matplotlib.pyplot as plt

mds_model = manifold.MDS(n_components = 2, random_state = 123,
    dissimilarity = 'precomputed')
mds_fit = mds_model.fit(dis_matrix)
mds_coords = mds_model.fit_transform(dis_matrix)
                                                                                                                                  
food_names = ['sample 1', 'sample 2', 'sample 3', 'sample 4']
plt.figure()
plt.scatter(mds_coords[:,0],mds_coords[:,1],
    facecolors = 'none', edgecolors = 'none')  # points in white (invisible)
labels = food_names
for label, x, y in zip(labels, mds_coords[:,0], mds_coords[:,1]):
    plt.annotate(label, (x,y), xycoords = 'data')
plt.xlabel('First Dimension')
plt.ylabel('Second Dimension')
plt.title('Dissimilarity among food items')    
plt.show()
nerd
    Just access the `values` attribute after the reshape, you will get a 2D numpy array; that's your expected outcome. – ThePyGuy Jul 18 '22 at 08:01
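
As a minimal sketch of what this comment suggests: the reshaped frame can be stored in a variable and converted with `to_numpy()` (or the `values` attribute). The name `pivoted` is an assumption for illustration and does not appear in the original post, and the `downcast` argument is left out of `fillna` here as a choice of this sketch.

```python
import pandas as pd

df = pd.DataFrame([['sample_1', 'sample_2', 0.2],
                   ['sample_1', 'sample_3', 0.5],
                   ['sample_2', 'sample_4', 0.8]],
                  columns=['SampleA', 'SampleB', 'Num_Differences'])

# unique, sorted labels shared by rows and columns
idx = sorted(set(df['SampleA']).union(df['SampleB']))

# square frame with 0 where no distance was given ("pivoted" is a hypothetical name)
pivoted = (df.pivot(index='SampleA', columns='SampleB', values='Num_Differences')
             .reindex(index=idx, columns=idx)
             .fillna(0))

# mirror the upper triangle onto the lower one and keep a plain 2D numpy array
dis_matrix = pivoted.to_numpy() + pivoted.to_numpy().T
print(dis_matrix)
```

The resulting `dis_matrix` is a numeric 2D array, which is the kind of input a precomputed-dissimilarity MDS call expects; `dis_matrix.tolist()` would give a nested Python list if that form is preferred.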

1 Answer

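# df is assumed to be the pivoted, symmetric dataframe from the question
df = df.to_numpy().tolist()        # nested Python list of floats
df = ''.join(str(df).split(','))   # string form of the list with the commas removed
print(df)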

Output

[[0.0 0.2 0.5 0.0] [0.2 0.0 0.0 0.8] [0.5 0.0 0.0 0.0] [0.0 0.8 0.0 0.0]]

nerd