So I'm trying to do something very similar to these 2 posts, but there are a few differences: I don't want a csv file (so no csv module), and I want to get it done in Python, not R.
Convert Adjacency Matrix into Edgelist (csv file) for Cytoscape
Convert adjacency matrix to a csv file
Input:
AF001 AF002 AF003 AF004 AF005
AF001 1.000000 0.000000e+00 0.000000 0.0000 0
AF002 0.374449 1.000000e+00 0.000000 0.0000 0
AF003 0.000347 1.173926e-05 1.000000 0.0000 0
AF004 0.001030 1.494282e-07 0.174526 1.0000 0
AF005 0.001183 1.216664e-06 0.238497 0.7557 1
Output:
{('AF002', 'AF003'): 1.17392596672424e-05, ('AF004', 'AF005'): 0.75570008659397792, ('AF001', 'AF002'): 0.374449352805868, ('AF001', 'AF003'): 0.00034743953114899502, ('AF002', 'AF005'): 1.2166642639889999e-06, ('AF002', 'AF004'): 1.49428208843456e-07, ('AF003', 'AF004'): 0.17452569907144502, ('AF001', 'AF004'): 0.00103026903356954, ('AF003', 'AF005'): 0.238497202355299, ('AF001', 'AF005'): 0.0011830950375467401}
I have a nonredundant correlation matrix DF_sCorr which has been processed from a redundant matrix using np.tril (code courteously provided by @jezrael).
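For context, here is a minimal sketch of what np.tril does to a redundant (symmetric) matrix. The 3x3 values and the names A_full and A_lower are made up for illustration and are not taken from the real data.

```python
import numpy as np

# Hypothetical symmetric correlation matrix (illustrative values only).
A_full = np.array([[1.0, 0.5, 0.2],
                   [0.5, 1.0, 0.3],
                   [0.2, 0.3, 1.0]])

# np.tril keeps the diagonal and everything below it,
# and sets every entry above the diagonal to zero.
A_lower = np.tril(A_full)
print(A_lower)
```

Each pair of samples then appears only once, below the diagonal, which is the layout DF_sCorr has.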
I want to collapse it into a dictionary where the key is a sorted tuple of the two samples (i.e. key = tuple(sorted([row_sample, col_sample]))) and the value is their correlation value.
I wrote an example function below, sif_format, which generates a dictionary analogous to a sif format (3-column table in the format sample_x interaction_value sample_y), but it is taking a very long time.
I thought the best way to organize this type of table would be a dictionary. I feel like there is a much more efficient way to do this, possibly by only processing boolean values? The real dataset I'm using is ~7000x7000.
I'm not sure if there is a function within numpy, pandas, scipy, or networkx that can do this type of processing efficiently.
import pandas as pd
import numpy as np
A_sCorr = np.array([[0.999999999999999, 0.0, 0.0, 0.0, 0.0], [0.374449352805868, 1.0, 0.0, 0.0, 0.0], [0.00034743953114899502, 1.17392596672424e-05, 1.0, 0.0, 0.0], [0.00103026903356954, 1.49428208843456e-07, 0.17452569907144502, 1.0, 0.0], [0.0011830950375467401, 1.2166642639889999e-06, 0.238497202355299, 0.75570008659397792, 1.0]])
sampleLabels = ['AF001', 'AF002', 'AF003', 'AF004', 'AF005']
DF_sCorr = pd.DataFrame(A_sCorr,columns=sampleLabels, index=sampleLabels)
#AF001 AF002 AF003 AF004 AF005
#AF001 1.000000 0.000000e+00 0.000000 0.0000 0
#AF002 0.374449 1.000000e+00 0.000000 0.0000 0
#AF003 0.000347 1.173926e-05 1.000000 0.0000 0
#AF004 0.001030 1.494282e-07 0.174526 1.0000 0
#AF005 0.001183 1.216664e-06 0.238497 0.7557 1
def sif_format(DF_var):
    D_interaction_corr = {}
    n,m = DF_var.shape
    for i in range(n):
        row_sample = DF_var.index[i]
        for j in range(m):
            col_sample = DF_var.columns[j]
            if row_sample != col_sample:
                D_interaction_corr[tuple(sorted([row_sample,col_sample]))] = DF_var.iloc[i,j]
            if j==i:
                break
    return(D_interaction_corr)
D_interaction_corr = sif_format(DF_sCorr)
{('AF002', 'AF003'): 1.17392596672424e-05, ('AF004', 'AF005'): 0.75570008659397792, ('AF001', 'AF002'): 0.374449352805868, ('AF001', 'AF003'): 0.00034743953114899502, ('AF002', 'AF005'): 1.2166642639889999e-06, ('AF002', 'AF004'): 1.49428208843456e-07, ('AF003', 'AF004'): 0.17452569907144502, ('AF001', 'AF004'): 0.00103026903356954, ('AF003', 'AF005'): 0.238497202355299, ('AF001', 'AF005'): 0.0011830950375467401}
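One possible way to avoid the nested loop and the per-cell .iloc lookups is to pull the below-diagonal positions out in one step with np.tril_indices. The following is only a sketch, not benchmarked on a 7000x7000 matrix: it assumes the frame is square with the values of interest strictly below the diagonal, and the name sif_format_vectorized is hypothetical. It uses a 3-sample subset of the example data.

```python
import numpy as np
import pandas as pd

def sif_format_vectorized(df):
    """Hypothetical helper: map sorted (sample, sample) tuples to the
    values strictly below the diagonal of a square DataFrame."""
    # Row/column positions strictly below the diagonal (k=-1 skips the diagonal).
    rows, cols = np.tril_indices(df.shape[0], k=-1)
    values = df.to_numpy()[rows, cols]
    row_labels = df.index.to_numpy()[rows]
    col_labels = df.columns.to_numpy()[cols]
    # Build the dictionary in a single pass over the selected cells.
    return {
        tuple(sorted((r, c))): v
        for r, c, v in zip(row_labels, col_labels, values.tolist())
    }

labels = ["AF001", "AF002", "AF003"]
df = pd.DataFrame(
    [[1.0, 0.0, 0.0],
     [0.374449352805868, 1.0, 0.0],
     [0.00034743953114899502, 1.17392596672424e-05, 1.0]],
    index=labels, columns=labels,
)
result = sif_format_vectorized(df)
print(result)
```

The dictionary comprehension still touches every pair once, so building a Python dict for ~7000x7000 samples will remain memory-heavy; the sketch only removes the repeated DataFrame indexing.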
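DataFrame.to_dict() won't work for this, since it returns a nested dictionary keyed by column and then row, including the diagonal and the zeroed upper triangle: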
DF_sCorr.to_dict()
{'AF002': {'AF002': 1.0, 'AF003': 1.17392596672424e-05, 'AF001': 0.0, 'AF004': 1.49428208843456e-07, 'AF005': 1.2166642639889999e-06}, 'AF003': {'AF002': 0.0, 'AF003': 1.0, 'AF001': 0.0, 'AF004': 0.17452569907144502, 'AF005': 0.238497202355299}, 'AF001': {'AF002': 0.374449352805868, 'AF003': 0.00034743953114899502, 'AF001': 0.999999999999999, 'AF004': 0.00103026903356954, 'AF005': 0.0011830950375467401}, 'AF004': {'AF002': 0.0, 'AF003': 0.0, 'AF001': 0.0, 'AF004': 1.0, 'AF005': 0.75570008659397792}, 'AF005': {'AF002': 0.0, 'AF003': 0.0, 'AF001': 0.0, 'AF004': 0.0, 'AF005': 1.0}}
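For comparison, a Series with a two-level index does convert to tuple keys, so one option might be to mask the frame, stack it, and then call to_dict() on the result. This is a sketch under the assumption that only the cells strictly below the diagonal are wanted; the 3x3 values and the names below_diag, pairs, and result are illustrative only.

```python
import numpy as np
import pandas as pd

labels = ["AF001", "AF002", "AF003"]
# Illustrative lower-triangular values, not the real correlations.
df = pd.DataFrame(
    [[1.0, 0.0, 0.0],
     [0.5, 1.0, 0.0],
     [0.2, 0.3, 1.0]],
    index=labels, columns=labels,
)

# Boolean mask that is True strictly below the diagonal.
below_diag = np.tril(np.ones(df.shape, dtype=bool), k=-1)

# where() turns masked-out cells into NaN; stack() yields a Series indexed by
# (row, column) pairs, and dropna() removes any masked-out cells that remain.
pairs = df.where(below_diag).stack().dropna()

# Series.to_dict() on a two-level index gives tuple keys; sort each key
# so that the order of the two samples is consistent.
result = {tuple(sorted(k)): v for k, v in pairs.to_dict().items()}
print(result)
```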