I am working on code that calculates the distance between every pair of strings in a row. The code works well. However, my problem now is filtering the results. For example, I have the following resulting data frame:

 nodeA    nodeB   distance_score
  0        0            0
  0        1            95
  0        2           105
  1        0            95
  1        1             0
  1        2            128
    ........

I want to keep only one row for each pair of nodes; for example, for the pairs (0,1) and (1,0), one entry is enough. Based on my experience in MATLAB, I could do this with two nested loops: store the elements of each loop in an array, check whether a pair is already in these arrays, and remove it if so. But I don't think this is the optimal way to do it in Python, since I have huge data files and doing so would be very expensive.

MsCurious

2 Answers

Using np.sort and drop_duplicates

import numpy as np
import pandas as pd

a = df.values.copy()
a[:, :2] = np.sort(a[:, :2], 1)  # order each (nodeA, nodeB) pair within its row
pd.DataFrame(a, columns=df.columns).drop_duplicates()

Using np.unique with the return_index parameter:

idx = np.unique(np.sort(df[['nodeA', 'nodeB']].values, 1), axis=0, return_index=True)[1]
df.iloc[idx]  # return_index gives positions, so index by position

For this example, both produce:

   nodeA  nodeB  distance_score
0      0      0               0
1      0      1              95
2      0      2             105
4      1      1               0
5      1      2             128

However, the first approach, while it will always return valid combinations, may return rows that differ from the original DataFrame. Here is an example:

df = pd.DataFrame({'nodeA': [2], 'nodeB': [0], 'distance_score': [100]})

   nodeA  nodeB  distance_score
0      2      0             100

When using np.sort:

a = df.values.copy()
a[:, :2] = np.sort(a[:, :2], 1)
pd.DataFrame(a, columns=df.columns).drop_duplicates()

   nodeA  nodeB  distance_score
0      0      2             100

When using np.unique:

idx = np.unique(np.sort(df[['nodeA', 'nodeB']].values, 1), axis=0, return_index=True)[1]
df.iloc[idx]  # return_index gives positions, so index by position

   nodeA  nodeB  distance_score
0      2      0             100

As you can see, the first approach flips the order of the combination in this case.
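If you want to keep the original rows untouched but still use the sorted pair as the key, one option is to build a boolean mask from the sorted pairs and filter the original DataFrame with it. This is only a sketch of that idea, not one of the two approaches above: the helper name drop_mirrored_pairs and its cols parameter are made up for illustration, and the sample data is the table from the question.

```python
import numpy as np
import pandas as pd


def drop_mirrored_pairs(df, cols=("nodeA", "nodeB")):
    """Keep the first row for each unordered pair; rows are not modified."""
    # Sort within each row so (1, 0) and (0, 1) share the key (0, 1).
    keys = np.sort(df[list(cols)].to_numpy(), axis=1)
    # duplicated() flags every repeat of a key after its first occurrence.
    is_repeat = pd.DataFrame(keys, index=df.index).duplicated()
    # Filter the original frame, so surviving rows keep their column order.
    return df[~is_repeat]


# Sample data taken from the question.
df = pd.DataFrame({
    "nodeA": [0, 0, 0, 1, 1, 1],
    "nodeB": [0, 1, 2, 0, 1, 2],
    "distance_score": [0, 95, 105, 95, 0, 128],
})
deduped = drop_mirrored_pairs(df)

# The single-row case from above: the (2, 0) pair is not flipped.
flipped = pd.DataFrame({"nodeA": [2], "nodeB": [0], "distance_score": [100]})
kept = drop_mirrored_pairs(flipped)
```

Because the mask only selects rows, the result keeps the original index labels and the original nodeA/nodeB order of each surviving row.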

user3483203

Another approach is to build a frozenset from the two values, use it as the groupby key, and take the first element in each group, e.g.:

df.groupby(df[['nodeA', 'nodeB']].apply(frozenset, axis=1), as_index=False).first()

Which'll give you:

   nodeA  nodeB  distance_score
0      0      0               0
1      0      1              95
2      0      2             105
3      1      1               0
4      1      2             128
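To see why a frozenset works as the grouping key, here is a plain-Python sketch of the same idea, for illustration only and not as a replacement for the pandas one-liner. A frozenset is hashable and ignores order, so (0, 1) and (1, 0) produce the same key. The variable names are my own, and the rows are the sample data from the question.

```python
# Rows as (nodeA, nodeB, distance_score), taken from the question.
rows = [
    (0, 0, 0),
    (0, 1, 95),
    (0, 2, 105),
    (1, 0, 95),
    (1, 1, 0),
    (1, 2, 128),
]

seen = set()
unique_rows = []
for node_a, node_b, score in rows:
    # frozenset((0, 1)) == frozenset((1, 0)), so mirrored pairs collide.
    key = frozenset((node_a, node_b))
    if key not in seen:
        seen.add(key)
        unique_rows.append((node_a, node_b, score))
```

Note that a self-pair such as (1, 1) collapses to the one-element set frozenset({1}), which is still a distinct key from frozenset({0, 1}).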
Jon Clements