
This is not a duplicate of:

Fastest way to perform complex search on pandas dataframe

Note: pandas version 0.23.4

Assumptions: data can be laid out in any order.

I have a list:

L = ['A', 'B', 'C', 'D', 'L', 'M', 'N', 'O']

I also have a dataframe. Col1 and Col2 have several associated columns that have related info I wish to keep. The information is arbitrary so I have not filled it in.

Col1  Col2  Col1Info  Col2Info  Col1moreInfo  Col2moreInfo
 A     B       x         x            x             x
 B     C
 D     C
 L     M
 M     N
 N     O

I am trying to perform a 'search and group' for each element of the list. For example, if we performed a search on an element of the list, 'D', the following group would be returned.

To    From  Col1Info  Col2Info  Col1moreInfo  Col2moreInfo
 A     B       x         x            x             x
 B     C
 D     C

I have been playing around with networkx but it is a very complex package.

MaxB
    have you heard of `union-find` problem? – Quang Hoang Jun 20 '19 at 12:45
    How are your list `L` and df correlated? How do you come to the result you get? Where is the complex code for networkX you tried and that does not work? what is wrong with that code? – Patrick Artner Jun 20 '19 at 12:45
  • @PatrickArtner this list contains all of the elements that I want to 'group'. My thinking is once I get a group, I will drop all elements of that group from the list and keep going. – MaxB Jun 20 '19 at 12:51

1 Answer


You could define a graph using the values from both columns as edges, and look for the connected_components. Here's a way using NetworkX:

import networkx as nx

G = nx.Graph()
G.add_edges_from(df.values.tolist())
cc = list(nx.connected_components(G))
# [{'A', 'B', 'C', 'D'}, {'L', 'M', 'N', 'O'}]
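If you would rather avoid the networkx dependency, the same components can be found with a small union-find over the edge pairs, as one of the question's comments hints. A minimal sketch (the `edges` list stands in for `df[['Col1', 'Col2']].values.tolist()`):

```python
from collections import defaultdict

# Edge pairs taken from Col1/Col2 of the example dataframe
edges = [('A', 'B'), ('B', 'C'), ('D', 'C'),
         ('L', 'M'), ('M', 'N'), ('N', 'O')]

# parent maps each node to its representative
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in edges:
    union(a, b)

# Group nodes by their root to recover the connected components
groups = defaultdict(set)
for node in parent:
    groups[find(node)].add(node)
components = list(groups.values())
# e.g. [{'A', 'B', 'C', 'D'}, {'L', 'M', 'N', 'O'}]
```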

Now say, for instance, you want to filter by 'D'. You could then do:

component = next(i for i in cc if 'D' in i)
# {'A', 'B', 'C', 'D'}

And index the dataframe where the values from both columns are in component:

df[df.isin(component).all(1)]

   Col1 Col2
0    A    B
1    B    C
2    D    C

The above can be extended to all items in the list by generating a list of dataframes. We can then index dfs using a given item's position in L:

L = ['A', 'B', 'C', 'D', 'L', 'M', 'N', 'O']

dfs = [df[df.isin(list(i)).all(1)] for j in L for i in cc if j in i]  # list() for compatibility with older pandas
print(dfs[L.index('D')])

   Col1 Col2
0    A    B
1    B    C
2    D    C

print(dfs[L.index('L')])

   Col1 Col2
3    L    M
4    M    N
5    N    O
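Instead of building all the dataframes up front, one could also map each element to its component index once and filter on demand, which avoids scanning `cc` per lookup. A sketch under the same data (`group_for` is an illustrative helper name; `list()` is used because older pandas versions may not accept a set in `isin`):

```python
import pandas as pd

df = pd.DataFrame({'Col1': list('ABDLMN'), 'Col2': list('BCCMNO')})
cc = [{'A', 'B', 'C', 'D'}, {'L', 'M', 'N', 'O'}]  # as computed from the graph

# Map each element to its component index once, for O(1) lookup
node_to_comp = {n: k for k, comp in enumerate(cc) for n in comp}

def group_for(item):
    """Return the rows whose values all lie in item's component."""
    return df[df.isin(list(cc[node_to_comp[item]])).all(1)]

print(group_for('D'))  # rows with A-B, B-C, D-C
print(group_for('L'))  # rows with L-M, M-N, N-O
```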
yatu
  • the line `G.add_edges_from(df.values.tolist())`....in my case there are more columns than just two. is it possible to select only the ones I want? something like `G.add_edges_from(df['Col1','Col2'].tolist())`? – MaxB Jun 20 '19 at 13:25
    Yes @MaxB use `df[['Col1','Col2']]` (index with a list) – yatu Jun 20 '19 at 13:38
  • I think my version of pandas is giving me grief. It gives me the following error `'DataFrame' object has no attribute 'tolist'` – MaxB Jun 20 '19 at 13:42
    You need `df[['Col1','Col2']].values.tolist()` @MaxB – yatu Jun 20 '19 at 13:43
  • sorry for the constant troubles... `dfs = [df[df.isin(i).all(1)] for j in L for i in cc if j in i]` raises the following TypeError: `unhashable type: 'dict'` – MaxB Jun 20 '19 at 14:35
  • Sorry for the many comments but I believe this solution is incompatible with pandas 23.4, will update question with that info – MaxB Jun 20 '19 at 14:51