
I am using Python and Pandas, and I am trying to efficiently sum up a dataframe's values across different rows based on lists of IDs rather than unique IDs.

df:

Name  -  ID  - Related IDs          - Value
z     -  123 - ['aaa','bbb','ccc']  -  10
w     -  456 - ['aaa']              -  20
y     -  789 - ['ggg','hhh','jjj']  -  50
x     -  012 - ['jjj','hhh']        -  60
r     -  015 - ['hhh']              -  15

It would be possible to explode each row by the elements of the list, but that may duplicate the values being summed, and it might not be an efficient solution in terms of time and resources.
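
By "exploding" I mean roughly the following (a sketch, assuming `Related IDs` already holds Python lists and a pandas version with `DataFrame.explode`):

```python
# Rough sketch of the explode idea: each row is repeated once per ID, so a
# naive sum over the exploded frame counts a row's Value once for every one
# of its IDs (e.g. row z would contribute its 10 to three different sums).
exploded = df.explode('Related IDs')
naive_sum = exploded.groupby('Related IDs')['Value'].sum()
```

What I actually tried is grouping directly on the list column, which does not work: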

```python
f = {'Value': 'sum'}

df = df.groupby(['Related IDs']).agg(f)
# this does not work: it groups by the whole list
# rather than by the individual IDs

df = df.reset_index()
```

What I am expecting is a new column "Sum" that sums up the "Value" of rows which have one or more Related IDs in common, as follows:

Name  -  ID  - Related IDs          - Value - Sum
z     -  123 - ['aaa','bbb','ccc']  -  10  -  30
w     -  456 - ['aaa']              -  20  -  30
y     -  789 - ['ggg','hhh','jjj']  -  50  -  125
x     -  012 - ['jjj','hhh']        -  60  -  125
r     -  015 - ['hhh']              -  15  -  125
  • `df['Sum'] = df.groupby(['Related IDs'])['Value'].transform('sum')` – jezrael Jul 05 '19 at 10:12
  • Trying the following, I got an unhashable type error: `df = pd.DataFrame({'Related IDs': [['aaa','bbb','ccc'], ['aaa'], ['ddd','eee','fff']], 'Val': [400, 100, 60]}); print(df); df['Sum'] = df.groupby(['Related IDs'])['Val'].transform('sum')` – m.falconelli Jul 05 '19 at 10:33
  • @jezrael The duplicate question does not solve this problem. `list` is not hashable, so it cannot be used in `groupby`. – Quang Hoang Jul 05 '19 at 10:36
  • @QuangHoang - let's go answering ;) thank you for the comment. – jezrael Jul 05 '19 at 10:39
  • How about label encoding the lists? Or turning them into strings? – Quantum_Something Jul 05 '19 at 11:39
  • It may be a solution; I am more concerned about performance than other aspects. Lists will be more practical to work with (to set and access them), but I can convert them to strings. In that case, however, the possibility that one ID matches several IDs has to be avoided. Example: ID1 = 'abc', ID2 = 'abc', ID3 = 'ABCDE'. With a solution such as `if id in ids`, ID1 may match ID2 and also ID3, while it should only match ID2; see the sketch after these comments. – m.falconelli Jul 05 '19 at 12:48
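
A brief illustration of the substring pitfall described in the last comment (a sketch with made-up, lowercase IDs; exact token comparison avoids the false positive):

```python
# 'abc' and 'abcde' are distinct (hypothetical) IDs, but substring matching
# on a joined string confuses them.
ids_as_string = "abcde,fff"

print("abc" in ids_as_string)             # True  -> false positive against 'abcde'
print("abc" in ids_as_string.split(","))  # False -> exact token comparison is safe
```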

1 Answer


Use networkx with connected_components:

```python
import ast
from itertools import combinations, chain

import networkx as nx

# if necessary, convert the string representations to real lists
df['Related IDs'] = df['Related IDs'].apply(ast.literal_eval)

# create edges (an edge can only connect two nodes)
L2_nested = [list(combinations(l, 2)) for l in df['Related IDs']]
L2 = list(chain.from_iterable(L2_nested))
print(L2)
# [('aaa', 'bbb'), ('aaa', 'ccc'), ('bbb', 'ccc'),
#  ('ggg', 'hhh'), ('ggg', 'jjj'), ('hhh', 'jjj'), ('jjj', 'hhh')]

# build the graph from the edges and get its connected components
G = nx.Graph()
G.add_edges_from(L2)
connected_comp = nx.connected_components(G)

# map each ID to the index of its connected component
node2id = {x: cid for cid, c in enumerate(connected_comp) for x in c}

# create groups by mapping the first value of column Related IDs
groups = [node2id.get(x[0]) for x in df['Related IDs']]
print(groups)
# [0, 0, 1, 1, 1]

# sum Value within each group and assign it to the new column
df['Sum'] = df.groupby(groups)['Value'].transform('sum')
print(df)
#   Name   ID      Related IDs  Value  Sum
# 0    z  123  [aaa, bbb, ccc]     10   30
# 1    w  456            [aaa]     20   30
# 2    y  789  [ggg, hhh, jjj]     50  125
# 3    x   12       [jjj, hhh]     60  125
# 4    r   15            [hhh]     15  125
```
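
One caveat worth noting (an addition, not part of the original answer): `combinations(l, 2)` creates no edges for a single-element list, so an ID that only ever appears alone would never enter the graph and `node2id.get` would return `None` for its row. With the sample data this cannot happen, because `'aaa'` and `'hhh'` also appear in longer lists, but a defensive one-liner placed before `nx.connected_components(G)` covers the general case:

```python
# Optional safeguard (assumption, not in the original answer): register every
# ID as a node so that IDs appearing only in single-element lists still end
# up in their own (singleton) connected component.
G.add_nodes_from(chain.from_iterable(df['Related IDs']))
```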