0

I have this dataframe:

           A  B 
0  [0, 1, 2]  1 
1        foo  1 
2     [3, 4]  1

I would like to use explode function for column "A" and then to keep right and fair proportion for each exploded row in case with column "B" . So the result should look like this:

     A  B 
0    0  0.33
0    1  0.33
0    2  0.33
1  foo  1 
2    3  0.5 
2    4  0.5

Would this be possible with the explode function? I would manage to come to this result with for row in data.itertuples(): but the for loop is so slow in case with large dataframe. So do you have idea how to solve this with explode or with some other fast way?

I would be very grateful with any help.

kvetjo
  • 55
  • 4
  • 2
    `edf = df.explode("A"); edf.B = edf.groupby(edf.index).B.transform("size").rdiv(1)`. You didn't show your full attempt as a code, so I don't attempt to explain. Also, idk if this is fast enough for you :) –  Jan 17 '22 at 19:05
  • Fair enough. Thanks anyway. I did not manage to do much with the problem unless I used the for cycle. I'm not sure if its worth to publish this attempt with the for loop to this post when its not connected in any way. – kvetjo Jan 17 '22 at 21:04

3 Answers3

3

You can explode "A"; then groupby the index and transform count method (to count the number of each index) and divide the elements in 'B' by their corresponding index count.

out = df.explode('A')
out['B'] /= out['B'].groupby(level=0).transform('count')

Output:

     A         B
0    0  0.333333
0    1  0.333333
0    2  0.333333
1  foo  1.000000
2    3  0.500000
2    4  0.500000
2

Create a boolean mask which indicates rows where A holds a list:

mask = df['A'].apply(lambda x: isinstance(x, list))

Prepopulate a new column of denominators with the integer 1 (to leave division by these values unchanged):

df['denom'] = 1

For each row where A is a list, overwrite the value of B with the length of the list:

df.loc[mask, 'denom'] = df['A'].str.len()

Explode column A, compute the new value of B, and drop denom:

res = df.explode('A').reset_index(drop=True)
res['B'] = res['B'] / res['denom']
res = res.drop(columns='denom')

Result:

print(res)
     A         B
0    0  0.333333
1    1  0.333333
2    2  0.333333
3  foo  1.000000
4    3  0.500000
5    4  0.500000
Peter Leimbigler
  • 10,775
  • 1
  • 23
  • 37
1

You can use explode, then groupby+apply:

(df.explode('A')
   .assign(B=lambda d: d.groupby(level=0)['B'].apply(lambda s:s/len(s)))
)

output:

     A         B
0    0  0.333333
0    1  0.333333
0    2  0.333333
1  foo  1.000000
2    4  0.500000
2    5  0.500000

input:

df = pd.DataFrame({'A': [[0,1,2], 'foo', [4,5]],
                   'B': [1,1,1]})
mozway
  • 194,879
  • 13
  • 39
  • 75
  • The output you display doesn't match the one your code yields. –  Jan 17 '22 at 19:10
  • I like it is one line answer. However the part of your answer where is ...B=... How can I put there a variable so this line of code would be dynamic and possible to use it inside a function? – kvetjo Jan 17 '22 at 20:52
  • 1
    @kvetjo You can use a dictionary in `assign`: `assign(**{'B': lambda ...})`, now the column name is a string and dynamic ;) – mozway Jan 17 '22 at 21:34