Pandas - Calculate New Value Based on Cross Reference with Another Column

Question

I'm trying to calculate new values in a column whose values are cross-referenced to another column.

>>> import pandas as pd
>>> df = pd.DataFrame( {"A":[0., 100., 80., 40., 0., 60.], 
                        "B":[12,  12,   3,  19,  3,  19]} )
>>> df
       A   B
0    0.0  12
1  100.0  12
2   80.0   3
3   40.0  19
4    0.0   3
5   60.0  19

I want to find all values in column A that are 0, find out the corresponding value in column B, then change all column A values that have the same column B value, according to some function. For instance in the example above I would like to change the first two values of column A, df.A[0] and df.A[1], respectively 0. and 100., into 0.5 and 99.5, because df.A[0] is 0. and it has the same value df.B[0] = 12 in column B as df.B[1] = 12.

df
      A   B
0   0.5  12
1  99.5  12
2  79.5   3
3  40.0  19
4   0.5   3
5  60.0  19

I tried chaining loc, aggregate, groupby and mask functionalities, but I'm not succeeding. Is the only way through a for loop?

EDIT: Broadened example to better illustrate intent.

How did you calculate the values `0.5` and `99.5`, there has to be some logic behind it. — Erfan, Apr 04 '19 at 22:49
@gmds, the most elementary function would be: if value=0 then calculated value = 0.5 and cross referenced value = previous value - 0.5 — gciriani, Apr 04 '19 at 23:35
@gciriani so is it guaranteed that where `df[A]==0`, all values of `df[B]` are the same? — gmds, Apr 04 '19 at 23:44
@gmds, yes for now where df[A]==0, all values of df[B] are the same; in future I may have to change the function, but I'll cross that bridge when necessary. — gciriani, Apr 05 '19 at 00:21

gmds · Answer 1 · 2019-04-05T01:27:10.673

1

This will work:

import pandas as pd

df = pd.DataFrame( {"A":[0., 100., 40., 60.], "B":[12, 12, 19, 19]} )

def f(series):
    return (series + 0.5).where(series == 0, series - 0.5)

B_value = df.loc[df['A'] == 0, 'B'][0]
df.loc[df['B'] == B_value, 'A'] = df.loc[df['B'] == B_value, 'A'].transform(f)

print(df)

Output:

      A   B
0   0.5  12
1  99.5  12
2  40.0  19
3  60.0  19

You can pass an arbitrary function into transform.

There might be a cleaner way to do this; it strikes me as slightly messy.

edited Apr 05 '19 at 01:27

answered Apr 05 '19 at 01:12

gmds

19,325
4
32
58

It doesn't seem to be scaling to the whole data frame. If there is more than one 0.0 value it doesn't work. For instance when the starting df has a 0.0 in the location where there is now 60.0, the result for column A is [0.5, 99.5, 40.0, 0.0]; I instead aim to obtain [0.5, 99.5, 39.5, 0.5]. – gciriani Apr 05 '19 at 01:56
@gciriani but didn't you say that where `df['A'] == 0`, all values of `df['B']` are the same? – gmds Apr 05 '19 at 02:02
I think I misunderstood your early comment. the function you built has to apply to all identical values of df['B'] as if they were in a group. Therefore pd.Groupby() may be the right way to tackle the problem. I broadened the example to clarify the intent. – gciriani Apr 05 '19 at 13:15
I posted something that works below; it borrows your function and the use of transform, thanks; perhaps you can improve some the way it works. – gciriani Apr 05 '19 at 22:41

gciriani · Accepted Answer · 2019-04-06T11:51:27.050

I found a working solution, although probably sub-optimal. I chain groupby, filter and transform to obtain a desired series, and then replace the result in the original dataframe.

import pandas as pd
df = pd.DataFrame( {"A":[0., 100., 80., 40., 0., 60.], 
                    "B":[12,  12,   3,  19,  3,  19]} )
u = ( df.groupby(by="B",  sort=False)
         .filter(lambda x: x.A.min() == 0, dropna=False)
         .A.transform( lambda x: (x+0.5).where(x == 0, x - 0.5) ) 
    )
df.loc[pd.notnull(u), "A"] = u

gives the following results

print("\ninitial df\n",df,"\n\nintermediate series\n",u,"\n\nfinal result",df)

initial df
        A   B
0    0.0  12
1  100.0  12
2   80.0   3
3   40.0  19
4    0.0   3
5   60.0  19

intermediate series
 0     0.5
1    99.5
2    79.5
3     NaN
4     0.5
5     NaN
Name: A, dtype: float64

final result       A   B
0   0.5  12
1  99.5  12
2  79.5   3
3  40.0  19
4   0.5   3
5  60.0  19

Pandas - Calculate New Value Based on Cross Reference with Another Column

2 Answers2

Linked