2

This is stumping me with pandas.. I have a data frame with 5.8M rows and a date index. I 5 columns A, B, C, D & E and would simply like to create a new column F_Score based on simple math below:

F_Score=0
if A > = B:
    F_Score = 1.0 
else:
    F_Score= -1.0

if B > = C:
    F_Score = F_Score + 1.0 
else:
    F_Score = F_Score - 1.0

if C > = D: 
    F_Score = F_Score + 1.0 
else:
    F_Score = F_Score-1.0

if D > = E:
    F_Score = F_Score + 1.0 
else:
    F_Score = F_Score -1.0

I cannot seem to find a simple way to do this. Any suggestions?

I tried turning all of the columns into numpy arrays and then doing a for loop storing the result into a new array and then creating a column F for the dataframe. This seems very inefficient. Is there a better way?

ScottC
  • 3,941
  • 1
  • 6
  • 20
  • Yes, use `np.select`. if you create some mock data, SO will give you a good verifiable solution. Or even nested `np.where`. – Scott Boston Dec 28 '22 at 15:23

2 Answers2

3
np.sum(-np.sign(np.diff(df.to_numpy(), axis=1)), axis=1)

From the if chains, it seems like if we take the column-wise successive differences and look at their sign:

  • if sign is positive, subtract 1
  • if sign is negative, add 1 (note that the difference is taken from right to left).

So we go to the NumPy domain, take the column-wise difference, get the signs, negate them, and their summation is the desired F_score here.

pandas supports diff too, but since the problem is label-agnostic and you mentioned you have a lot of rows, going to NumPy domain should make it faster, and hopefully fast enough.

Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
0

You could put your math in a function and apply it with apply: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

should look something like that:

def logic(a,b,c,d,e):
    do something...
    return f

df["f"] = df.apply(lambda row: logic(row["a"], row["b"], row["c"], row["d"], row["e"]), axis=1)
Schulzjo
  • 148
  • 2
  • 6