2

I'm adding a column to a dataframe where the column values are determined by comparing two other columns in the dataframe. The code to add the column is:

lst = []
for x in range(len(df)):
    if df['ColumnA'][x] > df['ColumnB'][x]:
        lst.append(df['ColumnB'][x])
    else:
        lst.append(df['ColumnA'][x])

df['ColumnC'] = lst

My question is, is there a more efficient/pythonic way to do this? I have been advised in the past to be wary if I'm every looping through every row of a dataframe, so wanted to see if I was missing something. Thank you!

jpp
  • 159,742
  • 34
  • 281
  • 339
Arjun Arun
  • 313
  • 2
  • 9
  • Possible duplicate of [Find the max of two or more columns with pandas](https://stackoverflow.com/questions/12169170/find-the-max-of-two-or-more-columns-with-pandas) – G. Anderson Feb 01 '19 at 18:14

3 Answers3

6

Yes, just take the minimum:

df['ColumnC'] = df[['ColumnA', 'ColumnB']].min(1)
jpp
  • 159,742
  • 34
  • 281
  • 339
  • do you know how this compares perf. wise to np.where off the top? No worries if not, good solution! – Arjun Arun Feb 01 '19 at 18:27
  • Probably `np.where`, but don't prematurely optimize.. Write good, understandable code. Then discover bottlenecks and optimize them. – jpp Feb 01 '19 at 18:31
2

Use numpy.where

df['ColumnC'] = np.where(df['ColumnA'] > df['ColumnB'], df['ColumnB'], df['ColumnA'])
Sociopath
  • 13,068
  • 19
  • 47
  • 75
0

A bit more code than other solutions but arguably more generalizable

mask = df[ColumnA] > df[ColumnB]
df[ColumnC] = pd.Series(index=df.index)
df[ColumnC].loc[mask] = df[ColumnA].loc[mask]
df[ColumnC].loc[~mask] = df[ColumnB].loc[~mask]
Alex
  • 12,078
  • 6
  • 64
  • 74