
I have a dataframe where some columns are name-paired (for each column ending with _x there is a corresponding column ending with _y) and others are not. For example:

import pandas as pd
import numpy as np

colnames = [
    'foo', 'bar', 'baz',
    'a_x', 'b_x', 'c_x',
    'a_y', 'b_y', 'c_y',
]
rng = np.random.default_rng(0)
data = rng.random((20, len(colnames)))
df = pd.DataFrame(data, columns=colnames)

Assume I have two lists, both of length m: one containing all the column names ending with _x, the other all the column names ending with _y (such lists are easy to build; remember that each _x column has exactly one corresponding _y column). I want to create m new columns with a simple formula:

df['a_err'] = (df['a_x'] - df['a_y']) / df['a_y']

without hard-coding the column names, of course. It's easy to do this with a for loop (shown below for reference), but I would like to know whether it can be done without one, in the hope that it would be faster (the real dataframe is much bigger than this small example).
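
For reference, this is the loop-based version I'd like to replace (a minimal sketch; x_cols and y_cols are the two lists, and str.removesuffix needs Python 3.9+):

x_cols = [c for c in df.columns if c.endswith('_x')]
y_cols = [c for c in df.columns if c.endswith('_y')]

# One new column per (_x, _y) pair: relative difference with respect to _y
for cx, cy in zip(x_cols, y_cols):
    df[cx.removesuffix('_x') + '_err'] = (df[cx] - df[cy]) / df[cy]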

DeltaIV
  • There's always going to be a loop. There are certainly no vectorized methods based on column names. Assuming the columns are long, the overhead of the loop won't be significant. – Tim Roberts May 04 '23 at 21:21
  • @TimRoberts interesting. So you're saying that my `for`-based solution is (nearly) speed-optimal, right? – DeltaIV May 04 '23 at 21:59
  • Well, there are lots of ways to loop over the column names (as shown in the answers), some smart, some less smart, but assuming you have way more rows than columns, that's never going to dominate the timing. – Tim Roberts May 04 '23 at 22:55
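
A quick micro-benchmark makes the point (a hypothetical sketch; exact timings depend on the shape of the data):

import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.default_rng(0)
big = pd.DataFrame(rng.random((1_000_000, 6)),
                   columns=['a_x', 'b_x', 'c_x', 'a_y', 'b_y', 'c_y'])

def loop():
    # Explicit Python loop: one vectorized operation per column pair
    return pd.DataFrame({f'{p}_err': (big[f'{p}_x'] - big[f'{p}_y']) / big[f'{p}_y']
                         for p in 'abc'})

def block():
    # Single block operation on the underlying arrays
    x = big[['a_x', 'b_x', 'c_x']].to_numpy()
    y = big[['a_y', 'b_y', 'c_y']].to_numpy()
    return pd.DataFrame((x - y) / y, columns=['a_err', 'b_err', 'c_err'])

print(timeit(loop, number=10), timeit(block, number=10))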

3 Answers


You can use `groupby` and `apply` over the columns with a custom function. (Note that `groupby(..., axis=1)` is deprecated as of pandas 2.1; on recent versions, prefer the `filter`-based variant at the end of this answer.)

# Each group is the two-column sub-frame for one prefix (e.g. a_x, a_y): compute (x - y) / y
func = lambda g: (g.iloc[:, 0] - g.iloc[:, 1]) / g.iloc[:, 1]

# r'' denotes a raw string (backslashes are kept literally), like f'' denotes a formatted string
# Keep columns that end ($) with '_x' or '_y' (_[xy])
# Groupby column prefix (a_x -> a, b_x -> b, ..., c_y -> c)
# Apply your formula on each group (a_x, a_y), (b_x, b_y), (c_x, c_y)
err = (df.filter(regex=r'_[xy]$')
         .groupby(lambda x: x.split('_')[0], axis=1)
         .apply(func).add_suffix('_err'))

# Append error columns to your original dataframe
df = pd.concat([df, err], axis=1)

Output:

>>> df
         foo       bar       baz       a_x       b_x       c_x       a_y       b_y       c_y      a_err     b_err      c_err
0   0.636962  0.269787  0.040974  0.016528  0.813270  0.912756  0.606636  0.729497  0.543625  -0.972755  0.114838   0.679017
1   0.935072  0.815854  0.002739  0.857404  0.033586  0.729655  0.175656  0.863179  0.541461   3.881166 -0.961091   0.347567
2   0.299712  0.422687  0.028320  0.124283  0.670624  0.647190  0.615385  0.383678  0.997210  -0.798040  0.747885  -0.351000
3   0.980835  0.685542  0.650459  0.688447  0.388921  0.135097  0.721488  0.525354  0.310242  -0.045796 -0.259697  -0.564545
4   0.485835  0.889488  0.934044  0.357795  0.571530  0.321869  0.594300  0.337911  0.391619  -0.397955  0.691361  -0.178106
5   0.890274  0.227158  0.623187  0.084015  0.832644  0.787098  0.239369  0.876484  0.058568  -0.649014 -0.050018  12.439042
6   0.336117  0.150279  0.450339  0.796324  0.230642  0.052021  0.404552  0.198513  0.090753   0.968411  0.161849  -0.426782
7   0.580332  0.298696  0.671995  0.199515  0.942113  0.365110  0.105495  0.629108  0.927155   0.891226  0.497538  -0.606204
8   0.440377  0.954590  0.499896  0.425229  0.620213  0.995097  0.948944  0.460045  0.757729  -0.551893  0.348158   0.313262
9   0.497423  0.529312  0.785786  0.414656  0.734484  0.711143  0.932060  0.114933  0.729015  -0.555119  5.390557  -0.024516
10  0.927424  0.967926  0.014706  0.863640  0.981195  0.957210  0.148764  0.972629  0.889936   4.805437  0.008807   0.075595
11  0.822374  0.479988  0.232373  0.801881  0.923530  0.266130  0.538934  0.442753  0.931017   0.487900  1.085882  -0.714151
12  0.040511  0.732006  0.614373  0.028365  0.719220  0.015992  0.757951  0.512759  0.929104  -0.962576  0.402648  -0.982788
13  0.066082  0.841317  0.066690  0.344310  0.430299  0.966062  0.562232  0.258865  0.241676  -0.387601  0.662254   2.997349
14  0.888118  0.225869  0.124555  0.288331  0.586123  0.554091  0.809711  0.560476  0.288421  -0.643909  0.045760   0.921116
15  0.412896  0.818121  0.626506  0.959078  0.369404  0.552612  0.593924  0.848291  0.145474   0.614815 -0.564531   2.798708
16  0.406510  0.909959  0.043067  0.822706  0.415384  0.829804  0.009955  0.365046  0.078630  81.646166  0.137895   9.553270
17  0.652615  0.273849  0.702652  0.943801  0.126817  0.864778  0.059464  0.380771  0.429774  14.871772 -0.666946   1.012170
18  0.488850  0.976462  0.775691  0.308857  0.269837  0.863120  0.881307  0.510707  0.344296  -0.649546 -0.471640   1.506915
19  0.994917  0.315944  0.182712  0.880098  0.812335  0.667889  0.958414  0.925715  0.748249  -0.081714 -0.122477  -0.107396

You can also use `filter` to split your columns into the _x and _y blocks:

# Python >= 3.8 (walrus operator); .values drops the _x labels so the arithmetic aligns on the _y columns
err = (df.filter(regex='_x$').values - (y := df.filter(regex='_y$'))) / y
err.columns = err.columns.str.split('_').str[0] + '_err'
df = pd.concat([df, err], axis=1)
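
The same idea works without the walrus operator on older Python (a minimal sketch; it assumes the _x and _y columns appear in the same relative order, as in the example):

x = df.filter(regex='_x$')
y = df.filter(regex='_y$')
# Positional arithmetic on the arrays, then rebuild the labels
err = pd.DataFrame((x.to_numpy() - y.to_numpy()) / y.to_numpy(),
                   index=df.index,
                   columns=y.columns.str.split('_').str[0] + '_err')
df = pd.concat([df, err], axis=1)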
Corralien
  • I don't like this part: `df.iloc[:, 3:]`. You're assuming that the paired columns start from (column) index 3 onwards. This isn't true in general: it's just a coincidence that this is true for the sample df I wrote. Can you fix your code so that it doesn't depend on hardcoded constants? – DeltaIV May 04 '23 at 21:57
  • I replaced `df.iloc[:, 3:]` with `df.filter(regex=r'_[xy]$')`. Can you check it please? – Corralien May 04 '23 at 22:21
  • The test will have to wait until tomorrow. In the meantime, could you please tell me what the `r` in `r'_[xy]$'` stands for? It would be great if you could explain in detail what your code is doing. – DeltaIV May 04 '23 at 23:07

Since the _x and _y columns pair up one-to-one, the loop only has to run over column pairs (and here it stays implicit), which is still performant. The solution below uses a MultiIndex and pushes as much of the processing as possible down to Pandas:

# Keep only the paired columns (assumes '_' occurs only in paired names)
temp = df.filter(like='_')
# Split each name into a two-level MultiIndex: a_x -> ('a', 'x')
temp.columns = temp.columns.str.split('_', expand=True)
# Put the x/y level on top, so temp.x and temp.y select whole blocks
temp = temp.swaplevel(axis='columns')
# (x - y) / y for each pair, labelled a_err, b_err, c_err
temp = temp.x.sub(temp.y).div(temp.y).add_suffix('_err')
# assign returns a new DataFrame with the error columns appended
df.assign(**temp)

         foo       bar       baz       a_x       b_x       c_x       a_y       b_y       c_y      a_err     b_err      c_err
0   0.636962  0.269787  0.040974  0.016528  0.813270  0.912756  0.606636  0.729497  0.543625  -0.972755  0.114838   0.679017
1   0.935072  0.815854  0.002739  0.857404  0.033586  0.729655  0.175656  0.863179  0.541461   3.881166 -0.961091   0.347567
2   0.299712  0.422687  0.028320  0.124283  0.670624  0.647190  0.615385  0.383678  0.997210  -0.798040  0.747885  -0.351000
3   0.980835  0.685542  0.650459  0.688447  0.388921  0.135097  0.721488  0.525354  0.310242  -0.045796 -0.259697  -0.564545
4   0.485835  0.889488  0.934044  0.357795  0.571530  0.321869  0.594300  0.337911  0.391619  -0.397955  0.691361  -0.178106
5   0.890274  0.227158  0.623187  0.084015  0.832644  0.787098  0.239369  0.876484  0.058568  -0.649014 -0.050018  12.439042
6   0.336117  0.150279  0.450339  0.796324  0.230642  0.052021  0.404552  0.198513  0.090753   0.968411  0.161849  -0.426782
7   0.580332  0.298696  0.671995  0.199515  0.942113  0.365110  0.105495  0.629108  0.927155   0.891226  0.497538  -0.606204
8   0.440377  0.954590  0.499896  0.425229  0.620213  0.995097  0.948944  0.460045  0.757729  -0.551893  0.348158   0.313262
9   0.497423  0.529312  0.785786  0.414656  0.734484  0.711143  0.932060  0.114933  0.729015  -0.555119  5.390557  -0.024516
10  0.927424  0.967926  0.014706  0.863640  0.981195  0.957210  0.148764  0.972629  0.889936   4.805437  0.008807   0.075595
11  0.822374  0.479988  0.232373  0.801881  0.923530  0.266130  0.538934  0.442753  0.931017   0.487900  1.085882  -0.714151
12  0.040511  0.732006  0.614373  0.028365  0.719220  0.015992  0.757951  0.512759  0.929104  -0.962576  0.402648  -0.982788
13  0.066082  0.841317  0.066690  0.344310  0.430299  0.966062  0.562232  0.258865  0.241676  -0.387601  0.662254   2.997349
14  0.888118  0.225869  0.124555  0.288331  0.586123  0.554091  0.809711  0.560476  0.288421  -0.643909  0.045760   0.921116
15  0.412896  0.818121  0.626506  0.959078  0.369404  0.552612  0.593924  0.848291  0.145474   0.614815 -0.564531   2.798708
16  0.406510  0.909959  0.043067  0.822706  0.415384  0.829804  0.009955  0.365046  0.078630  81.646166  0.137895   9.553270
17  0.652615  0.273849  0.702652  0.943801  0.126817  0.864778  0.059464  0.380771  0.429774  14.871772 -0.666946   1.012170
18  0.488850  0.976462  0.775691  0.308857  0.269837  0.863120  0.881307  0.510707  0.344296  -0.649546 -0.471640   1.506915
19  0.994917  0.315944  0.182712  0.880098  0.812335  0.667889  0.958414  0.925715  0.748249  -0.081714 -0.122477  -0.107396
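
For clarity, this is what the intermediate column index looks like right after the swaplevel step (a small inspection sketch):

temp = df.filter(like='_')
temp.columns = temp.columns.str.split('_', expand=True)
temp = temp.swaplevel(axis='columns')
print(temp.columns.tolist())
# [('x', 'a'), ('x', 'b'), ('x', 'c'), ('y', 'a'), ('y', 'b'), ('y', 'c')]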

sammywemmy

If you have the lists already, something like the following will work:

q = ['a', 'b', 'c']
x = ['a_x', 'b_x', 'c_x']
y = ['a_y', 'b_y', 'c_y']
e = ['a_err', 'b_err', 'c_err']


df[e] = (df[x].rename(columns=dict(zip(x, q)))
         - df[y].rename(columns=dict(zip(y, q)))) / df[y].rename(columns=dict(zip(y, q)))

There might be a more efficient method, as this renames the columns each time so that the subtraction and division align on matching labels.
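
One way to avoid the repeated renames is to drop down to numpy arrays, which sidesteps label alignment entirely (a minimal sketch under the same assumptions about the lists):

# Positional arithmetic: no column labels, so no renaming needed
df[e] = (df[x].to_numpy() - df[y].to_numpy()) / df[y].to_numpy()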

Rawson