
I have a large DataFrame df (~100 columns and ~7 million rows) and I need to create ~50 new variables/columns, which are simple transformations of the existing variables. One way to proceed would be with many separate .apply statements (I'm just using transform* as a placeholder for simple transformations such as taking a max or squaring):

df['new_var1'] = df['old_var1'].apply(lambda x: transform1(x))
...
df['new_var50'] = df['old_var50'].apply(lambda x: transform50(x))

Another way would be to first create a dictionary

transform_dict = {
    'new_var1': lambda row: transform1(row),
    ...,
    'new_var50': lambda row: transform50(row)
}

and then write one .apply combined with .concat:

df = pd.concat([df, 
   df.apply(lambda r: pd.Series({var : transform_dict[var](r) for var in transform_dict.keys()}), axis=1)], axis=1)

Is one method preferred over the other, either in how 'Pythonic' it is, or efficiency, scalability, flexibility?

chriswhite
    I would expect the second form to be more efficient as the first method is iteratively incrementing the df which will mean many reallocations of memory and data copying as the df grows – EdChum Feb 03 '16 at 15:19
    Aside: `apply(lambda row: transform1(row))` is just a slower way to write `apply(transform1)`. – DSM Feb 03 '16 at 15:56
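To illustrate both comments, here is a minimal sketch using the question's own placeholders (df, old_var*/new_var*, transform*): pass each transform function to .apply directly instead of wrapping it in a lambda, build all 50 new columns first, and attach them with a single concat so the DataFrame is never grown column by column:

# collect the new columns separately, then concatenate once
new_cols = pd.DataFrame({
    'new_var1': df['old_var1'].apply(transform1),    # transform1, not lambda x: transform1(x)
    # ... new_var2 through new_var49 ...
    'new_var50': df['old_var50'].apply(transform50),
})
df = pd.concat([df, new_cols], axis=1)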

2 Answers


Continuing @Stefan's experiment, but with a size of 100k rows x 100 columns and with a new method that first allocates a block of NaN columns and concatenates it to the DataFrame, then uses iloc to fill each new column:

import numpy as np
import pandas as pd

def cols_via_iloc(df):
    df = pd.concat([df, pd.DataFrame(np.tile(np.nan, [len(df), 50]))], axis=1)
    for i in range(100, 150):
        df.iloc[:, i] = i * df.iloc[:, i - 100]
    return df

def cols_via_apply(df):
    for i in range(100, 150):
        df[i] = df[i-100].apply(lambda x: x * i)
    return df  

def cols_via_concat(df):
    df = pd.concat([df, df.apply(lambda row: pd.Series({i: i * row[i - 100]
                                                        for i in range(100, 150)}), axis=1)], axis=1)
    return df

>>> %%timeit df = pd.DataFrame(np.random.randn(100000, 100))
    cols_via_iloc(df)
1 loops, best of 3: 540 ms per loop

>>> %%timeit df = pd.DataFrame(np.random.randn(100000, 100))
    cols_via_apply(df)
1 loops, best of 3: 2.91 s per loop

>>> %%timeit df = pd.DataFrame(np.random.randn(100000, 100))
    cols_via_concat(df)
1 loops, best of 3: 55.8 s per loop
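
Two practical notes on cols_via_iloc, not from the answer itself: pd.concat builds a new object inside the function, so callers need the returned DataFrame, and the appended NaN block keeps default integer labels 0-49, which duplicate the first 50 original column labels. A rough sketch of the same idea with explicit, made-up names for the new columns avoids that collision:

import numpy as np
import pandas as pd

def cols_via_iloc_named(df):
    # hypothetical names for the 50 new columns
    new_names = ['new_var%d' % i for i in range(1, 51)]
    block = pd.DataFrame(np.full((len(df), 50), np.nan), index=df.index, columns=new_names)
    df = pd.concat([df, block], axis=1)
    for i in range(50):
        # same scaling as above: column i multiplied by (i + 100)
        df[new_names[i]] = (i + 100) * df.iloc[:, i]
    return df

df = cols_via_iloc_named(pd.DataFrame(np.random.randn(100000, 100)))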
Alexander

Starting with:

df = pd.DataFrame(np.random.random((1000, 100)))

Adding individual columns:

def cols_via_apply(df):
    for i in range(100, 150):
        df[i] = df[i-100].apply(lambda x: x * i)
    return df  

%timeit cols_via_apply(df)

10 loops, best of 3: 29.6 ms per loop

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Columns: 150 entries, 0 to 149
dtypes: float64(150)
memory usage: 1.2 MB
None

This seems quite a bit more efficient than using pd.concat, presumably because the latter involves a Python-level loop over the rows of the DataFrame (apply with axis=1). So the difference will get worse as the DataFrame gets longer:

def cols_via_concat(df):
    df = pd.concat([df, df.apply(lambda row: pd.Series({i: i * row[i-100] for i in range(100, 150)}), axis=1)], axis=1)
    return df


%timeit cols_via_concat(df)

1 loops, best of 3: 450 ms per loop

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Columns: 150 entries, 0 to 149
dtypes: float64(150)
memory usage: 1.2 MB
None
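
As a side note that neither answer measures: for elementwise transformations like this one, .apply can be dropped entirely and the arithmetic applied to whole columns, which pandas/NumPy vectorize. A minimal sketch under the same synthetic setup (the function name here is made up for illustration):

import numpy as np
import pandas as pd

def cols_via_vectorized(df):
    for i in range(100, 150):
        # whole-column arithmetic; no per-row Python function calls
        df[i] = df[i - 100] * i
    return df

df = cols_via_vectorized(pd.DataFrame(np.random.random((1000, 100))))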
Stefan