Assignments to a Pandas DataFrame with mixed float32 and float64 dtypes are, for some combinations, rather slow the way I do it.
The code below sets up a DataFrame, performs a NumPy/SciPy computation on part of the data, creates a new DataFrame by copying the old one, and assigns the result of the computation to the new DataFrame:
import pandas as pd
import numpy as np
from scipy.signal import lfilter
N = 1000
M = 1000
def f(dtype1, dtype2):
    coi = [str(m) for m in range(M)]  # columns of interest (the numeric columns)
    df = pd.DataFrame([[m for m in range(M)] + ['Hello', 'World'] for n in range(N)],
                      columns=coi + ['A', 'B'], dtype=dtype1)
    Y = lfilter([1], [0.5, 0.5], df.ix[:, coi])  # computation on the numeric part
    Y = Y.astype(dtype2)
    new = pd.DataFrame(df, copy=True)  # copy of the original DataFrame
    print(new.iloc[0, 0].dtype)
    print(Y.dtype)
    new.ix[:, coi] = Y  # This statement is considerably slow
    print(new.iloc[0, 0].dtype)
from time import time
dtypes = [np.float32, np.float64]
for dtype1 in dtypes:
    for dtype2 in dtypes:
        print('-' * 10)
        start_time = time()
        f(dtype1, dtype2)
        print(time() - start_time)
The timing result is (each block shows the dtype of new before the assignment, the dtype of Y, the dtype of new after the assignment, and the elapsed time in seconds):
----------
float32
float32
float64
10.1998147964
----------
float32
float64
float64
10.2371120453
----------
float64
float32
float64
0.864870071411
----------
float64
float64
float64
0.866265058517
Here the critical line is new.ix[:, coi] = Y: it is roughly ten times slower for some dtype combinations than for others.
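To isolate the cost, one can time just that statement inside f (a small instrumentation sketch; everything else in f stays unchanged):

from time import time

t0 = time()
new.ix[:, coi] = Y  # the statement under suspicion
print('assignment alone:', time() - t0)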
I can understand that there must be some overhead for reallocation when a float32 DataFrame is assigned float64 values. But why is the overhead so dramatic?
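For what it's worth, here is a sketch of a possible way around the in-place assignment (an assumption on my part, not verified to be equivalent in every respect): build the result DataFrame directly from Y instead of writing into a copy, so no float32 blocks have to be reallocated:

# Hypothetical alternative: construct the result directly from Y.
# pd.concat keeps each column's dtype, so the numeric part should
# stay dtype2 instead of being written into dtype1 blocks.
numeric = pd.DataFrame(Y, columns=coi, index=df.index)
new = pd.concat([numeric, df[['A', 'B']]], axis=1)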
Furthermore, the float32-to-float32 assignment is also slow, and the result ends up as float64 anyway, which also bothers me.
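A stripped-down sketch of the same kind of assignment, independent of the filtering step, which I would expect to show the same upcast (assuming the behaviour above carries over):

# Both sides are float32, yet based on the behaviour above I would
# expect the assigned columns to come back as float64.
small = pd.DataFrame(np.zeros((3, 2), dtype=np.float32), columns=['0', '1'])
small.ix[:, ['0', '1']] = np.ones((3, 2), dtype=np.float32)
print(small.dtypes)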