Python - vector to function is slower than calling for loop on each element. Numpy slower than native Python

Question

I am working on code to perform a transient simulation calling a function for each second of the day and eventually whole year. I thought I found an opportunity to speed up the code by passing a vector of inputs instead of calling a for loop, however, my code runs slower when I do this and I don't understand why.

I would hope for the vector to be only slightly slower than calling the for loop one time to achieve my targeted speed up.

Can you please help explain and/or solve this issue? I have shown three ways in my sample below which is a simplification of the larger program.

Inside the function is the Eg variable which is currently set to zero. If I do Eg=float(0) or Eg=np.array([0,0,0,0,0]) the code runs slower and I assume this is the same issue as the larger question.

The results from the code below is:

Execution time for numpy vector is 716.225 ms
Execution time for 'for-loop' 6 calls is 389.87 ms
Execution time for numpy float32 'for-loop' is 3906.9069999999997 ms

Code sample:

from datetime import datetime, timedelta
import numpy as np

def Q_Walls( A, R1, R2, R3, R4, m, cp, T_inf_inside, T_inf_outside, dt, T_p):
    
    Ein = (T_inf_outside - T_p) * A / (R2/2 + R3 + R4) # convection and conduction only
    Eout = (T_p - T_inf_inside) * A / (R1 + R2/2) # convection and conduction only
    Eg = 0
    Enet = Eg + Ein - Eout
    T_p1 = (Enet * dt / (m * cp) + T_p) # average bulk temperature of wall after time dt
    T2_surf = (T_p - Eout * R2/2 / A)
    
    return T_p1, Eout, T2_surf

def Q_Walls_vect( A, R1, R2, R3, R4, m, cp, T_inf_inside, T_inf_outside, dt, T_p):
    
    Ein = (T_inf_outside - T_p) * A / (R2/2 + R3 + R4) # convection and conduction only
    Eout = (T_p - T_inf_inside) * A / (R1 + R2/2) # convection and conduction only
    Eg = 0 #np.array([0,0,0,0,0], 'float64')
    Enet = Eg + Ein - Eout
    T_p1 = (Enet * dt / (m * cp) + T_p) # average bulk temperature of wall after time dt
    T2_surf = (T_p - Eout * R2/2 / A)
    
    return T_p1, Eout, T2_surf



A= R1= R2= R3= R4= m= cp= np.array([1,1,1,1,1], 'float32')
dt= np.array([1,1,1,1,1], 'float32')
T_inf_inside = np.array([250,250,250,250,250], 'float32')
T_inf_outside = np.array([250.2,250.2,250.2,250.2,250.2], 'float32')
T_p_wall = np.array([250.1,250.1,250.1,250.1,250.1], 'float32')

t_max =87000

begin_time = datetime.now()


for x in np.arange(t_max):
    T_p_wall, Enet_wall, Tinside_surf = Q_Walls_vect(A, R1, R2, R3, R4, m, cp, T_inf_inside, T_inf_outside, dt, T_p_wall)
    
end_time = (datetime.now() - begin_time)
print(f"Execution time for numpy vector is {end_time.total_seconds()*1000} ms")

A= R1= R2= R3= R4= m= cp= float(1.1)
dt= float(1)
T_inf_inside = float(250.01)
T_p_wall = float(250.1)
T_inf_outside = float(250.2)

begin_time = datetime.now()


for x in np.arange(t_max):
    for j in range(6):
        T_p_wall, Enet_wall, Tinside_surf = Q_Walls(A, R1, R2, R3, R4, m, cp, T_inf_inside, T_inf_outside, dt, T_p_wall)
    
end_time = (datetime.now() - begin_time)
  
print(f"Execution time for 'for-loop' 6 calls is {end_time.total_seconds()*1000} ms")


A= R1= R2= R3= R4= m= cp= np.float32(1.1)
dt= 1
T_inf_inside = np.float32(250.01)
T_p_wall = np.float32(250.1)
T_inf_outside = np.float32(250.2)

begin_time = datetime.now()


for x in np.arange(t_max):
    for j in range(6):
        T_p_wall, Enet_wall, Tinside_surf = Q_Walls(A, R1, R2, R3, R4, m, cp, T_inf_inside, T_inf_outside, dt, T_p_wall)
    
end_time = (datetime.now() - begin_time)
print(f"Execution time for numpy float32 'for-loop' is {end_time.total_seconds()*1000} ms")

I have some suggestions for your time measurements. Use timeit, this is made for the purpose of measuring execution time. Otherwise something like garbage collection could mess up your results. Furthermore why are there (only) six entries in the array? Numpy will be faster if you deal with much bigger data > O(10^3) — Las, Jun 15 '21 at 21:52

score 2 · Answer 1 · answered Jun 15 '21 at 23:38

There are multiple issues occurring in the code:

Numpy is quite fast for big arrays but not for very small arrays as creating/allocating/freeing temporary arrays is expensive as well as calling native Numpy functions from the Python interpreter.
integer-typed and float32-typed variables are promoted to float64 when you perform such binary operations: [int] BIN_OP [float32] and [float32] BIN_OP [float64] and with a reverse order. This causes more temporary arrays to be created and several implicit conversions to be done, making the code significantly slower.
CPython loops are very slow because CPython is an interpreter.

The second point can be fixed using the following example code:

f32_const_0 = np.float32(0)
f32_const_2 = np.float32(2)

def Q_Walls_float32( A, R1, R2, R3, R4, m, cp, T_inf_inside, T_inf_outside, dt, T_p):
    Ein = (T_inf_outside - T_p) * A / (R2/f32_const_2 + R3 + R4) # convection and conduction only
    Eout = (T_p - T_inf_inside) * A / (R1 + R2/f32_const_2) # convection and conduction only
    Eg = f32_const_0
    Enet = Eg + Ein - Eout
    T_p1 = (Enet * dt / (m * cp) + T_p) # average bulk temperature of wall after time dt
    T2_surf = (T_p - Eout * R2/f32_const_2 / A)
    
    return T_p1, Eout, T2_surf

You can mitigate the cost with Numba (or Cython), but the best is not to use Numpy array for only few elements, or actually to directly do the computation element-wise in Numba so that no a lot of temporary array are created.

Here is an example of Numba code:

from datetime import datetime, timedelta
import numpy as np
import numba as nb

A= R1= R2= R3= R4= m= cp= float(1.1)
dt= float(1)
T_inf_inside = float(250.01)
T_p_wall = float(250.1)
T_inf_outside = float(250.2)

@nb.njit(nb.types.UniTuple(nb.float64,3)(nb.float64, nb.float64, nb.float64, nb.float64, nb.float64, nb.float64, nb.float64, 
                                            nb.float64, nb.float64, nb.float64, nb.float64))
def Q_Walls( A, R1, R2, R3, R4, m, cp, T_inf_inside, T_inf_outside, dt, T_p):
    Ein = (T_inf_outside - T_p) * A / (R2/2 + R3 + R4) # convection and conduction only
    Eout = (T_p - T_inf_inside) * A / (R1 + R2/2) # convection and conduction only
    Eg = 0
    Enet = Eg + Ein - Eout
    T_p1 = (Enet * dt / (m * cp) + T_p) # average bulk temperature of wall after time dt
    T2_surf = (T_p - Eout * R2/2 / A)
    
    return (T_p1, Eout, T2_surf)

@nb.njit(nb.types.UniTuple(nb.float64,3)(nb.float64, nb.float64, nb.float64, nb.float64, nb.float64, nb.float64, nb.float64, 
                                            nb.float64, nb.float64, nb.float64, nb.float64))
def compute_with_numba(A, R1, R2, R3, R4, m, cp, T_inf_inside, T_inf_outside, dt, T_p_wall):
    for x in np.arange(t_max):
        for j in range(6):
            T_p_wall, Enet_wall, Tinside_surf = Q_Walls(A, R1, R2, R3, R4, m, cp, T_inf_inside, T_inf_outside, dt, T_p_wall)
    return (T_p_wall, Enet_wall, Tinside_surf)

begin_time = datetime.now()

T_p_wall, Enet_wall, Tinside_surf = compute_with_numba(A, R1, R2, R3, R4, m, cp, T_inf_inside, T_inf_outside, dt, T_p_wall)
    
end_time = (datetime.now() - begin_time)
  
print(f"Execution time for 'for-loop' 6 calls is {end_time.total_seconds()*1000} ms")

Here are timing results on my machine:

Initial execution:

Execution time for numpy vector is 758.232 ms
Execution time for 'for-loop' 6 calls is 256.093 ms
Execution time for numpy float32 'for-loop' is 3768.253 ms

----------

Fixed execution (Q_Walls_float32):

Execution time for numpy float32 'for-loop' is 839.016 ms

----------

With Numba (compute_with_numba):

Execution time for 'for-loop' 6 calls is 6.311 ms

Python - vector to function is slower than calling for loop on each element. Numpy slower than native Python

1 Answers1