Python, Pandas: Vectorization opportunities + Avoiding nested loops?

Question

Below is my code, which currently uses two nested loops: the outer loop runs a fixed number of iterations over an input df, and the inner loop compares a sequence of freshly generated random numbers against values from that df.

While the current approach produces the output I expect, I suspect it could be done better, particularly when the outer loop runs more than a few million iterations and the df has close to a hundred columns.

I'd like to know whether I'm missing a trick or two that I could try to implement.

import numpy as np
import pandas as pd

# Input df - index has the same length as the number of inner-loop iterations defined below
# 'cumulative' column value is used for comparison against a random number inside the inner loop
# 'units_A' is useful data captured from each iteration of the inner loop and aggregated after exiting it
df_reference = pd.DataFrame(index=np.arange(1,11,1),data={'cumulative':np.arange(0.1,1.1,0.1),'units_A':np.arange(10,101,10)})

# Variable that determines num rows in output df
num_iterations_outer = 20
# Variable that determines number of iterations for inner loop operation
num_iterations_inner = 10
# Create an empty output df that will be updated at end
df_out = pd.DataFrame(columns=['cumulative','units_A'])

# Using np array for comparison inside loop instead of comparing against column which takes much longer
compare_against_arr = df_reference['cumulative'].values
# Create a list to store the df's that will become rows of the output df. Appending to a list and concatenating once is cheaper than concatenating inside the loop
output_df_rows_list = []

for outer_iteration_num in np.arange(num_iterations_outer):
    #current_cumulative_val = 1
    # Rotation num is reset to 1 at the start of every outer iteration
    current_rotation_num = 1
    # Create an empty list to store all rotation_num that are generated from inner loop iteration
    rotations_list = []
    for inner_iteration_num in np.arange(1,num_iterations_inner+1):
        # Get a random number in the half-open interval [0.0, 1.0)
        comparator = np.random.random()
        # Add the current rotation num to the list created before entering inner loop. Use the rotations list to get corresponding units_A after exiting inner loop
        rotations_list.append(current_rotation_num)
        # Compare random num 'comparator' to the cumulative value for the current rotation.
        # Rotation numbers are 1-based index labels, while the array is 0-based, hence the - 1.
        if comparator < compare_against_arr[current_rotation_num - 1]:
            # Reset rotation_num back to 1
            current_rotation_num = 1
        else:
            # Increment rotation_num
            current_rotation_num += 1
    df_units_A_by_rotation = df_reference.reindex(rotations_list)
    df_units_A_agg_outer_iter = pd.DataFrame(data=df_units_A_by_rotation.sum()).transpose()    
    output_df_rows_list.append(df_units_A_agg_outer_iter)

# Output df is created by concatenating all the df's stored in the list built in the outer loop above
df_out = pd.concat(output_df_rows_list)
# Reset index so that it runs from 0 to num_iterations_outer - 1
df_out.index = np.arange(num_iterations_outer)

I appreciate your time, and thank you for taking a look!
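
One direction that may be worth trying, sketched here under stated assumptions rather than as a definitive answer: the inner loop is sequential within a single outer iteration, but the outer iterations are independent of one another, so all of them can be advanced together with NumPy arrays while only the short inner loop stays in Python. The function name `simulate_vectorized`, the use of `np.random.default_rng`, and the clamp on the last row are assumptions of this sketch, not part of the code above. It also assumes that rotation `r` is compared with the `cumulative` value at index label `r` (array position `r - 1`), and it draws random numbers in a different order than the nested loops, so results should agree in distribution, not draw for draw.

```python
import numpy as np
import pandas as pd


def simulate_vectorized(df_reference, num_outer, num_inner, rng):
    """Advance all outer iterations together, one inner step at a time (hypothetical helper)."""
    # Lookup tables by position: rotation r (a 1-based label) lives at position r - 1.
    values = df_reference.to_numpy(dtype=float)        # shape (n_rotations, n_columns)
    cumulative = df_reference["cumulative"].to_numpy()
    max_position = len(df_reference) - 1

    # Every simulation starts at rotation 1, i.e. position 0.
    position = np.zeros(num_outer, dtype=np.intp)
    totals = np.zeros((num_outer, values.shape[1]))

    for _ in range(num_inner):
        # Accumulate the row for each simulation's current rotation.
        totals += values[position]
        # One random draw per simulation for this inner step.
        draws = rng.random(num_outer)
        reset = draws < cumulative[position]
        # Reset to rotation 1 or move to the next rotation.
        # The clamp is an assumption that keeps positions inside the table.
        position = np.where(reset, 0, np.minimum(position + 1, max_position))

    return pd.DataFrame(totals, columns=df_reference.columns)


# Same sample input as in the question.
df_reference = pd.DataFrame(
    index=np.arange(1, 11, 1),
    data={"cumulative": np.arange(0.1, 1.1, 0.1), "units_A": np.arange(10, 101, 10)},
)

rng = np.random.default_rng(seed=0)
df_out = simulate_vectorized(df_reference, num_outer=20, num_inner=10, rng=rng)
print(df_out.head())
```

With several million outer iterations and around a hundred columns, the `totals` array alone could take a few gigabytes, so calling the function on chunks of outer iterations and concatenating the results may be more practical than one large call.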

This doesn't really answer your question, but don't use `np.arange` to loop over a range, use the built-in `range`. It will be much faster - juanpa.arrivillaga, Jul 16 '18 at 16:03
@juanpa.arrivillaga Thank you! That is a massive improvement. Could you explain why? When I searched before using np.arange I found the top SO result that seemed to indicate otherwise: https://stackoverflow.com/questions/10698858/built-in-range-or-numpy-arange-which-is-more-efficient - pazza, Jul 16 '18 at 16:39
Because `np.arange` materializes the entire `numpy.ndarray` object, which is unnecessary and slow, whereas `range` objects don't create lists and lazily provide the values as you iterate. Iterating over a `numpy.ndarray` object using a Python-level for-loop is very slow (much slower than looping over a `list` or `range` object). Use `numpy.ndarray` for *vectorized operations*. The top answer in that link actually states this and shows different timing results of iterating over `range` (`xrange` in Python 2) vs iterating over the array that results from `np.arange` - juanpa.arrivillaga, Jul 16 '18 at 18:06
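
The difference described in the comments above can be checked with a rough timing sketch like the following. The function names and the loop size `n` are arbitrary choices made for illustration, and the absolute numbers will vary by machine and NumPy version, so no timings are quoted here.

```python
import timeit

import numpy as np

n = 200_000  # arbitrary loop size chosen for illustration


def loop_over_range():
    # Python-level loop over a lazy range object.
    total = 0
    for i in range(n):
        total += i
    return total


def loop_over_arange():
    # Python-level loop over a materialized array; each element is
    # handed to Python as a separate NumPy scalar object.
    total = 0
    for i in np.arange(n, dtype=np.int64):
        total += i
    return int(total)


def vectorized_sum():
    # The kind of work NumPy arrays are meant for: no Python-level loop.
    return int(np.arange(n, dtype=np.int64).sum())


for func in (loop_over_range, loop_over_arange, vectorized_sum):
    seconds = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{func.__name__}: {seconds:.4f} s")
```

All three functions compute the same sum, so any difference in the printed timings comes from how the values are produced and consumed rather than from the arithmetic itself.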
