7

I work with apis that return large pandas dataframes. I'm not aware of a fast way to iterate through the dataframe directly so I cast to a dictionary with to_dict().

After my data is in dictionary form, the performance is fine. However, the to_dict() operation tends to be a performance bottleneck.

I often group columns of the dataframe together to form multi-index and use the 'index' orientation for to_dict(). Not sure if the large multi-index drives the poor performance.

Is there a faster way to cast a pandas dataframe? Maybe there is a better way to iterate directly over the dataframe without any cast? Not sure if there is a way I could apply vectorization.

Below I give sample code which mimics the issue with timings:

import pandas as pd
import random as rd
import time

#Given a dataframe from api (model as random numbers)
df_columns = ['A','B','C','D','F','G','H','I']
dict_origin = {col:[rd.randint(0,10) for x in range(0,1000)] for col in df_columns}
dict_origin = pd.DataFrame(dict_origin)

#Transform to pivot table
t0 = time.time()
df_pivot = pd.pivot_table(dict_origin,values=df_columns[-3:],index=df_columns[:-3])
t1 = time.time()
print('Pivot Construction takes: ' + str(t1-t0))

#Iterate over all elements in pivot table
t0 = time.time()
for column in df_pivot.columns:
    for row in df_pivot[column].index:
        test = df_pivot[column].loc[row]
t1 = time.time()
print('Dataframe iteration takes: ' + str(t1-t0))


#Iteration over dataframe too slow. Cast to dictionary (bottleneck)
t0 = time.time()
df_pivot = df_pivot.to_dict('index')
t1 = time.time()
print('Cast to dictionary takes: ' + str(t1-t0))

#Iteration over dictionary is much faster
t0 = time.time()
for row in df_pivot.keys():
    for column in df_pivot[row]:
        test = df_pivot[row][column]
t1 = time.time()
print('Iteration over dictionary takes: ' + str(t1-t0))

Thank you!

  • Thanks for your feedback jpp. I added some timings in my post. Also, I tend to pivot on my data with multi-index. Not sure if that is the bottleneck. Maybe there is a better way to do the pivot. – Anthony Edward Maylath Jan 26 '19 at 18:55
  • Thanks jpp. I timed the code in exactly the same way as described in your link, I guess you are looking for the exact numbers? Let me see if I can create a toy example which shows the problem. – Anthony Edward Maylath Jan 26 '19 at 19:06
  • 1
    Posted a toy example which mimics my issue. Dataframe iteration is slow so I cast to a dictionary. The cast takes time as well. – Anthony Edward Maylath Jan 26 '19 at 20:03

1 Answers1

6

The common guidance is don't iterate, use functions on all rows columns, or grouped rows/columns. Below, in the third code block shows how to iterate over the numpy array whhich is the .values attribute. The results are:

Pivot Construction takes: 0.012315988540649414

Dataframe iteration takes: 0.32346272468566895

Iteration over values takes: 0.004369020462036133

Cast to dictionary takes: 0.023524761199951172

Iteration over dictionary takes: 0.0010480880737304688

import pandas as pd
from io import StringIO 

# Test data
import pandas as pd
import random as rd
import time

#Given a dataframe from api (model as random numbers)
df_columns = ['A','B','C','D','F','G','H','I']
dict_origin = {col:[rd.randint(0,10) for x in range(0,1000)] for col in df_columns}
dict_origin = pd.DataFrame(dict_origin)

#Transform to pivot table
t0 = time.time()
df_pivot = pd.pivot_table(dict_origin,values=df_columns[-3:],index=df_columns[:-3])
t1 = time.time()
print('Pivot Construction takes: ' + str(t1-t0))

#Iterate over all elements in pivot table
t0 = time.time()
for column in df_pivot.columns:
    for row in df_pivot[column].index:
        test = df_pivot[column].loc[row]
t1 = time.time()
print('Dataframe iteration takes: ' + str(t1-t0))

#Iterate over all values in pivot table
t0 = time.time()
v = df_pivot.values
for row in range(df_pivot.shape[0]):
    for column in range(df_pivot.shape[1]):
        test = v[row, column]
t1 = time.time()
print('Iteration over values takes: ' + str(t1-t0))


#Iteration over dataframe too slow. Cast to dictionary (bottleneck)
t0 = time.time()
df_pivot = df_pivot.to_dict('index')
t1 = time.time()
print('Cast to dictionary takes: ' + str(t1-t0))

#Iteration over dictionary is much faster
t0 = time.time()
for row in df_pivot.keys():
    for column in df_pivot[row]:
        test = df_pivot[row][column]
t1 = time.time()
print('Iteration over dictionary takes: ' + str(t1-t0))
Andrew Lavers
  • 4,328
  • 1
  • 12
  • 19
  • Very nice solution! If I understand correctly, the key is to use the `values` field of the dataframe and iterate over that. However, it seems I lose the index and column names if I work with the numpy array. – Anthony Edward Maylath Jan 27 '19 at 00:12
  • Yes - its just a 2d array. But, generally iterating each element is not needed. Most operations can be accomplished without iterating. If you provide an example of the kind of calculations you do, perhaps we can find another way. – Andrew Lavers Jan 27 '19 at 00:30