19

This is a follow-up question to this one, where jezrael used `pandas.DataFrame.groupby` to speed up a list creation by a factor of several hundred. Specifically, let `df` be a large dataframe; then

index = list(set(df.index))
list_df = [df.loc[x] for x in index]

and

list_df = [x for i,x in df.groupby(level=0, sort=False)]

produce the same result, with the latter being more than 200 times faster than the former, even ignoring the list creation step. Why?

I would be very glad if someone could let me understand why there is such a massive performance difference. Thanks in advance!

Edit: As suggested by Alex Riley in his comment, I confirm that the tests were run on a dataframe with a non-unique and non-monotonic index.

  • 1
    I believe the how is more interesting than the why. – Ignacio Vergara Kausel Oct 10 '17 at 15:25
  • 2
    It looks like you have a non-unique index, possibly also non-monotonic. In such degenerate cases, with each call to `loc`, I believe pandas has to iterate over the *entire* index to build a new array (of the same length as the index) to use for boolean indexing. OTOH, `groupby` just scans the index once and keeps track of the integer locations for each label. I'd have to double check everything in the source to be certain. – Alex Riley Oct 10 '17 at 15:36
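To make the comment above concrete, here is a small sketch (not part of the original post) showing what pandas does for a label lookup on a non-unique, non-monotonic index: `Index.get_loc` has to return a full-length boolean mask rather than a slice or a single position, which implies a scan over the entire index for every `loc` call.

```python
# Illustration of Alex Riley's point: on a non-unique, non-monotonic index,
# pandas cannot binary-search, so `get_loc` falls back to returning a
# boolean mask with one entry per row of the index.
import numpy as np
import pandas as pd

idx = pd.Index([2, 0, 2, 1, 0])   # non-unique and non-monotonic
loc = idx.get_loc(2)

print(type(loc))   # a full-length numpy boolean array, not a slice
print(loc)
```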

1 Answer

12

Because your data frame is not sorted on its index, every subsetting operation has to be done with a slow vector scan, and fast algorithms like binary search cannot be applied. `groupby`, on the other hand, always groups the data frame by the grouping variable first. You can mimic this behavior, and validate the explanation, by writing a simple algorithm that sorts the index and then subsets:

import numpy as np

def sort_subset(df):
    # sort the index so each group occupies a contiguous block of rows
    df = df.sort_index()
    # positions where the index value changes mark the group boundaries
    split_indices = np.flatnonzero(np.ediff1d(df.index, to_begin=1, to_end=1))
    list_df = []
    for i in range(len(split_indices) - 1):
        start_index = split_indices[i]
        end_index = split_indices[i + 1]
        list_df.append(df.iloc[start_index:end_index])
    return list_df

Some timing:

import pandas as pd
import numpy as np
nrow = 1000000
df = pd.DataFrame(np.random.randn(nrow), columns=['x'], index=np.random.randint(100, size=nrow))

index = list(set(df.index))
print('no of groups: ', len(index))
%timeit list_df_1 = [df.loc[x] for x in index]
#no of groups:  100
#13.6 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
#54.8 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Not as fast because my algorithm is not optimized at all but the same order of magnitude
%timeit list_df_3 = sort_subset(df)
#102 ms ± 3.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

list_df_1 = [df.loc[x] for x in index]
list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
list_df_3 = sort_subset(df)

Compare the result:

all(list_df_3[i].eq(list_df_2[i]).all().iat[0] for i in range(len(list_df_2)))
# True

You also see a significant speed-up if you sort the index before subsetting with `loc`:

def sort_subset_with_loc(df):
    df = df.sort_index()
    # `index` is the list of unique labels defined above
    list_df_1 = [df.loc[x] for x in index]
    return list_df_1

%timeit sort_subset_with_loc(df)
# 25.4 ms ± 897 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
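As a complement to the answer, the single-pass strategy Alex Riley described can be sketched directly (this is an illustration of the idea, not pandas' actual `groupby` implementation): scan the index once, record the integer positions of each label, then pull each group out with `iloc`. That is one O(n) pass in total, instead of one O(n) scan per label.

```python
# Hedged sketch of a single-pass grouping: one scan of the index records
# the integer positions of every label, then each group is sliced out
# with `iloc` from those positions.
from collections import defaultdict
import numpy as np
import pandas as pd

def single_pass_subset(df):
    positions = defaultdict(list)
    for pos, label in enumerate(df.index):   # one scan of the index
        positions[label].append(pos)
    return [df.iloc[locs] for locs in positions.values()]

nrow = 1000
df = pd.DataFrame(np.random.randn(nrow), columns=['x'],
                  index=np.random.randint(10, size=nrow))
list_df = single_pass_subset(df)
print(len(list_df))  # one sub-frame per distinct label
```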
– Psidom