Because your data frame is not sorted on the index, which means all the subsetting has to be done with slow vector scan and fast algorithm like binary search can not be applied; While groupby
always sort the data frame by the group by variable first, you can mimic this behavior by writing a simple algorithm that sort the index and then subset to validate this:
def sort_subset(df):
# sort index and find out the positions that separate groups
df = df.sort_index()
split_indices = np.flatnonzero(np.ediff1d(df.index, to_begin=1, to_end=1))
list_df = []
for i in range(len(split_indices)-1):
start_index = split_indices[i]
end_index = split_indices[i+1]
list_df.append(df.iloc[start_index:end_index])
return list_df
Some timing:
import pandas as pd
import numpy as np
nrow = 1000000
df = pd.DataFrame(np.random.randn(nrow), columns=['x'], index=np.random.randint(100, size=nrow))
index = list(set(df.index))
print('no of groups: ', len(index))
%timeit list_df_1 = [df.loc[x] for x in index]
#no of groups: 100
#13.6 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
#54.8 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Not as fast because my algorithm is not optimized at all but the same order of magnitude
%timeit list_df_3 = sort_subset(df)
#102 ms ± 3.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
list_df_1 = [df.loc[x] for x in index]
list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
list_df_3 = sort_subset(df)
Compare the result:
all(list_df_3[i].eq(list_df_2[i]).all().iat[0] for i in range(len(list_df_2)))
# True
You see a significant speed up if you sort the index before subsetting as well:
def sort_subset_with_loc(df):
df = df.sort_index()
list_df_1 = [df.loc[x] for x in index]
return list_df_1
%timeit sort_subset_with_loc(df)
# 25.4 ms ± 897 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)