I have a large DataFrame df, along with a list indices containing all the unique values in df.index. I now want to create a list of the sub-DataFrames selected by each element of indices; specifically:

list_df = [df.loc[x] for x in indices]

This takes a very long time to run, though: df has about 3e6 rows and 3e3 unique index values. Is this a reasonable way to perform this operation? I would be very happy to receive any comments or suggestions that could improve the performance of this and related problems.

Thanks in advance!

1 Answer

You can use a list comprehension over a groupby object that groups by the index (level=0). Passing sort=False skips the default sorting of the group keys, which makes it faster:

L = [x for i, x in df.groupby(level=0, sort=False)]
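If the pieces need to line up with a given list of labels (like indices in the question), one option is to collect the groups in a dict first and then look them up in the desired order. The following is a minimal sketch on a small made-up frame; the data, the names groups and list_df, and the choice of a sorted indices list are assumptions for illustration only. It also checks each piece against df.loc[[label]], the list form of the lookup, which always returns a DataFrame.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with a repeated (non-unique) index, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"A": rng.choice(list("abc"), 20), "B": rng.integers(10, size=20)},
    index=rng.integers(5, size=20),
)

# Labels in the order the caller wants the sub-frames (assumed: sorted unique values).
indices = sorted(df.index.unique())

# One pass over the frame: map each index label to its sub-DataFrame.
groups = {label: sub for label, sub in df.groupby(level=0, sort=False)}

# Reorder the pieces to match `indices`.
list_df = [groups[label] for label in indices]

# Sanity check: each piece equals the label-based lookup for the same label.
for label, sub in zip(indices, list_df):
    pd.testing.assert_frame_equal(sub, df.loc[[label]])
```

The dict adds one extra lookup per label, but the frame itself is still split only once by groupby.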

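import numpy as np
import pandas as pd

# Sample data: 1000 rows whose index values are drawn from 0..99, so labels repeat
np.random.seed(123)
N = 1000
letters = list('abcdefghijklmno')
df = pd.DataFrame({'A': np.random.choice(letters, N),
                   'B': np.random.randint(10, size=N)},
                  index=np.random.randint(100, size=N))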

In [273]: %timeit [x for i, x in df.groupby(level=0, sort=False)]
100 loops, best of 3: 9.91 ms per loop

In [274]: %timeit [df.loc[x] for x in df.index]
1 loop, best of 3: 417 ms per loop
jezrael