Fastest way to split a pandas dataframe into a list of subdataframes

Question

I have a large dataframe df and a list, indices, of the unique elements in df.index. I want to create a list of all the subdataframes selected by the elements of indices; specifically:

list_df = [df.loc[x] for x in indices]

Running this command is taking ages though (df has about 3e6 rows, and 3e3 unique indices). Is this a reasonable way to perform this operation? I would be very happy to receive any kind of comments or suggestions that could improve the performance of this and related problems.

Thanks in advance!

I would be glad if the downvoter could let me know how to improve my question. Thanks! — Giovanni De Gaetano, Oct 10 '17 at 14:55

jezrael · Accepted Answer · 2017-10-10T13:46:15.193

6

You can use a list comprehension over a groupby object grouped by the index (level=0). Passing sort=False turns off the default sorting of the group keys, which makes the solution a bit faster:

L = [x for i, x in df.groupby(level=0, sort=False)]

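import numpy as np
import pandas as pd

# Sample data for the timings: 1000 rows, up to 100 distinct index labels
np.random.seed(123)
N = 1000
letters = list('abcdefghijklmno')  # renamed so it does not reuse the name L from above
df = pd.DataFrame({'A': np.random.choice(letters, N),
                   'B': np.random.randint(10, size=N)}, index=np.random.randint(100, size=N))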

In [273]: %timeit [x for i, x in df.groupby(level=0, sort=False)]
100 loops, best of 3: 9.91 ms per loop

In [274]: %timeit [df.loc[x] for x in df.index]
1 loop, best of 3: 417 ms per loop
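
As a rough check that both approaches produce the same pieces, here is a minimal sketch on made-up data (the frame, the seed and the names groups, lookups and by_label are only illustrative, not taken from the question). It assumes the lookup is written as df.loc[[x]], with a list, so that a label occurring only once still gives a DataFrame rather than a Series. The likely reason for the speed difference is that groupby works out the group membership of every row once, while each df.loc call searches the non-unique index again.

```python
import numpy as np
import pandas as pd

# Illustrative data: 200 rows with a non-unique integer index
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"A": rng.integers(0, 10, size=200), "B": rng.random(200)},
    index=rng.integers(0, 20, size=200),
)

# groupby on the index; sort=False keeps groups in order of first appearance
groups = [g for _, g in df.groupby(level=0, sort=False)]

# Label-by-label lookup, using a list so every result is a DataFrame
lookups = [df.loc[[x]] for x in df.index.unique()]

# Both lists should hold the same sub-dataframes in the same order
same = len(groups) == len(lookups) and all(
    g.equals(l) for g, l in zip(groups, lookups)
)
print(same)

# If the label is needed later, a dict keeps it next to its sub-dataframe
by_label = {k: g for k, g in df.groupby(level=0, sort=False)}
```

The dict variant can be handy when the sub-dataframes need to be looked up by label afterwards instead of by position in the list.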

edited Oct 10 '17 at 13:46

answered Oct 10 '17 at 13:27

jezrael


Thanks for the very fast reply! I'm going to try out if this solution is faster. – Giovanni De Gaetano Oct 10 '17 at 13:30
Thanks, it is massively faster! Could you explain why this happens? – Giovanni De Gaetano Oct 10 '17 at 13:36
On my personal example: `197.030567884 seconds` for my solution, and `1.07291507721 seconds` for jezrael's solution, which goes down to `0.949830770493 seconds` if `sort=False`. – Giovanni De Gaetano Oct 10 '17 at 13:38
Just came across this solution and I love it! Thanks @jezrael! – nilsfl Jul 20 '23 at 10:45
