How to summarize large dataframes in python pandas (50 columns x 2m rows)

Question

For a project I manipulate a few columns of the dataset, join the newly created columns back to the full dataset, and then summarize on the manipulated fields.

The manipulation and merging are no problem, but groupby doesn't return any results, and I'm wondering how I can find out why. The code runs and the result is printed in the Jupyter notebook, but it contains only the columns I requested and 0 rows.

Is there a limit on the number of columns when using groupby? I'm using 40 groupby columns and 10 amount fields to summarize.

Are there alternatives I can try? I've come across some methods using numpy, which might be more memory-efficient, but I couldn't see an efficient way to apply them to 40 columns.

I have searched online but couldn't find an answer. I'm new to pandas, so before doing a deep dive into this topic, I want to check whether I'm overlooking something or whether there is an easier way to achieve what I want.

Because the dataframe has over 40 columns to group by and around 10 value fields, I have put these in two list objects. This was the first hurdle I conquered, thanks to the following stackoverflow page.

These lists are then used in the groupby call.

# A way I tried to solve this, because of the limit of only 9 variables when entering them directly in groupby.

groupcolumns = ['aa','ab','ac','ad'] #etc
amountcolumns = ['z1', 'z2', 'z3', 'z4'] #etc

df1 = df.groupby(groupcolumns)[amountcolumns].sum
df1.reset_index()

I would expect it to return a DataFrame summarized on the groupcolumns for the amount columns.

Would be great if anyone can help me out! Thanks in advance.

I think it's a problem with the data itself, but it's hard to say without evidence (data). For example, do you have missing values? How do you handle them? — powerPixie, Oct 20 '19 at 10:56
Try this: `df1 = df.groupby(groupcolumns)[amountcolumns].sum()`. If it doesn't work, provide a more reproducible description. — Quant Christo, Oct 20 '19 at 11:34
@powerPixie It's a universal data model, which I can't share, but there are indeed some NaN values, because not all the columns are always populated. Could that be the cause? Should I only include columns with values? — Dubblej, Oct 20 '19 at 15:10
@powerPixie It was indeed one column that had NaN values in it. Thank you for flagging this; I will check how to handle this in future. — Dubblej, Oct 20 '19 at 15:33

score 0 · Answer 1 · answered Oct 20 '19 at 15:36

I noticed that one of the 40 columns had only null values.

I found it with df.info(), removed that field from the groupby, and now it works like a charm.
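
A minimal sketch of that check, assuming a small made-up frame in place of the real data (the column names follow the question, the values are invented for illustration). It lists the group-by columns that contain only nulls and leaves them out of the grouping:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; 'ac' is entirely null.
df = pd.DataFrame({
    "aa": ["x", "x", "y"],
    "ab": ["p", "p", "q"],
    "ac": [np.nan, np.nan, np.nan],
    "z1": [1.0, 2.0, 3.0],
    "z2": [10.0, np.nan, 30.0],
})

groupcolumns = ["aa", "ab", "ac"]
amountcolumns = ["z1", "z2"]

# Group-by columns in which every value is null.
all_null = [c for c in groupcolumns if df[c].isna().all()]

# Keep only the group-by columns that actually hold values.
usable = [c for c in groupcolumns if c not in all_null]

df1 = df.groupby(usable)[amountcolumns].sum().reset_index()
print(all_null)
print(df1)
```

Checking with `isna().all()` is just one way to do it; reading the non-null counts from df.info() gives the same information.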

Perhaps good to share: this only mattered for the columns used in the groupby. I also had some empty fields in the columns included in the sum, and these didn't cause any problems.
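
To illustrate the difference on made-up data: by default, groupby drops every row whose key contains a null, while sum simply skips nulls in the value columns. In pandas 1.1 and later, passing `dropna=False` keeps those rows as a group of their own, which may be an alternative to removing the column. Whether that suits the real dataset is an assumption on my part.

```python
import numpy as np
import pandas as pd

# Invented example data, not the real dataset.
df = pd.DataFrame({
    "aa": ["x", "x", "y"],
    "ab": [np.nan, np.nan, np.nan],  # key column with only nulls
    "z1": [1.0, np.nan, 3.0],        # value column with one null
})

# Default behaviour: rows with a null key are dropped, so nothing is left.
default_result = df.groupby(["aa", "ab"])["z1"].sum().reset_index()

# pandas >= 1.1: keep null keys as a group of their own.
kept_result = df.groupby(["aa", "ab"], dropna=False)["z1"].sum().reset_index()

print(len(default_result))
print(kept_result)
```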

Thank you @powerPixie!!
