
I am working with a large CSV file. Since I cannot import the whole CSV file into a dataframe at once due to memory limitations, I am using chunks to process the data.

df = pd.read_csv(filepath, chunksize = chunksize)
for chunk in df:
    print(chunk['col2'].describe())

This gives me the stats for each chunk. Is there a way to merge the results from each chunk.describe() call so that I can get the stats for all the data at once?

The only way I can think of right now is to maintain a dictionary to store the stats and update it with each iteration.
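
For concreteness, a rough sketch of that idea (only count, mean, min and max combine exactly this way; percentiles cannot be recovered from per-chunk stats):

import pandas as pd

# running aggregates for one column, updated chunk by chunk
stats = {'count': 0, 'sum': 0.0, 'min': float('inf'), 'max': float('-inf')}

for chunk in pd.read_csv(filepath, chunksize=chunksize):
    col = chunk['col2'].dropna()
    if col.empty:
        continue
    stats['count'] += len(col)
    stats['sum'] += col.sum()
    stats['min'] = min(stats['min'], col.min())
    stats['max'] = max(stats['max'], col.max())

stats['mean'] = stats['sum'] / stats['count']
print(stats)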

newBie
  • You can aggregate certain stats like `mean`, `max`, `min` and `count`s for the entire DataFrame from chunks, but I don't think you can get something like the 25th percentile from chunk statistics. – ALollz Mar 03 '20 at 02:28
  • `dask` wraps pandas for out-of-core (too large for memory) computation. It natively implements `describe`: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.describe – anon01 Mar 03 '20 at 05:25
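
A minimal sketch of the dask route suggested in the comments (assuming dask is installed; filepath and col2 are the same names used in the question):

import dask.dataframe as dd

ddf = dd.read_csv(filepath)               # lazy, reads the CSV in partitions
print(ddf['col2'].describe().compute())   # out-of-core describe for the column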

1 Answer


EDITED:

I played around with this a little. I am new, so take this with a grain of salt:

Load a sample from a remote source

import pandas as pd

df1_iter = pd.read_csv("https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv", 
                       chunksize=5, 
                       iterator=True)

Do a simple for loop to run .describe() and .T on each chunk and append the result to a list.

Next use pd.concat() on df_list

df_list = []

for chunk in df1_iter:
    df_list.append(chunk.describe().T)

df_concat = pd.concat(df_list)

Groupby
For the agg I used functions I thought would be useful; adjust as needed.

desc_df = df_concat.groupby(df_concat.index).agg(
    {
        'mean': 'mean',  # mean of the chunk means (only exact if all chunks are the same size)
        'std': 'std',    # std of the chunk stds -- only a rough stand-in for the overall std
        'min': 'min',    # exact
        '25%': 'mean',   # mean of the chunk percentiles -- approximate
        '50%': 'mean',   # approximate
        '75%': 'mean',   # approximate
        'max': 'max'     # exact
    }
)

print(desc_df)
            mean        std     min         25%         50%         75%      max
am      0.433333   0.223607   0.000    0.333333    0.500000    0.500000    1.000
carb    3.100000   1.293135   1.000    2.250000    2.666667    4.083333    8.000
cyl     6.200000   0.636339   4.000    5.500000    6.000000    7.166667    8.000
disp  232.336667  40.954447  71.100  177.216667  195.233333  281.966667  472.000
drat    3.622833   0.161794   2.760    3.340417    3.649167    3.849583    4.930
gear    3.783333   0.239882   3.000    3.541667    3.916667    3.958333    5.000
hp    158.733333  44.053017  52.000  124.416667  139.333333  191.083333  335.000
mpg    19.753333   2.968229  10.400   16.583333   20.950000   23.133333   33.900
qsec   17.747000   0.868257  14.500   16.948333   17.808333   18.248333   22.900
vs      0.450000   0.102315   0.000    0.208333    0.416667    0.625000    1.000
wt      3.266900   0.598493   1.513    2.850417    3.042500    3.809583    5.424
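
Since mtcars is tiny, you can sanity-check the result against describing the whole file at once (the std and percentile columns will differ a bit, since those are only approximated from the chunk stats):

import pandas as pd

url = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
print(pd.read_csv(url).describe().T)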

I hope this was helpful.

Ukrainian-serge
    This was helpful but kept getting errors because of the next() probably due to the memory usage. `File "pandas\_libs\parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862) File "pandas\_libs\parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)`. Trying to figure out a way to initiate a dataframe for describe without using the `next` to avoid that. – newBie Mar 03 '20 at 17:28
  • My example is simple and small. How large is your file? – Ukrainian-serge Mar 03 '20 at 17:49
  • The file is 348MB. – newBie Mar 03 '20 at 17:56
    It wasn't a memory issue, the way I was using `next` was throwing an exception when there were no chunks left. Your approach works. Thanks @Ukrainian-serge – newBie Mar 03 '20 at 23:11
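
For anyone hitting the same error: calling next() on the chunk reader raises StopIteration once the chunks are exhausted, which a plain for loop handles for you. A minimal illustration (filepath and chunksize as in the question):

import pandas as pd

reader = pd.read_csv(filepath, chunksize=chunksize)

# next() is fine while chunks remain ...
first = reader.__next__() if True else None  # equivalent to next(reader)

# ... but once the reader is exhausted it raises StopIteration,
# so either catch it or let a for loop consume the rest for you:
for chunk in reader:
    print(chunk['col2'].describe())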