I'm curious about the differences between a few methods of checking how big an xarray Dataset is, as I decide whether or not I can successfully load it all into memory.
I open a set of precipitation NetCDF files with:

import xarray as xr

imerg_xds = xr.open_mfdataset('../data/IMERG/3B-DAY.MS.MRG.3IMERG.2018*.nc4',
                              combine='by_coords', parallel=True)
imerg_xds
And the output is:
<xarray.Dataset>
Dimensions: (lat: 501, lon: 550, nv: 2, time: 303)
Coordinates:
* lon (lon) float32 -84.95 -84.85 ... -30.049992
* nv (nv) float32 0.0 1.0
* lat (lat) float32 -35.05 -34.95 ... 14.950002
* time (time) object 2018-02-01 00:00:00 ... 2018-11-30 00:00:00
Data variables:
precipitationCal (time, lon, lat) float32 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
HQprecipitation (time, lon, lat) float32 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
precipitationCal_cnt (time, lon, lat) int16 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
randomError (time, lon, lat) float32 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
randomError_cnt (time, lon, lat) int16 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
time_bnds (time, nv) object dask.array<chunksize=(1, 2), meta=np.ndarray>
precipitationCal_cnt_cond (time, lon, lat) int16 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
HQprecipitation_cnt (time, lon, lat) int16 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
HQprecipitation_cnt_cond (time, lon, lat) int16 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
Attributes:
BeginDate: 2018-02-01
BeginTime: 00:00:00.000Z
EndDate: 2018-02-01
EndTime: 23:59:59.999Z
FileHeader: StartGranuleDateTime=2018-02-01T00:00:00.000Z;\nStopGran...
InputPointer: 3B-HHR.MS.MRG.3IMERG.20180201-S000000-E002959.0000.V06B....
title: GPM IMERG Final Precipitation L3 1 day 0.1 degree x 0.1 ...
DOI: 10.5067/GPM/IMERGDF/DAY/06
ProductionTime: 2019-06-17T17:37:44.330Z
history: 2019-10-22 15:41:09 GMT Hyrax-1.15.1 https://gpm1.gesdis...
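Since my real goal is deciding whether the whole thing fits in memory, I figured I could also check totals at the Dataset level (a quick sketch using the same variable name as above; my understanding is that nbytes reports the uncompressed in-memory footprint, but that's part of what I'm asking about):

# Total size across all variables in the Dataset, in GB
# (assumes Dataset.nbytes sums the nbytes of every variable)
print("Dataset nbytes in GB:", imerg_xds.nbytes / (1024**3))

# Per-variable breakdown in MB
for name, da in imerg_xds.data_vars.items():
    print(name, da.nbytes / (1024**2), "MB")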
Then I want to check the size, so I use:
print("sys.getsizeof result:", sys.getsizeof(imerg_xds.HQprecipitation))
print("nbytes in MB:", imerg_xds.HQprecipitation.nbytes / (1024*1024))
print("DataArray.size in MB", imerg_xds.HQprecipitation.size / (1024*1024))
And the result is:
sys.getsizeof result: 96
nbytes in MB: 318.49536895751953
DataArray.size in MB 79.62384223937988
I assume that because I'm using parallel=True the dataset isn't actually in memory, which is why sys.getsizeof is so low. But what is the difference between nbytes and DataArray.size? Could this be due to netCDF compression, with DataArray.size reporting the compressed size?
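In case the arithmetic helps pin this down, here's the sanity check I'd run (a sketch; da is just shorthand for the variable above). If size is a plain element count, then size times the dtype's item size should land exactly on nbytes, whereas if size reflected compressed bytes I wouldn't expect them to line up:

da = imerg_xds.HQprecipitation

print("elements:", da.size)                          # 303 * 550 * 501 from the dims above
print("bytes per element:", da.dtype.itemsize)       # 4 for float32
print("elements * itemsize:", da.size * da.dtype.itemsize)
print("nbytes:", da.nbytes)                          # should match the line above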