
I'm curious about the differences between a few methods of checking how big an xarray Dataset is, as I decide whether I can successfully load it all into memory.

I open a set of precipitation netCDF files with:

import xarray as xr

imerg_xds = xr.open_mfdataset('../data/IMERG/3B-DAY.MS.MRG.3IMERG.2018*.nc4', combine='by_coords', parallel=True)
imerg_xds

And the output is:

<xarray.Dataset>
Dimensions:                    (lat: 501, lon: 550, nv: 2, time: 303)
Coordinates:
  * lon                        (lon) float32 -84.95 -84.85 ... -30.049992
  * nv                         (nv) float32 0.0 1.0
  * lat                        (lat) float32 -35.05 -34.95 ... 14.950002
  * time                       (time) object 2018-02-01 00:00:00 ... 2018-11-30 00:00:00
Data variables:
    precipitationCal           (time, lon, lat) float32 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
    HQprecipitation            (time, lon, lat) float32 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
    precipitationCal_cnt       (time, lon, lat) int16 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
    randomError                (time, lon, lat) float32 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
    randomError_cnt            (time, lon, lat) int16 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
    time_bnds                  (time, nv) object dask.array<chunksize=(1, 2), meta=np.ndarray>
    precipitationCal_cnt_cond  (time, lon, lat) int16 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
    HQprecipitation_cnt        (time, lon, lat) int16 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
    HQprecipitation_cnt_cond   (time, lon, lat) int16 dask.array<chunksize=(1, 550, 501), meta=np.ndarray>
Attributes:
    BeginDate:       2018-02-01
    BeginTime:       00:00:00.000Z
    EndDate:         2018-02-01
    EndTime:         23:59:59.999Z
    FileHeader:      StartGranuleDateTime=2018-02-01T00:00:00.000Z;\nStopGran...
    InputPointer:    3B-HHR.MS.MRG.3IMERG.20180201-S000000-E002959.0000.V06B....
    title:           GPM IMERG Final Precipitation L3 1 day 0.1 degree x 0.1 ...
    DOI:             10.5067/GPM/IMERGDF/DAY/06
    ProductionTime:  2019-06-17T17:37:44.330Z
    history:         2019-10-22 15:41:09 GMT Hyrax-1.15.1 https://gpm1.gesdis...

Then I want to check the size, so I use:

print("sys.getsizeof result:", sys.getsizeof(imerg_xds.HQprecipitation))
print("nbytes in MB:", imerg_xds.HQprecipitation.nbytes / (1024*1024))
print("DataArray.size in MB", imerg_xds.HQprecipitation.size / (1024*1024))

And the result is:

sys.getsizeof result: 96
nbytes in MB: 318.49536895751953
DataArray.size in MB 79.62384223937988

I assume that because I'm using parallel=True the dataset isn't actually in memory, which is why sys.getsizeof is so low, but what is the difference between nbytes and DataArray.size? Could netCDF compression explain it, i.e. is DataArray.size reporting the compressed size?
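
For reference, this is the kind of pre-load check I'm trying to do (a minimal sketch; using psutil to read available RAM is just my assumption, any memory-reporting tool would work):

import psutil

# Dataset.nbytes is computed from shape and dtype metadata alone, so it
# reports the uncompressed in-memory footprint without loading anything
needed_mb = imerg_xds.nbytes / (1024 * 1024)

# RAM currently available on this machine
available_mb = psutil.virtual_memory().available / (1024 * 1024)

print(f"need ~{needed_mb:.1f} MB, have {available_mb:.1f} MB free")
if needed_mb < available_mb:
    imerg_xds.load()  # pull the lazy dask-backed arrays into memory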

clifgray
  • `getsizeof` is not a reliable measure of size. It's ok when dealing with `numpy` arrays, provided they aren't `views`. You have to have a clear idea of how the object is stored to get anything useful from it. – hpaulj Nov 04 '19 at 17:40
  • 2
    The `size` operation is returning the number of elements in the array, not the memory size! If you see the `nbytes` value is 4 times the `size` value.This is because the `HQprecipitation` field is a `float32`, as it can be confirmed from the main print output. So the mismatch between memory and size will depend on the type of the elements, e.g. each `float32` requires a 4 bytes hence `nbytes` = 4 `size`, `float64` requires 8 bytes so it is `nbytes` = 8 `size`, etc. – cvr Dec 18 '20 at 14:28
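
As a quick sanity check of that last explanation, the arithmetic works out exactly with the dimensions from the repr above:

# (time: 303, lon: 550, lat: 501), float32 = 4 bytes per element
n_elements = 303 * 550 * 501              # 83,491,650 elements
print(n_elements / (1024 * 1024))         # 79.6238... -> "DataArray.size in MB"
print(n_elements * 4 / (1024 * 1024))     # 318.4953... -> "nbytes in MB"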

0 Answers