Very large data sets are typically analysed with the help of distributed processing tools such as Apache Hadoop (which is available as part of the Amazon EMR service). Hadoop can split processing across multiple servers (nodes), achieving much better speed and throughput by working in parallel.
I took a look at one of the data set directories and found these files:
$ aws s3 ls s3://nasanex/NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/
2013-09-29 17:58:42 1344734800 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc
2013-10-09 05:08:17 83 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc.md5
2013-09-29 18:18:00 1344715511 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc
2013-10-09 05:14:49 83 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc.md5
2013-09-29 18:15:33 1344778298 tasmax_ens-avg_amon_rcp26_CONUS_201601-202012.nc
2013-10-09 05:17:37 83 tasmax_ens-avg_amon_rcp26_CONUS_201601-202012.nc.md5
2013-09-29 18:20:42 1344775120 tasmax_ens-avg_amon_rcp26_CONUS_202101-202512.nc
2013-10-09 05:07:30 83 tasmax_ens-avg_amon_rcp26_CONUS_202101-202512.nc.md5
...
Each data file in this directory is about 1.3GB, accompanied by an MD5 file for verifying the file contents via a checksum. As of the 9th of August 2023, the CONUS folder contains 152 files (~95GB).
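If you want to reproduce those figures programmatically rather than with the CLI, a short boto3 listing will do. This is just a sketch: it assumes AWS credentials are already configured (on EC2 the instance role is enough), and the bucket and prefix are simply the ones from the aws s3 ls command above.

import boto3

# Sketch: count the objects and total size under the CONUS prefix.
BUCKET = "nasanex"
PREFIX = "NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

count = 0
total_bytes = 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        count += 1
        total_bytes += obj["Size"]

print(f"{count} objects, {total_bytes / 1e9:.1f} GB")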
I downloaded one of these files:
$ aws s3 cp s3://nasanex/NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc .
Completed 160 of 160 part(s) with 1 file(s) remaining
The aws s3 cp command used a multi-part download to retrieve the file. Depending on your internet connection speed, this may take a few seconds or longer.
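The same download can be scripted with boto3. The snippet below is only a sketch of what the CLI is doing for you: the bucket and key come from the aws s3 cp command above, and the TransferConfig numbers are illustrative values, not anything the dataset requires.

import boto3
from boto3.s3.transfer import TransferConfig

BUCKET = "nasanex"
KEY = ("NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/"
       "tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc")

s3 = boto3.client("s3")

# Multi-part settings: files above the threshold are fetched in chunk-sized
# parts by several threads, much like the CLI does.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multi-part above 64 MB
    multipart_chunksize=8 * 1024 * 1024,   # 8 MB parts
    max_concurrency=10,                    # parallel download threads
)

s3.download_file(BUCKET, KEY,
                 "tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc",
                 Config=config)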
The result is a local file that can be accessed via Python:
$ ls -l
total 1313244
-rw-rw-r-- 1 ec2-user ec2-user 1344734800 Sep 29 2013 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc
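Since each data file ships with an .md5 sidecar, it is worth verifying the download. A minimal check in Python, assuming you also copied down the matching .md5 file and that it holds the hex digest (md5sum-style, optionally followed by the filename):

import hashlib

DATA_FILE = "tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc"
MD5_FILE = DATA_FILE + ".md5"

# Compute the MD5 of the downloaded file, reading it in 1 MB chunks.
digest = hashlib.md5()
with open(DATA_FILE, "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        digest.update(chunk)

# Assumes the sidecar's first token is the expected hex digest.
with open(MD5_FILE) as f:
    expected = f.read().split()[0].lower()

print("checksum OK" if digest.hexdigest() == expected else "checksum MISMATCH")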
It is in .nc format, which is NetCDF (Network Common Data Form).
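NetCDF files can be read in Python with the netCDF4 package (xarray works too). Here is a quick look at the file's structure; the variable names below (tasmax, lat, lon, time) are assumptions based on the filename, so check ds.variables to confirm what the file actually contains.

from netCDF4 import Dataset

ds = Dataset("tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc")

print(ds.dimensions.keys())      # expected: time, lat, lon
print(ds.variables.keys())       # all variables stored in the file

tasmax = ds.variables["tasmax"]  # assumed variable name
print(tasmax.shape)              # e.g. (time, lat, lon)
print(getattr(tasmax, "units", "no units attribute"))

ds.close()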
I recommend processing a few files at a time, in line with your volume size, since EBS data volumes have a maximum size of 16 TiB.
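A rough batching pattern (my own sketch, not anything prescribed by the dataset): list the .nc keys, then download, process, and delete a handful at a time so the local working set stays far below the volume size. The process() function here is just a placeholder for whatever analysis you run on each file.

import os
import boto3

BUCKET = "nasanex"
PREFIX = "NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/"
BATCH_SIZE = 4  # roughly 4 x 1.3 GB of local storage at a time

s3 = boto3.client("s3")

# Collect the .nc keys under the prefix.
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys += [o["Key"] for o in page.get("Contents", []) if o["Key"].endswith(".nc")]

def process(path):
    # Placeholder for the real analysis of one NetCDF file.
    print("processing", path)

for start in range(0, len(keys), BATCH_SIZE):
    batch = keys[start:start + BATCH_SIZE]
    local_files = []
    for key in batch:
        local = os.path.basename(key)
        s3.download_file(BUCKET, key, local)
        local_files.append(local)
    for path in local_files:
        process(path)
        os.remove(path)  # free the space before the next batch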