
AWS has several public "big data" data sets available. Some are hosted for free on EBS, and others, like the NASA NEX climate data, are hosted on S3. I have found more discussion on how to work with the data sets hosted on EBS, but I have been unable to access an S3-hosted data set from an EC2 instance quickly enough to actually work with the data.

So my issue is getting the public big data sets (~256 TB) "into" an EC2 instance. One approach I tried was to mount the public S3 bucket to my EC2 instance, as in this tutorial. However, when attempting to use Python to evaluate the mounted data, the processing times were very, very slow.

I am starting to think that using the AWS CLI (cp or sync) may be the correct approach, but I am still having difficulty finding documentation on this with respect to large, public S3 data sets.
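For concreteness, the sort of per-file copy I have in mind would look something like this (a rough boto3 sketch against the public nasanex bucket; the key is one of the NEX NetCDF files, and the anonymous-access setting is an assumption on my part):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# The NEX data sets are public, so unsigned (anonymous) requests should work.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

bucket = "nasanex"
key = ("NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/"
       "tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc")

# Copy a single object onto local (EBS-backed) storage rather than
# reading it through a FUSE mount.
s3.download_file(bucket, key, "tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc")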

In short, is mounting the best way to work with AWS's public S3 big data sets, is the CLI better, is this an EMR problem, or does the issue lie entirely in instance size and/or bandwidth?

csg2136
  • Getting 256T onto an EC2 instance will take a huge amount of time and will be too expensive. Why are you not keeping it in S3? – Piyush Patil Jul 22 '16 at 19:51
  • I don't understand: do you use EMR with multiple instances or just one instance? 256T is really a huge amount of data for only one instance. Do you use Hadoop and/or Spark or another framework to process the data? – ar-ms Jul 22 '16 at 19:59
  • @error2007s This is a public data set, and I need to be able to use Python to analyze the data. Mounting to an EC2 instance is fairly straightforward, but when analyzing the data, it is very slow. My question is about this connection. I plan to leave the data in S3, but I don't know how to connect to it in a way that lets me process it. – csg2136 Jul 22 '16 at 20:08
  • So you want to increase the Python processing speed for analyzing the data set, right? – Piyush Patil Jul 22 '16 at 20:09
  • @Koffee I do not use EMR and I am not familiar with the Hadoop framework. That would indeed be my next step, but if there is any way to connect to the data from a single instance without a cluster, that would be ideal, given my inexperience with those frameworks. Do you think EMR is the only way to handle this amount of data? I do not need all 256T per se, but I need to be able to take subsets of the data files, which are stored as NetCDFs. – csg2136 Jul 22 '16 at 20:11
  • @error2007s I'm less concerned with the Python processing speed itself, and more with the amount of time it takes to load the data from the mounted S3 bucket. For instance, once I mount the public bucket, I use Python to load one of the data files (`xarray.open_dataset('xyz.nc')`). The time it takes to simply load that file into Python indicates that it is far too slow to reasonably do any kind of analysis on this data set's multiple files. – csg2136 Jul 22 '16 at 20:15
  • s3fs is not really an appropriate tool here. It takes something that is not a filesystem (S3) and makes a valiant effort towards imposing filesystem semantics on it, but there can be no perfect implementation of a bridge across such an impedance gap. At the risk of asking the obvious, is your EC2 instance in us-east-1? Because IIRC that is where the public data sets are stored. Performance in another region would be dismal by comparison. – Michael - sqlbot Jul 23 '16 at 04:41

1 Answer


Very large data sets are typically analysed with the help of distributed processing tools such as Apache Hadoop (which is available as part of the Amazon EMR service). Hadoop can split processing between multiple servers (nodes), achieving much better speed and throughput by working in parallel.

I took a look at one of the data set directories and found these files:

$ aws s3 ls s3://nasanex/NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/

2013-09-29 17:58:42 1344734800 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc
2013-10-09 05:08:17         83 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc.md5
2013-09-29 18:18:00 1344715511 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc
2013-10-09 05:14:49         83 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc.md5
2013-09-29 18:15:33 1344778298 tasmax_ens-avg_amon_rcp26_CONUS_201601-202012.nc
2013-10-09 05:17:37         83 tasmax_ens-avg_amon_rcp26_CONUS_201601-202012.nc.md5
2013-09-29 18:20:42 1344775120 tasmax_ens-avg_amon_rcp26_CONUS_202101-202512.nc
2013-10-09 05:07:30         83 tasmax_ens-avg_amon_rcp26_CONUS_202101-202512.nc.md5
...

Each data file in this directory is about 1.3 GB (each accompanied by an MD5 file for verifying the file contents via a checksum). As of 9 August 2023, the CONUS folder contains 152 files (~95 GB).

I downloaded one of these files:

$ aws s3 cp s3://nasanex/NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc .
Completed 160 of 160 part(s) with 1 file(s) remaining

The aws s3 cp command used a multi-part download to retrieve the file. Depending on the instance's network bandwidth to S3, this may take only a few seconds.

The result is a local file that can be accessed via Python:

$ ls -l
total 1313244
-rw-rw-r-- 1 ec2-user ec2-user 1344734800 Sep 29  2013 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc

It is in .nc format, which I think is a NetCDF.
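For a quick look at the contents from Python, something like this should work against the local copy (a minimal sketch; it assumes the xarray and netCDF4 packages are installed, and the variable name tasmax is a guess based on the file name):

import xarray as xr

# Open the locally downloaded file (much faster than reading it through
# an S3 FUSE mount).
ds = xr.open_dataset("tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc")
print(ds)  # shows dimensions, coordinates and variables

# Example subset: the first monthly field of 'tasmax' (adjust the variable
# name to whatever print(ds) reports).
first_month = ds["tasmax"].isel(time=0)
print(float(first_month.mean()))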

I recommend processing a few files at a time, sized to fit your volume, since EBS data volumes have a maximum size of 16 TiB.
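A rough sketch of that one-file-at-a-time pattern, using boto3 to walk the CONUS prefix shown above (anonymous access and the placeholder analysis step are assumptions; adapt as needed):

import os
import boto3
import xarray as xr
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
bucket = "nasanex"
prefix = "NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".nc"):      # skip the .md5 checksum files
            continue
        local_path = os.path.basename(key)
        s3.download_file(bucket, key, local_path)

        # Placeholder analysis step: open the file, pull out whatever
        # subset you need, then close it.
        with xr.open_dataset(local_path) as ds:
            pass  # e.g. ds["tasmax"].isel(time=0)

        os.remove(local_path)            # stay within the EBS volume size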

John Rotenstein