I am relatively new to Docker, so please bear with me.
The goal is to ingest large geospatial datasets into Google Earth Engine using an open-source, replicable approach. I got everything working on my local machine and on a Google Compute Engine instance, but I would like to make the approach accessible to others as well.
The large static geospatial files are currently stored on Amazon S3 (NetCDF4) and Google Cloud Storage (GeoTIFF). I need a couple of Python-based modules to convert the data and ingest it into Earth Engine using a command line interface. This only has to happen once. The conversion is not very heavy and can be done by a single fat instance (32 GB RAM, 16 cores, about 2 hours); there is no need for a cluster.
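For context, a minimal sketch of the one-off conversion step as I picture it, calling GDAL's gdal_translate through subprocess (the file paths and NetCDF variable name below are just placeholders):

```python
import subprocess

# Placeholder paths and variable name; GDAL's NETCDF:"file":variable syntax
# selects a single subdataset (variable) from the NetCDF4 file.
src = 'NETCDF:"/data/precipitation.nc":precip'
dst = "/data/precipitation.tif"

# Convert the selected NetCDF subdataset to a GeoTIFF.
subprocess.check_call(["gdal_translate", "-of", "GTiff", src, dst])
```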
My question is how I should deal with these large static datasets in Docker. I thought of the following options, but would like to know the best practice.
1) Use Docker and mount the Amazon S3 and Google Cloud Storage buckets into the Docker container.
2) Copy the large datasets into a Docker image and run it on Amazon ECS.
3) Just use the AWS CLI.
4) Use Boto3 in Python (see the sketch after this list).
5) A fifth option that I am not yet aware of.
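To make option 4 concrete, this is roughly what I have in mind (bucket name and object key are placeholders):

```python
import boto3

# Sketch of option 4: download one NetCDF file from S3 to local disk.
# Bucket name and object key are placeholders.
s3 = boto3.client("s3")
s3.download_file("my-climate-bucket", "netcdf/precipitation.nc", "/data/precipitation.nc")
```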
The Python modules that I use include, among others: python-GDAL, pandas, earth-engine, and subprocess.
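The ingest itself goes through the earthengine command line tool, which as far as I understand reads the GeoTIFF from a Google Cloud Storage bucket; roughly like this (the asset ID and gs:// path are placeholders):

```python
import subprocess

# Sketch of the ingest step: register a GeoTIFF that already sits in
# Google Cloud Storage as an Earth Engine image asset.
# Asset ID and bucket path are placeholders.
subprocess.check_call([
    "earthengine", "upload", "image",
    "--asset_id=users/myuser/precipitation",
    "gs://my-gcs-bucket/precipitation.tif",
])
```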