
I want to read a selected list of Parquet files from AWS S3. I know how to read all files in a directory using a wildcard such as *.parquet, or a single file by specifying just that key. However, I would like to read only a specific list of files based on some prior user input.

Is this possible?

The following code is from the Dask API docs but does not address my requirement:

import dask.dataframe as dd

df = dd.read_parquet('s3://bucket/path/to/data-*.parquet')
# or
df = dd.read_parquet('s3://bucket/path/to/file.parquet')

Is there a way to pass in a list of target files in the read_parquet parameters instead?

Gowthaman

1 Answer


Using Boto3, find all the object keys, filter them down to the ones you require, collect those into a list, and pass them (in a for loop, or as one list) to the DataFrame; see the sketch below.
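A minimal sketch of the Boto3 route, assuming a hypothetical bucket 'my-bucket', prefix 'path/to/', and that the user's selection arrives as a set of file names; the filtered keys are handed to dd.read_parquet as a list, which (as the comment below notes) Dask accepts:

import boto3
import dask.dataframe as dd

s3 = boto3.client('s3')
selected = {'data-2019-01.parquet', 'data-2019-03.parquet'}  # hypothetical user input

keys = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='path/to/'):
    for obj in page.get('Contents', []):
        # keep only the objects the user asked for
        if obj['Key'].split('/')[-1] in selected:
            keys.append(obj['Key'])

# the selected files are combined into one Dask DataFrame
df = dd.read_parquet(['s3://my-bucket/' + k for k in keys])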

Using s3fs you can list objects much like you would on Linux; store the object names in a list and pass them to the DataFrame one by one in a for loop, or hand over the whole list at once. A sketch follows.
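The same idea with s3fs, again under the hypothetical bucket, prefix, and file-name selection used above; fs.ls returns keys prefixed with the bucket name, so only the s3:// scheme needs to be added back before calling read_parquet:

import s3fs
import dask.dataframe as dd

fs = s3fs.S3FileSystem()
selected = {'data-2019-01.parquet', 'data-2019-03.parquet'}  # hypothetical user input

# fs.ls lists keys like 'my-bucket/path/to/data-2019-01.parquet'
paths = [p for p in fs.ls('my-bucket/path/to/')
         if p.split('/')[-1] in selected]

df = dd.read_parquet(['s3://' + p for p in paths])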

More on getting specific objects with Boto3: Boto3: grabbing only selected objects from the S3 resource

Source for s3fs: https://medium.com/swlh/using-s3-just-like-a-local-file-system-in-python-497737783f11

EngineJanwaar
  • I am working on the UI right now. Will try your solution towards the end of this week and respond accordingly. Thanks a lot though. – Gowthaman Aug 26 '19 at 12:11
  • Your answer was a good start for me. However, when using the Dask framework we can pass the whole list as an input parameter to create a single, comprehensive Dask DataFrame. – Gowthaman Sep 03 '19 at 04:38