
My results are stored in Amazon S3 in parquet format.

My requirements are as follows:

  1. I have an S3 bucket where I store my results as Parquet (multiple Parquet part files). I want to retrieve the results from all the parts.
  2. I want to retrieve all rows (from all the parts) as they are. (Being able to query them would be nice.)
  3. I want pagination because my environment is not distributed. A single EC2 instance runs Java code to fetch the results, and the results need to be paginated so that the EC2 host does not crash while retrieving them.

Options I looked into:

  1. ListObjectsV2Request: I can't use this yet because we have not upgraded to AWS Java SDK 2.0.

  2. S3 Select: since S3 Select needs the exact key of the object I want to read, I would first have to list all the parts in S3 and then run S3 Select on each part to get the results. I am also not sure how to paginate the input stream that S3 returns.

  3. "Read parquet data from AWS s3 bucket": I looked into this as well, but it is not clear to me how to paginate the results.

Any input/help will be highly appreciated.

abc123
  • Amazon S3 Select would be a great option if it fits your needs. Are you wanting to retrieve the _entire_ contents, or are you wanting to perform some logic to obtain only certain rows? Is the data in a single S3 object, or in multiple objects within a directory? Is your desire to paginate because you think there will be too many results provided (how many rows are there)? What do you mean by first having to "list all the parts from S3"? Feel free to edit your question to add more details. – John Rotenstein May 14 '19 at 00:57
  • Hi @JohnRotenstein I updated the post to answer all your questions. Please let me know if anything is not clear. – abc123 May 14 '19 at 01:44

1 Answer


This sounds like an excellent use-case for Amazon Athena. It can:

  • Read Parquet files
  • Treat multiple files in a directory as a single source of data
  • Allow querying of data to only retrieve desired results (it can also JOIN tables)
  • Return paginated results
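
To make the pagination point concrete: Athena's GetQueryResults operation accepts a MaxResults value and returns a NextToken while more rows remain, so a single host can pull one page at a time. Below is a minimal sketch of that token-driven loop in plain Java. `Page`, `PageSource`, `PagedReader` and the in-memory source are hypothetical names made up for illustration, not part of the AWS SDK; a real `PageSource` would wrap the Athena client call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** One page of rows plus the token for the next page (null when no pages remain). */
final class Page {
    final List<String> rows;
    final String nextToken;

    Page(List<String> rows, String nextToken) {
        this.rows = rows;
        this.nextToken = nextToken;
    }
}

/** Hypothetical abstraction: a real implementation would call Athena's GetQueryResults. */
interface PageSource {
    Page fetch(String token, int maxResults);
}

final class PagedReader {
    /**
     * Fetches pages one at a time and hands each to the handler, so only one
     * page of at most pageSize rows is fetched per call. Returns the total row count.
     */
    static int readAll(PageSource source, int pageSize, Consumer<List<String>> handler) {
        int total = 0;
        String token = null; // a null token requests the first page
        do {
            Page page = source.fetch(token, pageSize);
            handler.accept(page.rows);
            total += page.rows.size();
            token = page.nextToken;
        } while (token != null);
        return total;
    }

    /** In-memory stand-in for a remote service; the token is simply the next start index. */
    static PageSource inMemory(List<String> allRows) {
        return (token, maxResults) -> {
            int start = (token == null) ? 0 : Integer.parseInt(token);
            int end = Math.min(start + maxResults, allRows.size());
            String next = (end < allRows.size()) ? String.valueOf(end) : null;
            return new Page(new ArrayList<>(allRows.subList(start, end)), next);
        };
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        // Print each page of two rows as it arrives, then the total.
        int total = readAll(inMemory(rows), 2, page -> System.out.println(page));
        System.out.println(total + " rows");
    }
}
```

The handler should process or persist each page and then let it go; memory use stays bounded by the page size only if pages are not accumulated.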


John Rotenstein