private final S3Client s3Client = S3Client.builder().build();
private final Gson gson = new Gson();

@Override
public Void handleRequest(SQSEvent event, Context context) {
    for (SQSMessage msg : event.getRecords()) {
        // Each SQS message body carries an S3 event notification as JSON
        S3Event s3Event = gson.fromJson(msg.getBody(), S3Event.class);
        S3EventNotificationRecord s3EventRecord = s3Event.getRecords().get(0);
        String bucketName = s3EventRecord.getS3().getBucket().getName();
        // Keys in event notifications arrive URL-encoded
        String objectKey = URLDecoder.decode(s3EventRecord.getS3().getObject().getKey(), StandardCharsets.UTF_8);
        GetObjectRequest getObjectRequest = GetObjectRequest.builder()
                .bucket(bucketName)
                .key(objectKey) // was .key(bucketName), which asks S3 for the wrong object
                .build();
        InputStream inputStream = s3Client.getObject(getObjectRequest, ResponseTransformer.toBytes()).asInputStream();
        // TODO: read the parquet records from inputStream
    }
    return null;
}
Hi all, I have the above code to read an SQS event containing an S3 event notification, which in turn holds the bucket and key of a dropped file in Parquet format. I want to read the file and push each record to a queue.
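The "push each record to a queue" half is the part I already know how to do; assuming a standard SDK v2 SqsClient and a hypothetical RECORD_QUEUE_URL constant, it would be roughly:

private final SqsClient sqsClient = SqsClient.builder().build();

private void publishRecord(String recordJson) {
    // One SQS message per parquet record, serialized as JSON
    sqsClient.sendMessage(SendMessageRequest.builder()
            .queueUrl(RECORD_QUEUE_URL) // hypothetical queue URL constant
            .messageBody(recordJson)
            .build());
}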
I've gotten as far as grabbing the object and exposing it as an InputStream, but looking around for a Parquet library I haven't found anything that lets me read the Parquet file directly from the InputStream without pulling in Hadoop or Spark.
Is there any other way to get at the individual records in the Parquet file?
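The closest workaround I've come up with is spilling the bytes to a temp file, since Parquet keeps its metadata in a footer at the end of the file, so readers generally want seekable input rather than a plain InputStream. This is only a sketch, and the choice of reader is still open:

// Lambda only guarantees writable space under /tmp, which is where
// Files.createTempFile lands by default in that environment
Path tmp = Files.createTempFile("s3-object-", ".parquet");
Files.write(tmp, s3Client.getObjectAsBytes(getObjectRequest).asByteArray());
// TODO: hand tmp.toFile() to a parquet reader; Hadoop-free libraries I've
// seen mentioned are carpet-record and parquet-floor, but I haven't tried them

If anyone knows a reader that can work straight from a byte array or InputStream, that would avoid the temp file entirely.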