
I am using the AWS Java SDK in my application to talk to one of my S3 buckets, which holds objects in JSON format.

A document may look like this:

{
    "a" : dataA,
    "b" : dataB,
    "c" : dataC,
    "d" : dataD,
    "e" : dataE
} 

Now, for a certain document, let's say document1, I need to fetch only the values of fields a and b instead of fetching the entire document.

This sounds like something that wouldn't be possible, because S3 buckets can hold any type of document, not just JSON.

Is this something that is achievable though?

madhead
Kunal gupta
  • As far as I know, S3 just deals with blobs (those could be binary data or text) and as such doesn't provide a means to parse the bucket contents on S3 itself. Thus you'd need to transfer the data somewhere else for parsing and extraction, e.g. a Lambda. Depending on your needs, you might also want to consider a different layout for your buckets (e.g. use smaller or more specific buckets) or use something else, e.g. DynamoDB. – Thomas Apr 29 '21 at 10:42
  • You can query the content of S3 objects. The AWS Java SDK supports that. Please see the link below for reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-select.html – Rodel May 10 '21 at 06:51
  • S3 Select supports only "CSV, JSON, or Parquet format", so you can query as required. The query syntax is similar to SQL. – Pavan Feb 08 '22 at 06:00

1 Answer


That's actually doable with S3 Select. You can run selects like the one you've described, but only against objects in particular formats: JSON, CSV, and Parquet.

Imagine you have a data.json file in the so67315601 bucket in eu-central-1:

{
  "a": "dataA",
  "b": "dataB",
  "c": "dataC",
  "d": "dataD",
  "e": "dataE"
}

First, learn how to select the fields via the S3 Console. Use "Object Actions" → "Query with S3 Select":


AWS Java SDK 1.x

Here is the code to do the select with AWS Java SDK 1.x:

@ExtendWith(S3.class)
class SelectTest {
    @AWSClient(endpoint = Endpoint.class)
    private AmazonS3 client;

    @Test
    void test() throws IOException {
        // LINES: Each line in the input data contains a single JSON object
        // DOCUMENT: A single JSON object can span multiple lines in the input
        final JSONInput input = new JSONInput();
        input.setType(JSONType.DOCUMENT);

        // Configure input format and compression
        final InputSerialization inputSerialization = new InputSerialization();
        inputSerialization.setJson(input);
        inputSerialization.setCompressionType(CompressionType.NONE);

        // Configure output format
        final OutputSerialization outputSerialization = new OutputSerialization();
        outputSerialization.setJson(new JSONOutput());

        // Build the request
        final SelectObjectContentRequest request = new SelectObjectContentRequest();
        request.setBucketName("so67315601");
        request.setKey("data.json");
        request.setExpression("SELECT s.a, s.b FROM s3object s LIMIT 5");
        request.setExpressionType(ExpressionType.SQL);
        request.setInputSerialization(inputSerialization);
        request.setOutputSerialization(outputSerialization);

        // Run the query; the result holds an open connection, so close it when done
        try (SelectObjectContentResult result = client.selectObjectContent(request);
             InputStream stream = result.getPayload().getRecordsInputStream()) {
            // Copy the returned records to stdout
            IOUtils.copy(stream, System.out);
        }
    }
}

The output is:

{"a":"dataA","b":"dataB"}
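
If the list of fields varies at runtime, the expression string can be assembled in code instead of being hard-coded. Here is a minimal sketch, assuming simple top-level field names. The SelectExpressions class and its forFields method are hypothetical helpers for illustration, not part of the AWS SDK, and the name whitelist is my own conservative assumption rather than an S3 Select rule:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SelectExpressions {
    // Conservative whitelist for field names (an assumption of this sketch,
    // not an S3 Select rule); it keeps arbitrary text out of the query
    private static final Pattern FIELD = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Builds a projection such as "SELECT s.a, s.b FROM s3object s". */
    public static String forFields(List<String> fields) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("At least one field is required");
        }
        List<String> columns = new ArrayList<>();
        for (String field : fields) {
            if (!FIELD.matcher(field).matches()) {
                throw new IllegalArgumentException("Unsupported field name: " + field);
            }
            // "s" is the alias given to s3object in the FROM clause
            columns.add("s." + field);
        }
        return "SELECT " + String.join(", ", columns) + " FROM s3object s";
    }

    public static void main(String[] args) {
        // Prints: SELECT s.a, s.b FROM s3object s
        System.out.println(forFields(List.of("a", "b")));
    }
}
```

The resulting string can be passed to setExpression in the request above.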

AWS Java SDK 2.x

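The code for the AWS Java SDK 2.x is more involved: selectObjectContent is exposed on the async client, and the results arrive as an event stream that you have to collect with a response handler. Refer to this ticket for more information.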

@ExtendWith(S3.class)
class SelectTest {
    @AWSClient(endpoint = Endpoint.class)
    private S3AsyncClient client;

    @Test
    void test() throws Exception {
        final InputSerialization inputSerialization = InputSerialization
            .builder()
            .json(JSONInput.builder().type(JSONType.DOCUMENT).build())
            .compressionType(CompressionType.NONE)
            .build();

        final OutputSerialization outputSerialization = OutputSerialization.builder()
            .json(JSONOutput.builder().build())
            .build();

        final SelectObjectContentRequest select = SelectObjectContentRequest.builder()
            .bucket("so67315601")
            .key("data.json")
            .expression("SELECT s.a, s.b FROM s3object s LIMIT 5")
            .expressionType(ExpressionType.SQL)
            .inputSerialization(inputSerialization)
            .outputSerialization(outputSerialization)
            .build();
        final TestHandler handler = new TestHandler();

        client.selectObjectContent(select, handler).get();

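        // The result may arrive as several RECORDS events; concatenate the raw bytes
        // of all of them before decoding so that larger results are not truncated
        // (needs java.io.ByteArrayOutputStream and java.nio.charset.StandardCharsets)
        final ByteArrayOutputStream records = new ByteArrayOutputStream();
        for (SelectObjectContentEventStream event : handler.receivedEvents) {
            if (event.sdkEventType() == SelectObjectContentEventStream.EventType.RECORDS) {
                records.write(((RecordsEvent) event).payload().asByteArray());
            }
        }

        System.out.println(new String(records.toByteArray(), StandardCharsets.UTF_8));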
    }

    private static class TestHandler implements SelectObjectContentResponseHandler {
        private SelectObjectContentResponse response;
        private List<SelectObjectContentEventStream> receivedEvents = new ArrayList<>();
        private Throwable exception;

        @Override
        public void responseReceived(SelectObjectContentResponse response) {
            this.response = response;
        }

        @Override
        public void onEventStream(SdkPublisher<SelectObjectContentEventStream> publisher) {
            publisher.subscribe(receivedEvents::add);
        }

        @Override
        public void exceptionOccurred(Throwable throwable) {
            exception = throwable;
        }

        @Override
        public void complete() {
        }
    }
}
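
S3 Select can deliver a result as several RECORDS events, and a single event may carry only part of a record, so the payloads collected by a handler need to be concatenated before decoding. Here is a minimal, SDK-free sketch of that step, with plain byte arrays standing in for the event payloads. The RecordChunks class is a hypothetical helper, and I am assuming an event boundary may fall anywhere in the byte stream, including inside a multi-byte UTF-8 character:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class RecordChunks {
    /** Concatenates the raw chunks first and decodes them as UTF-8 only once. */
    public static String join(List<byte[]> chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            // This overload of write does not throw IOException
            out.write(chunk, 0, chunk.length);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String record = "{\"a\":\"data\u00C4\"}";
        byte[] all = record.getBytes(StandardCharsets.UTF_8);

        // Cut one byte into the two-byte character U+00C4 to simulate
        // an event boundary that falls in the middle of a character
        int cut = "{\"a\":\"data".getBytes(StandardCharsets.UTF_8).length + 1;
        byte[] first = Arrays.copyOfRange(all, 0, cut);
        byte[] second = Arrays.copyOfRange(all, cut, all.length);

        // Joining the bytes before decoding restores the original record
        System.out.println(join(List.of(first, second)).equals(record));
    }
}
```

Decoding each chunk separately and joining the strings would corrupt the split character, which is why the bytes are joined first.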

As you see, it's possible to make S3 selects programmatically!

You might be wondering what @AWSClient and @ExtendWith(S3.class) are.

They come from aws-junit5, a small library that injects AWS clients into your tests. It can greatly simplify your tests. I am the author. The usage is really simple, so try it in your next project!

madhead
  • Works like a charm. aws-junit sounds like a good read too. I am gonna check it out. :) – Kunal gupta Feb 09 '22 at 16:51
  • Thanks for the post on AWS Java SDK 2.x. It mostly works, but my returned object is truncated. Is there some limit on how much data is returned? – bostonjava Dec 20 '22 at 16:42