I have a big JSON document stored in S3 with a structure like this:
{ "result": {
"id": "123",
"commits": ["comm1", "comm2", ..., "commN"]
}
}
There are other fields there too, and the number of commits can go into the thousands.
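In case it helps with reproduction, an object of a similar shape can be generated like this (the bucket/key and the dummy commit-hash-like strings are made up, not my real data):

# generate ~50k dummy commit-hash-like strings, gzip the document, and upload it
jq -nc '{result: {id: "123", commits: [range(50000) | "0000000000000000000000000000000000" + tostring]}}' \
  | gzip > result.json.gz
aws s3 cp result.json.gz s3://my-bucket/path/result.json.gz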
When I query the object with S3 Select like this, it only gives me about 20 commits, not the thousands I expect.
aws s3api select-object-content \
--bucket my-bucket --key "path/result.json.gz" \
--expression "select res.id, res.commits from S3Object[*].result res" \
--expression-type 'SQL' \
--input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "GZIP"}' \
--output-serialization '{"JSON": {}}' /dev/stdout | jq
it returns:
{
"id": "b3496828e23f051c1f8c0ec9a670423e36710d4c",
"commits": [
"3d6687d0f2a730a4ba38d6168c6beb9e7b1ca6c2",
"70eb3bd892ee57ee83c784885bac251712e4bf44",
"e572935c76f94a1762c5910b17dd175db408007d",
"daf2dc6ecee7b532e9e7ceca49a59e8bb272720c",
"3518b2891ae90ec746a534af7c3ba70a8ed3e7d3",
"96664157a561874818ceb28a011433b5d591d7a7",
"c9b8d33ed435e5c4beb136af0561ec648c2de562",
"61f0260b50574ff77c6d9259e5fe90124804e8d2",
"a5bcbe673c6a09be84b308221dc225e9a71f160b",
"c0003d8c85e4e545dc63dd0857660506dfe51eeb",
"c6133dbddfdbd38b64ef6d8eaa9526c5f49ceca8",
"f607ff7504bd2b3e0190075fcf43f5c5f2943763",
"76295d185bba93c456c7840be94bdee44cf7521e",
"332ccf2976faa5f33fe39216108cbd61c376b049",
"5df9c9fad72e51f851da446a2a64003efe1641e1",
"c3433e8073883917004e16cea632837d8b7e11d0",
"18d10f27cbf2e5a73b24bc6f9986b99b4d130b6e",
"a09944120f0dd0fb3c8ad1927fe0555a28864314",
"958e65ea27eaefbabb83ddcf4393baa29efba9ef",
"ae4cbb9158d5983ae088191da1bac4c1c06b3c19"
]
}
The object is compressed with gzip; it's about 4 MB compressed and 69 MB decompressed. Reading the docs, this looks like a limitation of S3 Select, but it's not clear to me what "record" means there: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html#selecting-content-from-objects-requirements-and-limits
"The maximum length of a record in the input or result is 1 MB."
I'm not getting any error, just more or less arbitrarily trimmed data. When I query other fields of the input JSON (not shown in the sample above), I also get truncated data, just with a different number of elements.
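If "record" means the whole matched result object, I can at least measure its serialized size locally with something like this (my own sanity check, streaming the object to stdout; not something from the docs):

# download, decompress, and count the bytes of the compactly serialized "result" record
aws s3 cp s3://my-bucket/path/result.json.gz - | gunzip | jq -c '.result' | wc -c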
It is confusing, notably because I found another question where they claim to get an "OverMaxRecordSize" error when they go over the limit.
I asked ChatGPT about this and its answer looked promising at first, but I quickly realized it must be hallucinating (at least, I couldn't find anything like --payload, MaxPayloadInBytes, or even aws s3 select in the official docs):
It's possible that you are hitting a limit on the amount of data that can be returned by S3 Select, which is 1 MB by default. This means that if your array of thousands of elements exceeds 1 MB in size, only the first 1 MB of the array will be returned by S3 Select. To get around this, you can increase the maximum payload size using the --payload option when running the S3 Select query. For example, to increase the payload size to 5 MB, you can run the following command:
aws s3 select --bucket your-bucket --key your-key --expression "SELECT your_array FROM S3Object" --input-serialization '{ "CompressionType": "GZIP", "JSON": { "Type": "DOCUMENT" } }' --output-serialization '{"JSON": {"RecordDelimiter": "\n"}}' --payload '{"S3SelectParameters": {"MaxPayloadInBytes": 5000000}}'
Note that increasing the payload size may increase the query execution time and consume more memory. You may need to experiment with different payload sizes to find the optimal balance between query performance and result size.
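As a fallback I can presumably just download the whole object and filter it client-side, roughly like this:

# fetch the full object, decompress it, and extract the fields with jq locally
aws s3 cp s3://my-bucket/path/result.json.gz - | gunzip | jq '.result | {id, commits}'

But I'd still like to understand whether S3 Select can return the full commits array here, or at least fail with an explicit error instead of silently truncating it.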