I have a big JSON document stored in S3 with a structure like this:
{ "result": {
"id": "123",
"commits": ["comm1", "comm2", ..., "commN"]
}
}
There are other fields there too, and the number of commits can go into the thousands.
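In case it helps with reproduction, an object of a similar shape can be generated like this (the bucket/key and the dummy commit-hash-like strings are made up, not my real data):

# generate ~50k dummy commit-hash-like strings, gzip the document, and upload it
jq -nc '{result: {id: "123", commits: [range(50000) | "0000000000000000000000000000000000" + tostring]}}' \
  | gzip > result.json.gz
aws s3 cp result.json.gz s3://my-bucket/path/result.json.gz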
When I query the object with S3 Select like this, it only gives me about 20 commits, not the thousands I expect.
aws s3api select-object-content \
--bucket my-bucket --key "path/result.json.gz" \
--expression "select res.id, res.commits from S3Object[*].result res" \
--expression-type 'SQL' \
--input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "GZIP"}' \
--output-serialization '{"JSON": {}}' /dev/stdout | jq
it returns:
{
"id": "b3496828e23f051c1f8c0ec9a670423e36710d4c",
"commits": [
"3d6687d0f2a730a4ba38d6168c6beb9e7b1ca6c2",
"70eb3bd892ee57ee83c784885bac251712e4bf44",
"e572935c76f94a1762c5910b17dd175db408007d",
"daf2dc6ecee7b532e9e7ceca49a59e8bb272720c",
"3518b2891ae90ec746a534af7c3ba70a8ed3e7d3",
"96664157a561874818ceb28a011433b5d591d7a7",
"c9b8d33ed435e5c4beb136af0561ec648c2de562",
"61f0260b50574ff77c6d9259e5fe90124804e8d2",
"a5bcbe673c6a09be84b308221dc225e9a71f160b",
"c0003d8c85e4e545dc63dd0857660506dfe51eeb",
"c6133dbddfdbd38b64ef6d8eaa9526c5f49ceca8",
"f607ff7504bd2b3e0190075fcf43f5c5f2943763",
"76295d185bba93c456c7840be94bdee44cf7521e",
"332ccf2976faa5f33fe39216108cbd61c376b049",
"5df9c9fad72e51f851da446a2a64003efe1641e1",
"c3433e8073883917004e16cea632837d8b7e11d0",
"18d10f27cbf2e5a73b24bc6f9986b99b4d130b6e",
"a09944120f0dd0fb3c8ad1927fe0555a28864314",
"958e65ea27eaefbabb83ddcf4393baa29efba9ef",
"ae4cbb9158d5983ae088191da1bac4c1c06b3c19"
]
}
The object is compressed with gzip; it's about 4 MB compressed and 69 MB decompressed. Reading the docs, this looks like a limitation of S3 Select, but it's not clear to me what "record" means there: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html#selecting-content-from-objects-requirements-and-limits
"The maximum length of a record in the input or result is 1 MB."
I'm not getting any error, just more or less arbitrarily trimmed data. When I query other fields of the input JSON (not shown in the sample above), I also get truncated data, just with a different number of elements.
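If "record" means the whole matched result object, I can at least measure its serialized size locally with something like this (my own sanity check, streaming the object to stdout; not something from the docs):

# download, decompress, and count the bytes of the compactly serialized "result" record
aws s3 cp s3://my-bucket/path/result.json.gz - | gunzip | jq -c '.result' | wc -c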
It is confusing, notably because I found another question where they claim to get an "OverMaxRecordSize" error when they go over the limit.
I asked ChatGPT about this and its answer looked promising at first, but I quickly realized it must be hallucinating (at least, I couldn't find anything like --payload, MaxPayloadInBytes, or even aws s3 select in the official docs):
It's possible that you are hitting a limit on the amount of data that can be returned by S3 Select, which is 1 MB by default. This means that if your array of thousands of elements exceeds 1 MB in size, only the first 1 MB of the array will be returned by S3 Select. To get around this, you can increase the maximum payload size using the --payload option when running the S3 Select query. For example, to increase the payload size to 5 MB, you can run the following command:
aws s3 select --bucket your-bucket --key your-key --expression "SELECT your_array FROM S3Object" --input-serialization '{ "CompressionType": "GZIP", "JSON": { "Type": "DOCUMENT" } }' --output-serialization '{"JSON": {"RecordDelimiter": "\n"}}' --payload '{"S3SelectParameters": {"MaxPayloadInBytes": 5000000}}'
Note that increasing the payload size may increase the query execution time and consume more memory. You may need to experiment with different payload sizes to find the optimal balance between query performance and result size.
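As a fallback I can presumably just download the whole object and filter it client-side, roughly like this:

# fetch the full object, decompress it, and extract the fields with jq locally
aws s3 cp s3://my-bucket/path/result.json.gz - | gunzip | jq '.result | {id, commits}'

But I'd still like to understand whether S3 Select can return the full commits array here, or at least fail with an explicit error instead of silently truncating it.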