
I am fetching records from an S3 JSON file using S3 Select. Everything works for me when I fetch data from small JSON files, i.e. around 2 MB (with a record count of around 10,000).

Following is my query

# Fetch one page of records (ids innerStart+1 .. innerStop)
innerStart = 1
innerStop = 100
maximumLimit = 100

query = ("SELECT * FROM s3object r WHERE r.id > " + str(innerStart) +
         " AND r.id <= " + str(innerStop) + " LIMIT " + str(maximumLimit))

r = s3.select_object_content(
    Bucket=cache,
    Key=key + '.json',
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'JSON': {'Type': 'Lines'}, 'CompressionType': 'NONE'},
    OutputSerialization={'JSON': {}},
)
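
For reference, I read the results back from the response event stream roughly like this (a simplified sketch; the `Stats` branch is optional and just reports how many bytes were scanned):

# Collect the selected records; the Payload is an event stream where
# 'Records' events carry the output bytes and 'Stats' carries counters.
records = ''
for event in r['Payload']:
    if 'Records' in event:
        records += event['Records']['Payload'].decode('utf-8')
    elif 'Stats' in event:
        details = event['Stats']['Details']
        print('Bytes scanned:', details['BytesScanned'])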

But when I try to query some records from a large JSON file (i.e. 100 MB, over 578,496 records), I get the error below. I also tried changing the query to fetch only a single record from the large file, and that didn't work either. Is there a scan/character limit for S3 Select?

File "./app/main.py", line 118, in retrieve_from_cache_json OutputSerialization={'JSON': { File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call return self._make_api_call(operation_name, kwargs) File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call raise error_class(parsed_response, operation_name) botocore.exceptions.ClientError: An error occurred (OverMaxRecordSize) when calling the SelectObjectContent operation: The character number in one record is more than our max threshold, maxCharsPerRecord: 1,048,576

Sample JSON file

{
        "id": 1,
        "hostname": "registry.in.",
        "subtype": "A",
        "value": "5.9.139.185",
        "passive_dns_count": "4",
        "count_total": 11,
        "count": 11
    }
    {
        "id": 2,
        "hostname": "registry.ctn.in.",
        "subtype": "A",
        "value": "18.195.87.188",
        "passive_dns_count": "2",
        "count_total": 11,
        "count": 11
    }
    {
        "id": 3,
        "hostname": "registry.in.",
        "subtype": "NS",
        "value": "ns-243.awsdns-30.com.",
        "passive_dns_count": "6",
        "count_total": 11,
        "count": 11
    }
    ...
    ...
  • I think you are hitting this limit: `The maximum length of a record in the input or result is 1 MB.` How big are your records? – Marcin May 28 '21 at 05:29
  • I am sure it's not the result, because I tried fetching only a single record. Do you know what "the maximum length of a record in the input" means? My JSON file has 578,496 records, and I am limiting the record count to 100, but S3 Select still cannot scan the large JSON file. – Dibish May 28 '21 at 05:47
  • I changed the JSON schema to CSV and it worked for me. – Dibish Jun 05 '21 at 11:00

1 Answer


I changed my JSON file to CSV, and CSV Select worked for me. Following is my query:

innerStart = 0
innerStop = 100
maximumLimit = 100

query = ("SELECT * FROM s3Object r WHERE CAST(r.\"id\" AS float) > " + str(innerStart) +
         " AND CAST(r.\"id\" AS float) <= " + str(innerStop) + " LIMIT " + str(maximumLimit))

r = s3.select_object_content(
    Bucket=cache,
    Key='filename' + '.csv',
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}, 'CompressionType': 'NONE'},
    OutputSerialization={'CSV': {}},
)

# Accumulate the returned rows from the event stream
records = ''
for event in r['Payload']:
    if 'Records' in event:
        records += event['Records']['Payload'].decode('utf-8')
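
The JSON-to-CSV conversion itself only takes a few lines. Below is a rough sketch, assuming the flat record layout shown in the question and a local newline-delimited copy of the data; `records.json` and `records.csv` are placeholder names:

import csv
import json

# Sketch: flatten newline-delimited JSON records into a CSV with a header
# row, so S3 Select can be called with FileHeaderInfo: "Use".
fields = ['id', 'hostname', 'subtype', 'value',
          'passive_dns_count', 'count_total', 'count']

with open('records.json', encoding='utf-8') as src, \
        open('records.csv', 'w', newline='', encoding='utf-8') as dst:
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    for line in src:
        if line.strip():
            writer.writerow(json.loads(line))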