2

I have a csv file in the format

IDATE_TIMESTAMP,OPEN,HIGH,LOW,CLOSE,VOLUME
1535535060,94.36,94.36,94.36,94.36,1
1535535120,94.36,94.36,93.8,93.8,1
1535535180,93.8,93.8,93.8,93.8,0
1535535240,93.8,93.8,93.74,93.74,1
1535535300,93.74,93.74,93.74,93.74,0
1535535360,93.74,93.74,93.74,93.74,0
1535535420,93.74,93.74,93.74,93.74,0
1535535480,93.74,93.74,93.74,93.74,0
1535535540,93.74,93.74,93.74,93.74,0
.
.
.
.

I have to and from timestamp which will filter out the data from the file and return the output. I am using python + boto3 for s3 select.

fromTs = "1535535480"
toTs = "1535535480"
query = """SELECT * FROM s3object s WHERE s."IDATE_TIMESTAMP" >= "%s" AND s."IDATE_TIMESTAMP" <= "%s" """%(fromTs, toTs)
request = client.select_object_content(
        Bucket=bucket,
        Key=filename,
        ExpressionType="SQL",
        Expression=query,
        InputSerialization={"CSV":{"FileHeaderInfo":"Use", "FieldDelimiter":",", "RecordDelimiter":"\n"}},
        OutputSerialization={"CSV":{}},
    )

botocore.exceptions.ClientError: An error occurred (MissingHeaders) when calling the SelectObjectContent operation: Some headers in the query are missing from the file. Please check the file and try again.

This is error i am getting

2 Answers2

0

I know this is a bit late and might not be the solution to your issue, but I was having a similar one.

My issue turned out to be that I was attempting to perform an S3 Select on an object with UTF-8-BOM encoding, rather than just UTF-8. It turns out that the 3 byte BOM header was being interpreted as part of the first field in the CSV object, essentially corrupting the first column name.

As a result, rather than "IDATE_TIMESTAMP", the first column would be seen by the S3 Select call as "xxxIDATE_TIMESTAMP", causing an error when your expected column is "missing".

KC54
  • 231
  • 4
  • 8
0

The timestamp columns need to be casted as int. The following query:

fromTs = 1535535480
toTs = 1535535480
SELECT * FROM s3object s 
where cast(s.IDATE_TIMESTAMP as int) >= {} 
AND cast(s.IDATE_TIMESTAMP as int) <= {}".format(fromTs, toTs) 

would work in python 3.

Mohnish
  • 1,010
  • 1
  • 12
  • 20
arp5
  • 169
  • 10