S3 select - How can I query by non-standard timestamp comparison

Question

I'm using a S3 bucket where the data is organized into files by an ID & year/month - meaning one file per ID & month. In each (csv.gz) file each record has a timestamp in the format: YYYY-MM-dd HH:mm:ss (note the missing T).

Now, when querying the data I want to support datetime granularity down to seconds so naturally it's desired to filter the data in S3 already prior to managing the data in Python.

I can't however find any method to do this. The function TO_TIMESTAMP doesn't support a user provided format (expects a T date/time separator) and combining SUBSTRING and CAST (CAST(SUBSTRING(my_timestamp_column, 1, 10) AS TIMESTAMP)) yields a The query cannot be evaluated error.

Is there any way around this? The documentation states that the function TO_TIMESTAMP is "the inverse operation of TO_STRING" which is not quite true as the latter supports a time_format_pattern.

Samer A. · Answer 1 · 2020-12-14T22:04:16.507

I think I had to solve the same or similar problem. As you said, the documentation (https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-date.html#s3-glacier-select-sql-reference-to-timestamp) states that the function TO_TIMESTAMP is the inverse operation of TO_STRING. But the documentation to me was misleading, because it does not make clear that the TO_TIMESTAMP function does support time_format_pattern as a second argument. The documentation shows that it only takes one argument, but it can in fact take two.

I was able to convert a non-standard timestamp 20190101T050000.000Z from type string to timestamp like so:

aws s3api select-object-content --bucket foo_bucket --key foo.json.gz --expression "SELECT * FROM s3object s WHERE TO_TIMESTAMP(s.\"timestamp\", 'yMMdd''T''Hmmss.SSS''Z''') < TO_TIMESTAMP('20190101T050000.000Z', 'yMMdd''T''Hmmss.SSS''Z''')" --expression-type 'SQL' --input-serialization '{ "CompressionType": "GZIP","JSON": {"Type": "DOCUMENT"}}' --output-serialization '{"JSON": {"RecordDelimiter": "\n"}}' /dev/shm/foo.json

Hope that helps somebody out.

score 0 · Answer 2 · answered Oct 17 '20 at 21:26

Having same issue over here, I went an step over and change my csv file to grant date field with require format by timestamp date type in S3 Select.The requiere format is described here S3 data types

So first, in order to response the question, based on S3 Select documentation, I think is not possible to work with a date without T at the end. By the time you correct that, you will be able to work with CAST function. Next is what I do:

select * from s3object as s where CAST('2020-01-01T' AS TIMESTAMP) < CAST('2021-01-01T' AS TIMESTAMP)

That works just okay, however as you can see, I'm not passing s."Date" which is the field header in my csv file due to following error:

Attempt to convert from one data type to another failed at line 1, column 39: cast from STRING to TIMESTAMP.

I hope have been help a little bit, and hope someone can help with this error.

S3 select - How can I query by non-standard timestamp comparison

2 Answers2