
TL;DR

The first possible parquet timestamp, 0001-01-01 00:00:00.000, cannot be loaded with Azure Synapse Serverless SQL Pools, as DATETIME2 or any other type. I'm baffled, because the DATETIME2 docs say this is within the acceptable range:

Property      Value
Date range    0001-01-01 through 9999-12-31 (January 1, 1 CE through December 31, 9999 CE)
Time range    00:00:00 through 23:59:59.9999999
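
As a sanity check, the literal itself is a perfectly valid DATETIME2 value when cast directly in T-SQL; a minimal probe:

    -- Returns 0001-01-01 00:00:00.0000000 without error:
    -- T-SQL itself accepts the minimum DATETIME2 value.
    SELECT CAST('0001-01-01 00:00:00.000' AS datetime2) AS min_dt2;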

User @RandolphWest found this relevant thread, which hits the same error message but lays the blame on ADF.

Reproducible Example

Here's how to make a file that will reproduce the error. Alternatively, here's a Dropbox link to the file.

Make a parquet file

from pyspark.sql.functions import to_timestamp

# One-row DataFrame holding the minimum parquet timestamp as a string,
# then converted to a proper timestamp column.
dfx = (
    spark.createDataFrame(
        data=[("1", "0001-01-01 00:00:00.000")],
        schema=["id", "my_ts"])
    .withColumn("my_ts", to_timestamp("my_ts"))
)

# Write to ADLS and display the result.
dfx.write.parquet('abfss://<MY-CONTAINER>@<MY-WORKSPACE>.dfs.core.windows.net/dfx')
display(dfx)
+---+-------------------+
| id|              my_ts|
+---+-------------------+
|  1|0001-01-01 00:00:00|
+---+-------------------+
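
To confirm the value survives the round trip through parquet (i.e., the failure is on the read side, not in the file), one option is a quick read-back with pyarrow. This is a sketch, assuming the files have been copied to a local dfx/ directory:

    import pyarrow.parquet as pq

    # Spark writes timestamps as INT96 by default; coerce them to microseconds,
    # since nanosecond precision cannot represent year 1 without overflowing.
    table = pq.read_table("dfx", coerce_int96_timestamp_unit="us")
    print(table.column("my_ts"))  # 0001-01-01 00:00:00 comes back intact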

Load a parquet file

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://<MY-WORKSPACE>.dfs.core.windows.net/<MY-CONTAINER>/dfx/**',
        FORMAT = 'PARQUET'
    ) AS [result]

Error message

Failed to execute query.
Error: Error handling external file: 'Inserting value to batch for column type DATETIME2 failed.
Invalid argument provided.'.
File/External table name: 'https://<MY-WORKSPACE>.dfs.core.windows.net/<MY-CONTAINER>/dfx/part-00015-f98c11bd-ffb4-45de-b505-0aa226183098-c000.snappy.parquet'.
Statement ID: {F3062AA6-DD96-4271-A093-6D24F8E0C2FE} | Query hash: 0x9CCB08D8E9D29322 | Distributed request ID: {D0BA926E-FAB2-47DD-8309-BE91DFAC48F8}.
Total size of data scanned is 1 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes. 
Anders Swanson
  • Did you try with a T in the middle? _a la_ `0001-01-01T00:00:00.000` – Randolph West Nov 08 '21 at 21:40
  • @RandolphWest I like your thinking! However, the string format isn't important; the string gets successfully converted to a timestamp without issue. It's only after the timestamp-typed column (saved to parquet) is loaded into Synapse SQL that the issue occurs. – Anders Swanson Nov 08 '21 at 21:43
  • 1
    Is this related to your issue? https://learn.microsoft.com/en-us/answers/questions/181820/reading-datetime-from-parquet.html. Looks like there's a "no year zero" issue at play. – Randolph West Nov 08 '21 at 21:46
  • 1
    I believe this to be a bug in Synapse. the error message does look to be the same, though they're blaming ADF for the conversion. in this case, I'm using pyspark and parquet out of the box. but great find! def related. – Anders Swanson Nov 08 '21 at 21:51
  • 1
    Ew, this looks like an Excel thing (doesn't it always subtract 2 days when converting a numeric to a date?) – Aaron Bertrand Nov 09 '21 at 18:50
  • 2
    I am also facing this issue. – yuvraj Feb 19 '22 at 12:44
  • 1
    Anyone found as solution? – justsander Sep 30 '22 at 13:31
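
If the "no year zero" / calendar-rebase theory from the comments is what's at play, the relevant Spark 3 setting would be the parquet rebase mode. This is a speculative sketch, not a confirmed fix:

    # Spark 3+ controls how ancient timestamps are encoded in parquet:
    # "CORRECTED" writes proleptic Gregorian values as-is, while "LEGACY"
    # rebases to the hybrid Julian/Gregorian calendar used by Spark 2.x.
    # (Depending on Spark version, the key may carry a "legacy." prefix.)
    spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")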

2 Answers


I had the same error. I solved this issue by saving the column as a string in ADLS (so do not use to_timestamp) and then converting it to datetime2 in the Synapse query. This sounds strange, but it actually works. Using your code, it will be something like this:

SELECT
    CAST(my_ts AS datetime2) AS my_ts
FROM
    OPENROWSET(
        BULK 'https://<MY-WORKSPACE>.dfs.core.windows.net/<MY-CONTAINER>/dfx/**',
        FORMAT = 'PARQUET'
    ) WITH (my_ts nvarchar(27)) AS [result]
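
For completeness, the writer side of this workaround might look like the sketch below, reusing the question's names and simply keeping my_ts as a string:

    # Skip to_timestamp entirely; parquet then stores my_ts as plain text,
    # which the Synapse query above CASTs to datetime2 on read.
    dfx_str = spark.createDataFrame(
        data=[("1", "0001-01-01 00:00:00.000")],
        schema=["id", "my_ts"])
    dfx_str.write.parquet('abfss://<MY-CONTAINER>@<MY-WORKSPACE>.dfs.core.windows.net/dfx')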
Nicola

The answer is that this is a bug in the Synapse serverless pool's parquet reader. I'm not sure whether it has been fixed yet.

Anders Swanson