TL;DR
The first possible parquet timestamp, 0001-01-01 00:00:00.000,
cannot be loaded with Azure Synapse Serverless SQL Pools, whether as DATETIME2
or as any other type. I'm baffled, because the DATETIME2
docs say that this value is comfortably within the acceptable range:
Property     Value
Date range   0001-01-01 through 9999-12-31 (January 1, 1 CE through December 31, 9999 CE)
Time range   00:00:00 through 23:59:59.9999999
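And the value itself really is a legal DATETIME2; this plain T-SQL sanity check (nothing to do with parquet) runs without complaint:

-- The minimum parquet timestamp is inside DATETIME2's documented range.
SELECT CAST('0001-01-01 00:00:00.000' AS DATETIME2) AS min_ts;
-- Returns 0001-01-01 00:00:00.0000000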
User @RandolphWest found this relevant thread, which hits the same error message, though there the blame is laid on Azure Data Factory (ADF).
Reproducible Example
Here's how to make a parquet file that reproduces the error. Alternatively, here's a Dropbox link to the file.
Make a parquet file
from pyspark.sql.functions import to_timestamp

# One-row DataFrame holding the minimum parquet timestamp as a string,
# cast to a proper timestamp column.
dfx = (
    spark.createDataFrame(
        data=[("1", "0001-01-01 00:00:00.000")],
        schema=["id", "my_ts"],
    )
    .withColumn("my_ts", to_timestamp("my_ts"))
)

# Write to ADLS and show the result.
dfx.write.parquet("abfss://<MY-CONTAINER>@<MY-WORKSPACE>.dfs.core.windows.net/dfx")
display(dfx)
+---+-------------------+
| id| my_ts|
+---+-------------------+
| 1|0001-01-01 00:00:00|
+---+-------------------+
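One variable worth isolating is the physical encoding. Spark writes parquet timestamps as INT96 by default, and readers that convert INT96 to 64-bit nanoseconds can overflow on dates this old; the newer annotated microsecond type might behave differently. A hedged sketch (the dfx_micros path is made up, and whether this sidesteps the Synapse error is untested here):

# Rewrite the same one-row DataFrame using the annotated microsecond
# type instead of Spark's default INT96 encoding.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
dfx.write.mode("overwrite").parquet(
    "abfss://<MY-CONTAINER>@<MY-WORKSPACE>.dfs.core.windows.net/dfx_micros"
)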
Load a parquet file
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://<MY-WORKSPACE>.dfs.core.windows.net/<MY-CONTAINER>/dfx/**',
FORMAT = 'PARQUET'
) AS [result]
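As a diagnostic, you can also pin the schema explicitly and read the timestamp column as VARCHAR, sidestepping the implicit DATETIME2 mapping (column names are assumed to match the file, and whether this avoids the failure is untested here):

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://<MY-WORKSPACE>.dfs.core.windows.net/<MY-CONTAINER>/dfx/**',
        FORMAT = 'PARQUET'
    )
    WITH (
        id VARCHAR(10),
        my_ts VARCHAR(30) -- avoid the DATETIME2 conversion entirely
    ) AS [result]

Either way, the plain query above produces the error below.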
Error message
Failed to execute query.
Error: Error handling external file: 'Inserting value to batch for column type DATETIME2 failed.
Invalid argument provided.'.
File/External table name: 'https://<MY-WORKSPACE>.dfs.core.windows.net/<MY-CONTAINER>/dfx/part-00015-f98c11bd-ffb4-45de-b505-0aa226183098-c000.snappy.parquet'.
Statement ID: {F3062AA6-DD96-4271-A093-6D24F8E0C2FE} | Query hash: 0x9CCB08D8E9D29322 | Distributed request ID: {D0BA926E-FAB2-47DD-8309-BE91DFAC48F8}.
Total size of data scanned is 1 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes.
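For what it's worth, the file is not corrupt as far as other readers are concerned. A minimal local check with pyarrow, assuming the part file has been downloaded as dfx.snappy.parquet (a hypothetical local name); the coerce_int96_timestamp_unit option matters because the default conversion of INT96 to nanoseconds cannot represent year 1 in 64 bits:

import pyarrow.parquet as pq

# Coerce INT96 timestamps to microseconds so a year-1 value survives the
# read; the default nanosecond conversion would overflow for dates this old.
table = pq.read_table("dfx.snappy.parquet", coerce_int96_timestamp_unit="us")
print(table.schema)       # inspect the logical/physical type of my_ts
print(table.to_pylist())  # one row, my_ts ≈ datetime.datetime(1, 1, 1, 0, 0)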