I am sending JSON telemetry data from Azure Stream Analytics to Azure Data Lake Gen2 serialized as .parquet files. From the data lake I've then created a view in my Azure Synapse Serverless SQL pool that I can connect to and query data for reports.
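For context, the view is defined roughly like this (the wildcard path and view/column names here are illustrative, not my exact definition):

```sql
-- Sketch of the serverless SQL pool view over the parquet output.
-- The folder wildcards mirror the yyyy/MM/dd/HH path Stream Analytics writes.
CREATE VIEW dbo.DeviceTelemetry AS
SELECT
    corporationId,
    deviceId,
    version,
    dateTimeStamp,
    data
FROM OPENROWSET(
    BULK 'https://test123.dfs.core.windows.net/devicetelemetry/*/*/*/*/*.parquet',
    FORMAT = 'PARQUET'
) AS telemetry;
```

Reports then query `dbo.DeviceTelemetry` directly, which is when the error below surfaces.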
Every once in a while I will run a query and get back the following error:
Error handling external file: 'Invalid metadata in parquet file. Number of rows in metadata does not match actual number of rows in parquet file.'. File/External table name: 'https://test123.dfs.core.windows.net/devicetelemetry/2021/12/03/20/-1875592941_d9a0239529f04e1eb587b83d50bbb590_1.parquet'.
When I try to open the .parquet file in question with "Apache Parquet Viewer" or any other parquet viewer, it refuses to open the file with an error complaining that the column named 'data' is undefined. The data column is a JSON string containing various sensor readings from IoT devices.
My Stream Analytics query that sends data to my ADLS looks like this:
SELECT
    Tel.identities.corporationId AS corporationId,
    Tel.identities.deviceId,
    Tel.deviceTelemetry.version,
    Tel.deviceTelemetry.TimeStamp AS dateTimeStamp,
    Tel.deviceTelemetry.data
INTO
    deviceTelemetryADLS
FROM
    data AS Tel
WHERE Tel.deviceTelemetry.data IS NOT NULL
Question
Shouldn't the WHERE clause used here ensure that the data field is always present, so that rows without it are never sent? If so, how else could my .parquet files be getting corrupted?