I am sending JSON telemetry data from Azure Stream Analytics to Azure Data Lake Gen2 serialized as .parquet files. From the data lake I've then created a view in my Azure Synapse Serverless SQL pool that I can connect to and query data for reports.
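For context, the view is defined roughly like this (the wildcard path and view/column names here are illustrative, not my exact definition):

```sql
-- Sketch of the serverless SQL pool view over the parquet output.
-- The folder wildcards mirror the yyyy/MM/dd/HH path Stream Analytics writes.
CREATE VIEW dbo.DeviceTelemetry AS
SELECT
    corporationId,
    deviceId,
    version,
    dateTimeStamp,
    data
FROM OPENROWSET(
    BULK 'https://test123.dfs.core.windows.net/devicetelemetry/*/*/*/*/*.parquet',
    FORMAT = 'PARQUET'
) AS telemetry;
```

Reports then query `dbo.DeviceTelemetry` directly, which is when the error below surfaces.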
Every once in a while I will run a query and get back the following error:
Error handling external file: 'Invalid metadata in parquet file. Number of rows in metadata does not match actual number of rows in parquet file.'. File/External table name: 'https://test123.dfs.core.windows.net/devicetelemetry/2021/12/03/20/-1875592941_d9a0239529f04e1eb587b83d50bbb590_1.parquet'.
When I try to open the .parquet file in question with "Apache Parquet Viewer" or any other parquet viewer, it refuses to open the file with an error complaining that the column named 'data' is undefined. The data column is a JSON string containing various sensor readings from IoT devices.
My Stream Analytics query that sends data to my ADLS looks like this:
SELECT
    Tel.identities.corporationId AS corporationId,
    Tel.identities.deviceId,
    Tel.deviceTelemetry.version,
    Tel.deviceTelemetry.TimeStamp AS dateTimeStamp,
    Tel.deviceTelemetry.data
INTO
    deviceTelemetryADLS
FROM
    data AS Tel
WHERE Tel.deviceTelemetry.data IS NOT NULL
Question
Shouldn't the WHERE clause used here ensure that the data field is always present, so that rows without it are never sent? If so, how else could my .parquet files be getting corrupted?