I am processing approximately 19,710 directories containing IIS log files in an Azure Synapse Spark notebook. There are 3 IIS log files in each directory. The notebook reads the 3 files located in the directory and converts them from text delimited to Parquet. No partitioning. But occasionally I get the following two errors for no apparent reason.
{
"errorCode": "2011",
"message": "An error occurred while sending the request.",
"failureType": "UserError",
"target": "Call Convert IIS To Raw Data Parquet",
"details": []
}
When I get the error above all of the data was successfully written to the appropriate folder in Azure Data Lake Storage Gen2.
{
"errorCode": "6002",
"message": "(3,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(4,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(12,13): error CS0103: The name 'spark' does not exist in the current context",
"failureType": "UserError",
"target": "Call Convert IIS To Raw Data Parquet",
"details": []
}
When I get the error above none of the data was successfully written to the appropriate folder in Azure Data Lake Storage Gen2.
In both cases you can see that the notebook did run for a period of time. I have enabled 1 retry on the spark notebook, it is a pyspark notebook that does python for the parameters with the remainder of the logic using C# %%csharp. The spark pool is small (4 cores/ 32GB) with 5 nodes.
The only conversion going on in the notebook is converting a string column to a timestamp.
var dfConverted = dfparquetTemp.WithColumn("Timestamp",Col("Timestamp").Cast("timestamp"));
When I say this is random the pipeline is currently running and after processing 215 directories there are 2 of the first failure and one of the second.
Any ideas or suggestions would be appreciated.