Azure Synapse Pipeline running Spark Notebook Generates Random Errors

Question

I am processing approximately 19,710 directories containing IIS log files in an Azure Synapse Spark notebook. There are 3 IIS log files in each directory. The notebook reads the 3 files in the directory and converts them from delimited text to Parquet, with no partitioning. Occasionally, though, I get one of the following two errors for no apparent reason.

{
    "errorCode": "2011",
    "message": "An error occurred while sending the request.",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}

When I get the error above, all of the data has been successfully written to the appropriate folder in Azure Data Lake Storage Gen2.

Sometimes I get:

{
    "errorCode": "6002",
    "message": "(3,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(4,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(12,13): error CS0103: The name 'spark' does not exist in the current context",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}

When I get the error above, none of the data has been written to the appropriate folder in Azure Data Lake Storage Gen2.

In both cases you can see that the notebook did run for a period of time. I have enabled 1 retry on the Spark notebook. It is a PySpark notebook that uses Python for the parameters, with the remainder of the logic written in C# using %%csharp. The Spark pool is small (4 cores / 32 GB) with 5 nodes.

The only conversion going on in the notebook is converting a string column to a timestamp.

var dfConverted = dfparquetTemp.WithColumn("Timestamp",Col("Timestamp").Cast("timestamp"));

To show what I mean by random: the pipeline is currently running, and after processing 215 directories there have been 2 occurrences of the first failure and 1 of the second.

Any ideas or suggestions would be appreciated.

Seeing another random error that I will have to investigate after the pipeline finishes `"errorCode": "6002", "message": "[2022-03-02T12:09:41.8223708Z] [vm-18712171] [Error] [JvmBridge] JVM method execution failed: Nonstatic method 'collectToPython' failed for class '37' when called with no arguments\n[2022-03-02T12:09:41.8227074Z] [vm-18712171] [Error] [JvmBridge] java.io.IOException: Stream is corrupted` — bmukes, Mar 02 '22 at 16:17

score 1 · Answer 1 · answered Mar 07 '22 at 15:29

OK, after running for 113 hours (it's almost done) I am still getting the following errors, but it looks like all of the data was written out.

Count 1

{
    "errorCode": "6002",
    "message": "(3,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(4,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(12,13): error CS0103: The name 'spark' does not exist in the current context",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}

Count 1

{
    "errorCode": "6002",
    "message": "Exception: Failed to create Livy session for executing notebook. LivySessionId: 4419, Notebook: Convert IIS to Raw Data Parquet.\n--> LivyHttpRequestFailure: Something went wrong while processing your request. Please try again later. HTTP status code: 500. Trace ID: e0860852-40e6-498f-b2df-4eff9fee504a.",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}

Count 17

{
    "errorCode": "2011",
    "message": "An error occurred while sending the request.",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}

I am not sure what these errors are about, and of course I will rerun the specific data in the pipeline to see whether this is a one-off or keeps occurring on this specific data. But it seems as if these errors are occurring after the data has been written to Parquet format.

score 0 · Answer 2 · answered Mar 03 '22 at 15:39

Well, I think this is part of the issue. Keep in mind that I am writing the main part of the logic in C#, so your mileage in another language may vary. Also, these are space-delimited IIS log files that can be multiple megabytes in size; a single file could be 30 MB.

My new code has been running for 17 hours without a single error. All of the changes I made were to ensure that I disposed of resources that would consume memory. Examples follow:

When reading a delimited text file as a binary file:

    var df = spark.Read().Format("binaryFile").Option("inferSchema", false).Load(sourceFile);
    byte[] rawData = df.First().GetAs<byte[]>("content");

The data in the byte[] eventually gets loaded into a List<GenericRow>, but I never set the variable rawData to null.

After filling the byte[] from the data frame above, I added:

    df.Unpersist();

After loading all of the data from the byte[] into the List<GenericRow> rows and creating a data frame from it using the code below, I cleared out the rows variable.

    var dfparquetTemp = spark.CreateDataFrame(rows, inputSchema);
    rows.Clear();

Finally, after changing a column type and writing out the data, I did an unpersist on the data frame.

    // Cast the string column to a timestamp, then write with the requested save mode.
    var dfConverted = dfparquetTemp.WithColumn("Timestamp", Col("Timestamp").Cast("timestamp"));
    var mode = overwrite ? SaveMode.Overwrite : SaveMode.Append;
    dfConverted.Write().Mode(mode).Parquet(targetFile);
    dfConverted.Unpersist();

Finally, I have most of my logic inside a C# method that gets called in a foreach loop, in the hope that the CLR will dispose of anything else I missed.

And last but not least a lesson learned.

When reading a directory containing multiple Parquet files, it seems that Spark reads all of the files into the data frame.
When reading a directory containing multiple delimited text files that you are treating as binary files, Spark reads only ONE of the files into the data frame.

So in order to process multiple delimited text files out of a folder, I had to pass in the names of the individual files and process the first file with SaveMode.Overwrite and the other files with SaveMode.Append. Every attempt to use any kind of wildcard, or to specify only the directory name, resulted in just one file being read into the data frame. (Trust me here: after hours of GoogleFu I tried every method I could find.)
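The overwrite-then-append ordering described above can be sketched in plain Python, the language the notebook already uses for its parameters. This is only an illustration: the function name plan_save_modes and the sample file names are hypothetical, and the mode strings are assumed to be passed on to whatever code performs the actual write.

```python
from typing import Iterable, List, Tuple


def plan_save_modes(file_names: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each file with a save mode.

    The first file overwrites the target so stale output is replaced;
    every later file appends to what the first one wrote.
    """
    return [
        (name, "overwrite" if index == 0 else "append")
        for index, name in enumerate(file_names)
    ]


# Hypothetical file names for one directory of three IIS logs.
plan = plan_save_modes(["u_ex1.log", "u_ex2.log", "u_ex3.log"])
for name, mode in plan:
    print(f"{name}: {mode}")
```

Deciding the mode up front keeps the write call itself identical for every file, so a rerun of a failed directory starts again with an overwrite and cannot duplicate rows from the earlier attempt.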

Again, 17 hours into processing there has not been one single error, so one important lesson seems to be to keep your memory usage as low as possible.

score 0 · Answer 3 · answered Mar 10 '22 at 15:37

OK, I am adding another answer rather than editing the existing ones. After 113 hours I had 52 errors that I had to reprocess. I found that some of the errors were due to "Kryo serialization failed: Buffer overflow. Available: 0, required: 19938070. To avoid this, increase spark.kryoserializer.buffer.max". After a few hours of GoogleFu, which also included increasing the size of my Spark pool from small to medium (this had no effect), I added this as the first cell in my notebook:

%%configure
{
    "conf":
    {
        "spark.kryoserializer.buffer.max" : "512"
    }
}

This fixed the "Kryo serialization failed" issue, and I believe the larger Spark pool has fixed all of the remaining errors, because everything is now processing successfully. Also, jobs that previously failed after taking 2 hours to run are now completing in 30 minutes. I suspect this speed increase is due to the larger Spark pool's memory. So, lesson learned: do not use the small pool for IIS files.

Finally, something that bugged me. When you type %%configure into an empty cell, Microsoft so unhelpfully puts in the following crap:

%%configure
{
    # You can get a list of valid parameters to config the session from https://github.com/cloudera/livy#request-body.
    "driverMemory": "28g", # Recommended values: ["28g", "56g", "112g", "224g", "400g", "472g"]
    "driverCores": 4, # Recommended values: [4, 8, 16, 32, 64, 80]
    "executorMemory": "28g",
    "executorCores": 4,
    "jars": ["abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/myjar.jar", "wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>/myjar1.jar"],
    "conf":
    {
        # Example of standard spark property, to find more available properties please visit: https://spark.apache.org/docs/latest/configuration.html#application-properties.
        "spark.driver.maxResultSize": "10g",
        # Example of customized property, you can specify count of lines that Spark SQL returns by configuring "livy.rsc.sql.num-rows".
        "livy.rsc.sql.num-rows": "3000"
    }
}

I call it crap because IT HAS COMMENTS IN IT. If you try to just add the one setting you want and leave the rest of the template in place, it will fail because of the comments. JUST BE WARNED.
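A quick way to see why the comments are a problem, assuming the %%configure body is parsed as strict JSON (which the failure suggests, though I have not confirmed how Synapse parses it): Python's standard json module rejects the same text. The helper name is_valid_json and the shortened template below are only for illustration.

```python
import json

# Shortened version of the generated template, still containing a '#' comment.
with_comments = """{
    "conf":
    {
        # Example of standard spark property
        "spark.kryoserializer.buffer.max": "512"
    }
}"""

# The same body with the comment line removed.
without_comments = """{
    "conf":
    {
        "spark.kryoserializer.buffer.max": "512"
    }
}"""


def is_valid_json(text: str) -> bool:
    """Return True if text parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print("with comments:", is_valid_json(with_comments))
print("without comments:", is_valid_json(without_comments))
```

Running a check like this on the cell body before starting a long pipeline is cheaper than finding out from a failed session.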
