
I am running PySpark 3.1 in local mode on a Windows machine, in a Jupyter Notebook. I call applyInPandas on a grouped Spark DataFrame.

The function below applies a few data transformations to the input pandas DataFrame and trains an SGBT model on it. It then serializes the trained model to binary and saves it to an S3 bucket as an object. Finally it returns the DataFrame. I call this function on a Spark DataFrame grouped by two columns (see the last line). I receive no error, the returned DataFrame has the same length as the input, and data for every group is returned.

The problem is the saved model objects. Models were saved to S3 for only 2 groups, when there should be one model per group. There is no missing or wrong data point that would cause training to fail (I would get an error or warning in that case). What I have tried so far:

  • Replacing S3 with the local file system: same result.
  • Replacing pickle with joblib and BytesIO: same result.
  • Repartitioning before calling the function: more objects were saved, for different groups, but still not all of them. [I did this by calling "val_large_df.coalesce(1).groupby('la..." in the last line.]

So I suspect this is related to parallelism and distribution, but I could not figure it out. Thanks in advance.

import pickle
import boto3
from sklearn.ensemble import GradientBoostingRegressor

def train_sgbt(pdf):
    ##Some data transformations here##
    #Train the model
    sgbt_mdl = GradientBoostingRegressor(--Params.--).fit(--Params.--)
    sgbt_mdl_b = pickle.dumps(sgbt_mdl)  #Serialize the trained model
    #Initiate s3_client
    s3_client = boto3.client(--Params.--)
    #Put the model object in S3; the key is built from the group's column values
    s3_client.put_object(Body=sgbt_mdl_b, Bucket='my-bucket-name',
        Key="models/BT_" + str(pdf.latGroup_m.iloc[0]) + "_" + str(pdf.lonGroup_m.iloc[0]) + ".mdl")
    return pdf

dummy_df = val_large_df.groupby("latGroup_m", "lonGroup_m").applyInPandas(train_sgbt,
    schema="fcast_error double")
dummy_df.show()

1 Answer


Spark evaluates dummy_df lazily, so train_sgbt is only called for the groups that are required to complete the Spark action.

The Spark action here is show(). It prints only the first 20 rows, so train_sgbt is only called for the groups that have at least one element among those first 20 rows. Spark may evaluate more groups, but there is no guarantee of it.

One way to solve the problem would be to call an action that has to materialize every row, for example writing the result out as csv.
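For example (a minimal sketch; the output path and the count() alternative are illustrative assumptions, not part of the original code), either of the following actions forces train_sgbt to run for every group:

# Writing the full result out materializes every row, so every group is processed.
# The local output path is a placeholder; any writable location works.
dummy_df.write.mode("overwrite").csv("file:///C:/tmp/dummy_df_out")

# Alternatively, count() must also evaluate every group, because Spark cannot
# know how many rows applyInPandas produces without running the function.
print(dummy_df.count())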

werner