I am running PySpark 3.1 in local mode on a Windows machine, from a Jupyter Notebook. I call "applyInPandas" on a Spark DataFrame.
The function below applies a few data transformations to the input Pandas DataFrame and trains an SGBT model. It then serializes the trained model into binary and saves it to an S3 bucket as an object. Finally, it returns the DataFrame. I call this function on a Spark DataFrame grouped by two columns in the last line. I receive no error, the returned DataFrame is the same length as the input, and data for each group is returned.
The problem is the saved model objects. Objects are saved in S3 for only 2 groups, while there should be a model for every group. There is no missing or wrong data point that would cause model training to fail (I would receive an error or a warning anyway). What I have tried so far:
- Replacing S3 and saving to the local file system: the same result.
- Replacing "pickle" with "joblib" and "BytesIO": the same result.
- Repartitioning before calling the function: now more objects were saved, for different groups, but still not all of them. [I did this by calling "val_large_df.coalesce(1).groupby('la..." in the last line.] Sketches of these two attempts are below.
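For reference, the joblib/BytesIO serialization and the repartition call looked roughly like this (a sketch only, with the same parameters elided as in the full code further down):

# Attempt: serialize with joblib into an in-memory buffer instead of pickle
from io import BytesIO
import joblib

buf = BytesIO()
joblib.dump(sgbt_mdl, buf)
buf.seek(0)
s3_client.put_object(Body=buf.getvalue(), Bucket='my-bucket-name',
                     Key="models/BT_"+str(pdf.latGroup_m[0])+"_"+str(pdf.lonGroup_m[0])+".mdl")

# Attempt: collapse to a single partition before grouping
dummy_df = val_large_df.coalesce(1).groupby("latGroup_m", "lonGroup_m").applyInPandas(train_sgbt,
    schema="fcast_error double")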
So I suspect this is about parallelism and distribution, but I could not figure it out. Thank you in advance.
import pickle

import boto3
from sklearn.ensemble import GradientBoostingRegressor


def train_sgbt(pdf):
    ## Some data transformations here ##

    # Train the model
    sgbt_mdl = GradientBoostingRegressor(--Params.--).fit(--Params.--)
    # Serialize the trained model
    sgbt_mdl_b = pickle.dumps(sgbt_mdl)

    # Initiate the S3 client
    s3_client = boto3.client(--Params.--)

    # Put the serialized model in S3, keyed by the group's lat/lon values
    s3_client.put_object(Body=sgbt_mdl_b, Bucket='my-bucket-name',
                         Key="models/BT_"+str(pdf.latGroup_m[0])+"_"+str(pdf.lonGroup_m[0])+".mdl")
    return pdf


dummy_df = val_large_df.groupby("latGroup_m", "lonGroup_m").applyInPandas(train_sgbt,
    schema="fcast_error double")
dummy_df.show()