I am using Data Wrangler (awswrangler) to read parquet datasets. The partition has 300 files, each around 256 MB. I am running a SageMaker processing job on an ml.r5.24xlarge instance, which has 96 vCPUs. The processing job does three tasks:
- Read a parquet file.
- Run the model.
- Write the model output in parquet format.
There are two options for me:
- Read each file individually and process it.
- Read all the files in one go and process them.
With the first option the job takes about 3 hours. Can anyone suggest how I can use multiple threads to read the data in parallel? (A rough sketch of what I mean by option 2 follows.)
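For reference, this is roughly what I mean by option 2: a single wr.s3.read_parquet call on the common S3 prefix, with dataset=True so every file under the prefix is read, and use_threads=True so the reads are parallelised. The prefix inference_prefix below is just a placeholder, and I have not checked whether all 300 files fit comfortably in memory at once on this instance:

import awswrangler as wr

# Placeholder prefix covering all 300 parquet files of the partition.
inference_prefix = 's3://' + databucket + '/some/partition/prefix/'

# dataset=True reads every parquet file under the prefix into one DataFrame;
# use_threads=True lets awswrangler fetch the objects with multiple threads.
all_data = wr.s3.read_parquet(path=inference_prefix, dataset=True, use_threads=True)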
Here is my current code for option 1 (wr is awswrangler, i.e. Data Wrangler):
import logging

import awswrangler as wr
import pandas as pd

# databucket, inference_s3_keys, s3_output_data_dir, constants and
# model_pipeline are defined earlier in the script.
for key in inference_s3_keys[:1]:
    print(key)
    s3fullpath = 's3://' + databucket + '/' + key

    # Read one parquet file; use_threads=True lets awswrangler parallelise the read.
    input_data = wr.s3.read_parquet(path=s3fullpath, use_threads=True)
    print("s3file", s3fullpath)
    logging.info(f"s3file: {s3fullpath}")
    logging.info(f"Size of the Dataframe {len(input_data)}")
    logging.info(f"Dataframe memory: {int(input_data.memory_usage(index=True, deep=True).sum() / 1000000)} MB")
    logging.debug(f"Dataframe value counts {input_data.dtypes.value_counts()}")

    # Cast columns to the dtypes the model expects.
    for col in constants.CATEGORICAL_FEATURES:
        input_data[col] = input_data[col].astype('category')
    for col in constants.NUMERIC_FEATURES:
        logging.info(col)
        input_data[col] = input_data[col].astype('float32')
    input_data['target_1y_label'] = input_data['target_1y_label'].astype('float32')

    # Keep a full copy for the output, drop columns the model does not take.
    model_input_data = input_data.copy()
    input_data = input_data.drop(['cd', 'gd', 'tabel', 'targetass'], axis=1)

    # Run the model and attach the predictions.
    output = model_pipeline.predict(input_data)
    model_input_data['gbm_p1'] = output

    pd.set_option('display.max_columns', None)
    # Write the result back to S3 as a snappy-compressed parquet dataset.
    wr.s3.to_parquet(model_input_data, path=s3_output_data_dir, dataset=True,
                     compression='snappy', use_threads=True)
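I have also been wondering whether I should wrap the per-file work in a ProcessPoolExecutor, roughly as sketched below, so that several files are read and scored at the same time. process_one_file is just a hypothetical wrapper around the body of my loop above, and max_workers=8 is a guess rather than a tuned value; I am not sure whether this is the right approach on 96 vCPUs, or how it interacts with use_threads:

import logging
from concurrent.futures import ProcessPoolExecutor

import awswrangler as wr


def process_one_file(key):
    # Hypothetical wrapper: read one file, run the model, write the output,
    # i.e. the same steps as the loop body above.
    s3fullpath = 's3://' + databucket + '/' + key
    input_data = wr.s3.read_parquet(path=s3fullpath, use_threads=True)
    # ... same dtype casts, model_pipeline.predict(...) and
    # wr.s3.to_parquet(...) as in the loop above ...
    return key


# Score several files concurrently; model_pipeline has to be available in the
# worker processes (e.g. via fork on Linux) for this to work.
with ProcessPoolExecutor(max_workers=8) as executor:
    for done_key in executor.map(process_one_file, inference_s3_keys):
        logging.info(f"finished {done_key}")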
Regards,
Sanjeeb