I am using Data Wrangler (awswrangler) to read parquet datasets. The partition has 300 files, each around 256 MB. I am running a SageMaker processing job on an ml.r5.24xlarge instance, which has 96 vCPUs. The processing job does three tasks:
- Read a parquet file.
- Run the model.
- Write the model output in parquet format.
There are two options for me:
- Read each file individually and process it.
- Read all the files in one go and process them.
With the first option the job takes about 3 hours. Can anyone suggest how I can use multiple threads to read the data in parallel? (A rough sketch of what I mean by option 2 follows.)
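For reference, this is roughly what I mean by option 2: a single wr.s3.read_parquet call on the common S3 prefix, with dataset=True so every file under the prefix is read, and use_threads=True so the reads are parallelised. The prefix inference_prefix below is just a placeholder, and I have not checked whether all 300 files fit comfortably in memory at once on this instance:

import awswrangler as wr

# Placeholder prefix covering all 300 parquet files of the partition.
inference_prefix = 's3://' + databucket + '/some/partition/prefix/'

# dataset=True reads every parquet file under the prefix into one DataFrame;
# use_threads=True lets awswrangler fetch the objects with multiple threads.
all_data = wr.s3.read_parquet(path=inference_prefix, dataset=True, use_threads=True)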
Here is my current code for option 1 (wr is awswrangler, i.e. Data Wrangler):
import logging

import awswrangler as wr
import pandas as pd

# databucket, inference_s3_keys, s3_output_data_dir, constants and
# model_pipeline are defined earlier in the script.
for key in inference_s3_keys[:1]:
    print(key)
    s3fullpath = 's3://' + databucket + '/' + key

    # Read one parquet file; use_threads=True lets awswrangler parallelise the read.
    input_data = wr.s3.read_parquet(path=s3fullpath, use_threads=True)
    print("s3file", s3fullpath)
    logging.info(f"s3file: {s3fullpath}")
    logging.info(f"Size of the Dataframe {len(input_data)}")
    logging.info(f"Dataframe memory: {int(input_data.memory_usage(index=True, deep=True).sum() / 1000000)} MB")
    logging.debug(f"Dataframe value counts {input_data.dtypes.value_counts()}")

    # Cast columns to the dtypes the model expects.
    for col in constants.CATEGORICAL_FEATURES:
        input_data[col] = input_data[col].astype('category')
    for col in constants.NUMERIC_FEATURES:
        logging.info(col)
        input_data[col] = input_data[col].astype('float32')
    input_data['target_1y_label'] = input_data['target_1y_label'].astype('float32')

    # Keep a full copy for the output, drop columns the model does not take.
    model_input_data = input_data.copy()
    input_data = input_data.drop(['cd', 'gd', 'tabel', 'targetass'], axis=1)

    # Run the model and attach the predictions.
    output = model_pipeline.predict(input_data)
    model_input_data['gbm_p1'] = output

    pd.set_option('display.max_columns', None)
    # Write the result back to S3 as a snappy-compressed parquet dataset.
    wr.s3.to_parquet(model_input_data, path=s3_output_data_dir, dataset=True,
                     compression='snappy', use_threads=True)
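I have also been wondering whether I should wrap the per-file work in a ProcessPoolExecutor, roughly as sketched below, so that several files are read and scored at the same time. process_one_file is just a hypothetical wrapper around the body of my loop above, and max_workers=8 is a guess rather than a tuned value; I am not sure whether this is the right approach on 96 vCPUs, or how it interacts with use_threads:

import logging
from concurrent.futures import ProcessPoolExecutor

import awswrangler as wr


def process_one_file(key):
    # Hypothetical wrapper: read one file, run the model, write the output,
    # i.e. the same steps as the loop body above.
    s3fullpath = 's3://' + databucket + '/' + key
    input_data = wr.s3.read_parquet(path=s3fullpath, use_threads=True)
    # ... same dtype casts, model_pipeline.predict(...) and
    # wr.s3.to_parquet(...) as in the loop above ...
    return key


# Score several files concurrently; model_pipeline has to be available in the
# worker processes (e.g. via fork on Linux) for this to work.
with ProcessPoolExecutor(max_workers=8) as executor:
    for done_key in executor.map(process_one_file, inference_s3_keys):
        logging.info(f"finished {done_key}")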
Regards,
Sanjeeb