
I am using AWS Data Wrangler (awswrangler) to write a pandas DataFrame to Parquet files in an S3 bucket, and I am trying to get a Hive-style folder structure of:

prefix
- year=2022
-- month=08
--- day=01
--- day=02
--- day=03

In the following code example:

import awswrangler as wr
import pandas as pd
wr.s3.to_parquet(
    df=pd.DataFrame({
        'date': ['2022-08-01', '2022-08-02', '2022-08-03'],
        'col2': ['A', 'A', 'B']
    }),
    path='s3://bucket/prefix',
    dataset=True,
    partition_cols=['date'],
    database='default',
    table='my_table'  # database and table must be passed together
)

The resulting S3 folder structure is:

prefix
- date=2022-08-01
- date=2022-08-02
- date=2022-08-03

The SageMaker Feature Store ingest function (https://sagemaker.readthedocs.io/en/stable/api/prep_data/feature_store.html) does something like this automatically: the event_time_feature_name (timestamp) column drives the Hive-style folder structure it creates in S3.

How can I do this with Data Wrangler without deriving three additional columns from the single date column and declaring them as partitions? I would like to pass in just the one column and have the year/month/day partitions created automatically. The workaround I am trying to avoid is sketched below.
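
For clarity, this is the kind of workaround I mean (a minimal sketch; the derived column names and the table name are just illustrative):

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({
    'date': ['2022-08-01', '2022-08-02', '2022-08-03'],
    'col2': ['A', 'A', 'B']
})

# Manually derive the partition columns from the single date column
dt = pd.to_datetime(df['date'])
df['year'] = dt.dt.strftime('%Y')
df['month'] = dt.dt.strftime('%m')
df['day'] = dt.dt.strftime('%d')

wr.s3.to_parquet(
    df=df,
    path='s3://bucket/prefix',
    dataset=True,
    partition_cols=['year', 'month', 'day'],  # yields prefix/year=2022/month=08/day=01/...
    database='default',
    table='my_table'  # illustrative table name
)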
