
I am new to AWS Glue. I need to write each record in a dynamic frame to a custom folder path in S3.


The following is the target S3 path:

<bucket>/parentfolder/<year>/<month>/<day>/<somegroupid>/<random_file_name>.json

Here, 'year', 'month', 'day', 'somegroupid' are available as columns in each record.

Is it possible to use column values in the record to decide on the path where the JSON file will be written?

Karthik
  • In PySpark, you can use partitionBy when writing your DataFrame to S3: `df.write.partitionBy('year', 'month', 'day', 'somegroupid').json("/parentfolder/")` – blackbishop Jan 21 '21 at 13:46
  • Thanks @blackbishop, I shall look into this. – Karthik Jan 21 '21 at 13:50
  • I could find the equivalent for Glue, and it worked: `glueContext.write_dynamic_frame.from_options(frame = dynamicframe2, connection_type = "s3", connection_options = {"path": "s3://path/","partitionKeys": ["year", "month", "day", "somegroupid"]}, format = "json", transformation_ctx = "datasink3")`. Thanks for your guidance @blackbishop – Karthik Jan 21 '21 at 14:27 (a combined sketch of both approaches follows these comments)
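
Putting the two comments together, here is a minimal sketch of both approaches. It assumes the job already has a GlueContext, and the catalog database, table name, and bucket path are placeholders to replace with your own:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical source: any DynamicFrame that already carries the
# year, month, day and somegroupid columns will do.
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",        # placeholder catalog database
    table_name="my_source_table")  # placeholder source table

# Option 1 (plain Spark): convert to a DataFrame and let partitionBy build the folders.
(dynamic_frame.toDF()
    .write
    .partitionBy("year", "month", "day", "somegroupid")
    .json("s3://<bucket>/parentfolder/"))

# Option 2 (Glue-native): pass partitionKeys in connection_options.
glue_context.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={
        "path": "s3://<bucket>/parentfolder/",
        "partitionKeys": ["year", "month", "day", "somegroupid"]},
    format="json")

Note that both approaches produce Hive-style folder names (year=2021/month=01/...), which is what the answer below discusses.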

1 Answer


Please see Managing Partitions for ETL Output in AWS Glue - Writing Partitions

glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_type = "s3",
    # "$outpath" is a placeholder for your S3 output path, e.g. "s3://my_bucket/logs/"
    connection_options = {"path": "$outpath", "partitionKeys": ["year", "month", "day", "somegroupid"]},
    format = "parquet")

This would give you a layout like: s3://my_bucket/logs/year=2018/month=01/day=23/somegroupid=<value>/

Unfortunately there doesn't seem to be a way to get rid of the field=value part of the folder names; Glue keeps that Hive-style layout because it can be valuable in some cases:

Crawlers not only infer file types and schemas, they also automatically identify the partition structure of your dataset when they populate the AWS Glue Data Catalog. The resulting partition columns are available for querying in AWS Glue ETL jobs or query engines like Amazon Athena.

Systems like Amazon Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by partition value without having to read all the underlying data from Amazon S3.
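
For example, here is a minimal PySpark sketch of that kind of partition pruning, assuming the partitioned output written above and Spark's default partition column type inference (numeric partition values come back as integers):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark discovers year/month/day/somegroupid as partition columns
# from the key=value folder names.
logs = spark.read.parquet("s3://my_bucket/logs/")

# A filter on partition columns only lists and reads objects under the
# matching year=2018/month=01/day=23/ prefixes instead of the whole dataset.
logs.filter((logs.year == 2018) & (logs.month == 1) & (logs.day == 23)).show()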

Stefan