I have a dataframe with account_id column. I want to group all of the distinct account_id rows and write to different S3 buckets. Writing to a new folder for each account_id within a given S3 bucket works too.
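
A minimal sketch of the setup described above (the column names and sample values are placeholders, not real data):

from pyspark.sql import SparkSession

# Placeholder dataframe: an account_id column plus some payload column.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("acct_1", 10), ("acct_1", 20), ("acct_2", 30)],
    ["account_id", "amount"],
)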

1 Answer

If you want all rows with the same account_id to end up in one folder, you can achieve this with the partitionBy function. Below is an example that groups the data by account_id and writes it in Parquet format to a separate folder per account. You can change the mode depending on your use case.

df.write.mode("overwrite").partitionBy('account_id').parquet('s3://mybucket/')

If you want multiple partition levels, you can add more columns to the partitionBy function. For example, if you have a column date with values in yyyy/mm/dd format, the snippet below will create date subfolders inside each account_id folder (a sketch for deriving such a column follows the layout listing).

df.write.mode("overwrite").partitionBy('account_id','date').parquet('s3://mybucket/')

This will write files to S3 in the following layout:

s3://mybucket/account_id=somevalue/date=2020/11/01
s3://mybucket/account_id=somevalue/date=2020/11/02
s3://mybucket/account_id=somevalue/date=2020/11/03
......
s3://mybucket/account_id=somevalue/date=2020/11/30
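
If the date column does not exist yet, here is a rough sketch of one way to derive it, assuming a timestamp column named event_ts (a hypothetical name, not from the question):

from pyspark.sql import functions as F

# Hypothetical event_ts column: format it as a yyyy/MM/dd string and use it
# as the second partition key.
df = df.withColumn('date', F.date_format(F.col('event_ts'), 'yyyy/MM/dd'))
df.write.mode("overwrite").partitionBy('account_id', 'date').parquet('s3://mybucket/')
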
  • Yes. `partitionBy` works fine. Is it possible to add multiple partition keys? Under each `account_id` folder, I want to create yyyy/mm/dd subfolders. – pnhegde Aug 07 '20 at 06:16
  • Thanks. One more question: how do I hide the key name from the subfolders? Instead of `s3://mybucket/account_id=somevalue/date=2020/11/01`, can we make it `s3://mybucket/somevalue/date=2020/11/01`? – pnhegde Aug 07 '20 at 06:46
  • I don't think that is possible, as this behaviour is by design: it saves a lot of time and resources when you filter on the partition column, because all partitions that are not part of the query are skipped (see the sketch below). – Prabhakar Reddy Aug 07 '20 at 06:59
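
As a rough illustration of that pruning behaviour (bucket and account values are placeholders):

# Filtering on the partition column lets Spark read only the matching
# account_id=... folder instead of scanning the whole bucket.
df = spark.read.parquet('s3://mybucket/')
df_one_account = df.filter(df.account_id == 'somevalue')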