
I have S3 files in the following path formats:

s3://bucket_name/src=email/year=2022/month=9/day=10/hour=1
s3://bucket_name/src=email/year=2022/month=9/day=10/hour=2
.
.
s3://bucket_name/src=sms/year=2022/month=9/day=10/hour=1
s3://bucket_name/src=sms/year=2022/month=9/day=10/hour=2
.
.

I want to read the data for one particular date, e.g. 2022-09-10, using PySpark. I am using the code below for this:

df = spark.read.parquet("s3://bucket_name/*/year=2022/month=9/day=10/")

This gives me the following error:

An error occurred while calling o471.parquet.
: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.

I have tried setting basePath as well, but that gives another error. Any help with reading data from multiple partitions using Spark?
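For reference, a minimal sketch of the basePath approach, assuming the table root is `s3://bucket_name/` and an existing SparkSession; the `day_glob` and `read_day` helper names are hypothetical. Globbing `src=*` explicitly (rather than a bare `*`) keeps every matched directory on the same partition layout, and pointing `basePath` at the common root lets Spark recover `src`/`year`/`month`/`day` as partition columns:

```python
def day_glob(bucket: str, year: int, month: int, day: int) -> str:
    """Build a partition glob covering every src= partition for one date."""
    return f"s3://{bucket}/src=*/year={year}/month={month}/day={day}/"


def read_day(spark, bucket: str, year: int, month: int, day: int):
    """Read one day's parquet data across all src partitions.

    Setting basePath to the table root tells Spark where partitioning
    starts, so the matched directories are treated as partitions of one
    table instead of raising "Conflicting directory structures".
    """
    return (
        spark.read
        .option("basePath", f"s3://{bucket}/")  # root above all partition dirs
        .parquet(day_glob(bucket, year, month, day))
    )
```

If the sources genuinely have different schemas, the alternative from the error message applies: load each `src=` root separately and union the DataFrames.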

seou1
