
I have parquet data stored on S3 and an Athena table partitioned by id and date. The parquet files are stored in

 s3://bucket_name/table_name/id=x/date=y/

The parquet files themselves contain the partition columns (id, date), because of which I am not able to read them using AWS Glue.

I would like to read the data from only a few partitions, and hence I am making use of a partition predicate as follows:

from datetime import date, timedelta

today = date.today()
yesterday = today - timedelta(days=1)

# partition predicate for yesterday's date partition
predicate = "date = date '" + str(yesterday) + "'"

df = glueContext.create_dynamic_frame_from_catalog(database_name, table_name, push_down_predicate=predicate)

However, since the files already contain the partition columns, I am getting the below error:

AnalysisException: Found duplicate column(s) in the data schema and the partition schema: id, date

Is there a way I can read data from only a few partitions like this? Can I somehow read the data by ignoring id and date columns?

Any sort of help is appreciated :)

AswinRajaram

1 Answer


Concerning your first question 'Is there a way I can read data from only a few partitions like this?':

You don't need to use a predicate, in my opinion - the beauty of having partitioned parquet files is that Spark will push any filter applied along those partition columns down to the file-scanning phase. This means Spark can skip entire partitions (and row groups within the files) just by reading the directory structure and the parquet metadata.

Have a look at the physical execution plan once you execute df = spark.read.parquet(...) followed by df.filter(col("date") == '2022-07-19').

You should find something along the lines of

+- FileScan parquet [idxx, ... PushedFilters: [IsNotNull(date), EqualTo(date, 2022-07-19)..
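
As a minimal sketch of that behaviour (the S3 path is the one from the question; this assumes the usual case where the partition columns are not duplicated inside the files, and in a Glue job the existing glueContext.spark_session can be used instead of creating a new session):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Reading the table root lets Spark discover id=/date= as partition columns
df = spark.read.parquet("s3://bucket_name/table_name/")

# Filtering on a partition column is pruned at scan time - check the plan
df.filter(col("date") == "2022-07-19").explain()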

Concerning whether you can read the data by ignoring the id and date columns: you can pass multiple parquet paths at the lowest directory level to the read function, which ignores the date/id partition columns altogether (I don't know why you would do that, though, if you need to filter on them):

# Read specific leaf directories directly
df = spark.read.parquet(
       "file:///your/path/date=2022-07-19/id=55/",
       "file:///your/path/date=2022-07-19/id=40/")

# Shorter solution using a path glob:
df = spark.read.parquet(
       "file:///your/path/date=2022-07-19/id={55,40}/*")
Cribber
  • Thank you for your answer @Cribber. But the main issue here is that the parquet files contained in those folders have `date` and `id` as columns, because of which they are not being read. Do you know why that's happening or if there's a way to avoid it? – AswinRajaram Jul 20 '22 at 13:56
  • @AswinRajaram how about adding option `.option("recursiveFileLookup", "true")` when using `spark.read` – AdibP Jul 21 '22 at 02:16
  • When you `.write()` partitioned parquet data with Spark, the folder structure `/date=2022-07-19/id=1/` is created automatically - but the values of the partition columns are **not** included in the files themselves (they can be deduced from the directory structure, so it saves storage not to write them into the actual files again). I cannot say why the data is still included in yours without seeing how you wrote the files / how the files are stored exactly. – Cribber Jul 21 '22 at 07:38
  • @Cribber Yes, that's the case when Spark is used to write the data, but unfortunately, the data was written using Pandas, and hence the partition columns are present in the final parquet data. Is there any way we can still read the files using Spark? – AswinRajaram Jul 25 '22 at 08:30
  • The only way I see is to use either AdibP's solution with the `recursiveFileLookup` option (see the sketch after these comments) or you gather all directory paths of the lowest level individually and pass them all to `spark.read.parquet()` at once. I haven't tested it myself though... alternatively you can rename the folders so that Spark doesn't recognize them as partition column values. But with any of these solutions you will **lose the performance boost** from the pushed-down filtering along the partitions, as Spark won't know that the data is partitioned. – Cribber Jul 25 '22 at 10:09
  • Perhaps you can also change something in pandas concerning the way the data is written, but I don't know enough about pandas to be able to give advice there. – Cribber Jul 25 '22 at 10:10
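
A minimal sketch of the recursiveFileLookup route mentioned above (the path is the placeholder from the question; with this option Spark does not infer partitions, so date and id are read from inside the files and no partition pruning happens):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# recursiveFileLookup disables partition discovery, so the duplicate-column
# error is avoided; date/id come from the columns stored in the files
df = (spark.read
      .option("recursiveFileLookup", "true")
      .parquet("s3://bucket_name/table_name/"))

# Plain column filter - every file is still listed and scanned (no pruning)
yesterday_df = df.filter(col("date") == "2022-07-19")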