I have data in ADLS in the form of Parquet tables, "partitioned" by date:
gold/database/fact.parquet/YYYY=2022/YYYYMM=202209/YYYYMMDD=20220904/
Data is loaded into these partitions using:
file_path_gold = "/mnt/{}/{}/{}/{}/{}/{}_checkin.parquet".format(
    source_filesystem_name, data_entity_name, yyyy, yyyyMM, yyyyMMdd, yyyyMMdd)

df_new.coalesce(1).write.format("parquet").mode("overwrite") \
    .partitionBy("YYYY", "YYYYMM", "YYYYMMDD") \
    .save(file_path_gold)
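For reference, reading the table root back directly (no wildcards) should let Spark's partition discovery expose the three directory levels as columns. A minimal sketch, assuming the base path matches the layout shown at the top:

# Sketch: read the base path so Spark can discover YYYY/YYYYMM/YYYYMMDD
# as partition columns (the path here is assumed from the layout above).
df = spark.read.parquet("dbfs:/mnt/gold/database/fact.parquet")
df.printSchema()  # YYYY, YYYYMM, YYYYMMDD should appear in the schema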
I created a Spark table on top of it using:
create table database.fact
using parquet
options (path 'dbfs:/mnt/gold/database/fact.parquet/YYYY=*/YYYYMM=*/YYYYMMDD=*')
I was not able to add anything about the partitioning to the CREATE TABLE statement, e.g. something like:
partition by (YYYY, YYYYMM, YYYYMMDD)
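For completeness, this is roughly the shape of statement I attempted; a sketch only, with placeholder column types rather than my real schema:

# Sketch of the attempted DDL (column list is a placeholder):
spark.sql("""
    create table database.fact (TxDate date, YYYY int, YYYYMM int, YYYYMMDD int)
    using parquet
    partitioned by (YYYY, YYYYMM, YYYYMMDD)
    location 'dbfs:/mnt/gold/database/fact.parquet'
""")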
I was hoping to avoid reading all partitions when querying with Spark SQL. However, I am not able to reference the columns YYYY, YYYYMM, YYYYMMDD in SQL. I have another column, TxDate, which contains the date I am looking for:
select count(*) from database.fact
where TxDate = '2022-12-21'
-- and YYYY = 2022
-- and YYYYMM = 202212
-- and YYYYMMDD = 20221221
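To check whether a filter actually prunes partitions, I look at the physical plan; a minimal sketch (a pruned parquet scan lists the predicate under PartitionFilters instead of reading every directory):

# Sketch: inspect the physical plan to see whether the scan prunes
# partitions (PartitionFilters) or falls back to a full read.
spark.sql("select count(*) from database.fact where TxDate = '2022-12-21'").explain()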