I have data in ADLS in the form of Parquet tables, "partitioned" by date:
gold/database/fact.parquet/YYYY=2022/YYYYMM=202209/YYYYMMDD=20220904/
Data is loaded into these partitions using:
file_path_gold = "/mnt/{}/{}/{}/{}/{}/{}_checkin.parquet".format(
    source_filesystem_name, data_entity_name, yyyy, yyyyMM, yyyyMMdd, yyyyMMdd)

df_new.coalesce(1).write.format("parquet").mode("overwrite") \
    .partitionBy("YYYY", "YYYYMM", "YYYYMMDD") \
    .save(file_path_gold)
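For reference, reading the table root back directly (no wildcards) should let Spark's partition discovery expose the three directory levels as columns. A minimal sketch, assuming the base path matches the layout shown at the top:

# Sketch: read the base path so Spark can discover YYYY/YYYYMM/YYYYMMDD
# as partition columns (the path here is assumed from the layout above).
df = spark.read.parquet("dbfs:/mnt/gold/database/fact.parquet")
df.printSchema()  # YYYY, YYYYMM, YYYYMMDD should appear in the schema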
I created a Spark table on top of it using:
create table database.fact
using parquet
options (path 'dbfs:/mnt/gold/database/fact.parquet/YYYY=*/YYYYMM=*/YYYYMMDD=*')
I was not able to add anything about the partitioning to the CREATE TABLE statement, e.g. something like:
partition by (YYYY, YYYYMM, YYYYMMDD)
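For completeness, this is roughly the shape of statement I attempted; a sketch only, with placeholder column types rather than my real schema:

# Sketch of the attempted DDL (column list is a placeholder):
spark.sql("""
    create table database.fact (TxDate date, YYYY int, YYYYMM int, YYYYMMDD int)
    using parquet
    partitioned by (YYYY, YYYYMM, YYYYMMDD)
    location 'dbfs:/mnt/gold/database/fact.parquet'
""")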
I was hoping to avoid reading all partitions when querying with Spark SQL. However, I am not able to reference the columns YYYY, YYYYMM, YYYYMMDD in SQL. I have another column, TxDate, which contains the date I am looking for:
select count(*) from database.fact
where TxDate = '2022-12-21'
-- and YYYY = 2022
-- and YYYYMM = 202212
-- and YYYYMMDD = 20221221
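To check whether a filter actually prunes partitions, I look at the physical plan; a minimal sketch (a pruned parquet scan lists the predicate under PartitionFilters instead of reading every directory):

# Sketch: inspect the physical plan to see whether the scan prunes
# partitions (PartitionFilters) or falls back to a full read.
spark.sql("select count(*) from database.fact where TxDate = '2022-12-21'").explain()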