I have data in ADLS in the form of parquet tables "partitioned" by date:

gold/database/fact.parquet/YYYY=2022/YYYYMM=202209/YYYYMMDD=20220904/

Data is loaded into these partitions using:

file_path_gold = "/mnt/{}/{}/{}/{}/{}/{}_checkin.parquet".format(
    source_filesystem_name, data_entity_name, yyyy, yyyyMM, yyyyMMdd, yyyyMMdd)

# Note: 'header' is a CSV option; the parquet writer ignores it.
(df_new.coalesce(1)
    .write.format("parquet")
    .mode("overwrite")
    .option("header", "true")
    .partitionBy("YYYY", "YYYYMM", "YYYYMMDD")
    .save(file_path_gold))
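For context, a minimal self-contained sketch of this write pattern (the sample rows, the column derivations, and the root path are illustrative assumptions, not my production code):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for df_new; the real one comes from an upstream load.
df_new = (
    spark.createDataFrame([("2022-09-04", 100)], ["TxDate", "Amount"])
    .withColumn("YYYY", F.year(F.to_date("TxDate")))
    .withColumn("YYYYMM", F.date_format(F.to_date("TxDate"), "yyyyMM").cast("int"))
    .withColumn("YYYYMMDD", F.date_format(F.to_date("TxDate"), "yyyyMMdd").cast("int"))
)

# partitionBy() creates the YYYY=.../YYYYMM=.../YYYYMMDD=... subfolders
# under whatever path is passed to save().
(df_new.coalesce(1)
    .write.format("parquet")
    .mode("overwrite")
    .partitionBy("YYYY", "YYYYMM", "YYYYMMDD")
    .save("/mnt/gold/database/fact.parquet"))  # hypothetical table root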

I created a Spark table on top of it using:

create table database.fact
using parquet
-- note: header/inferSchema are CSV reader options and have no effect for parquet
options (header true, inferSchema true, path 'dbfs:/mnt/gold/database/fact.parquet/YYYY=*/YYYYMM=*/YYYYMMDD=*')

I was not able to add anything about partitions to the CREATE TABLE statement, such as:

partition by (YYYY, YYYYMM, YYYYMMDD)
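For reference, Spark's documented DDL for a partitioned external table declares the partition columns in the schema, names them in PARTITIONED BY, and points LOCATION at the table root rather than a wildcard path. A sketch of that form (the non-partition column list is a placeholder, and the existing folders would still need to be registered afterwards):

# Sketch of the documented partitioned-table DDL (column list assumed):
spark.sql("""
    CREATE TABLE database.fact (TxDate DATE, Amount INT, YYYY INT, YYYYMM INT, YYYYMMDD INT)
    USING parquet
    PARTITIONED BY (YYYY, YYYYMM, YYYYMMDD)
    LOCATION 'dbfs:/mnt/gold/database/fact.parquet'
""")

# Existing key=value folders are not picked up automatically;
# MSCK REPAIR TABLE scans the location and registers them.
spark.sql("MSCK REPAIR TABLE database.fact")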

With a setup like that, I was hoping to avoid reading all partitions when using Spark SQL. However, I am not able to reference the columns YYYY, YYYYMM, and YYYYMMDD in SQL. There is another column, TxDate, that contains the date I am looking for:

select count(*) from database.fact 
where TxDate = '2022-12-21'
-- and YYYY = 2022
-- and YYYYMM = 202212
-- and YYYYMMDD = 20221221
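Once the partition columns are queryable, whether a filter actually prunes partitions can be checked in the physical plan; a sketch against the same hypothetical table as above:

# Sketch: look for "PartitionFilters: [...]" in the file scan node of the
# plan; an empty list there means every partition is being read.
plan_df = spark.sql("""
    SELECT count(*) FROM database.fact
    WHERE TxDate = '2022-12-21'
      AND YYYYMMDD = 20221221
""")
plan_df.explain(True)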