Adding pertition to the external table definition does not help with a query on the partition.
Blob path example
- /data/1234/2021/12/02/9483D.parquet
- /data/1235/2021/12/02/12345.parquet
Partition (pseudo syntax not the real one) : '/data/'uniqueid'/yyyy/MM/dd/'
So only two uniqueids values are in the storage path. Total files count ~ 1 million for different dates in the path
So I defined 2 partitions as virtual columns:
- uniqueid
- datetime
Executing a query on the uniqueid like:
table | summarize by uniqueid
goes over all files in the blob storage for some reason.
As the uniqueid is a partition and as virtual column, shouldn't the query be super fast as we have only 2 values in the path for it? Am I totally missing the point of partitioning?
EDIT add smaple:
.create external table ['sensordata'] (['timestamp']:long,['value']:real)
kind = adl
partition by (['uniqueid']:string ,['datecreated']:datetime )
pathformat = (['uniqueid'] '/' datetime_pattern("yyyy/MM/dd", ['daterecorded']))
dataformat = parquet
(
h@'abfss://XXXXXX@YYYYYYYY.dfs.core.windows.net/histdata;impersonate'
)
with (FileExtension='.parquet')
Query sample:
sensordata
| summarize by uniqueid