0

Adding pertition to the external table definition does not help with a query on the partition.

Blob path example

  • /data/1234/2021/12/02/9483D.parquet
  • /data/1235/2021/12/02/12345.parquet

Partition (pseudo syntax not the real one) : '/data/'uniqueid'/yyyy/MM/dd/'

So only two uniqueids values are in the storage path. Total files count ~ 1 million for different dates in the path

So I defined 2 partitions as virtual columns:

  1. uniqueid
  2. datetime

Executing a query on the uniqueid like: table | summarize by uniqueid goes over all files in the blob storage for some reason.

As the uniqueid is a partition and as virtual column, shouldn't the query be super fast as we have only 2 values in the path for it? Am I totally missing the point of partitioning?

EDIT add smaple:

.create external table ['sensordata'] (['timestamp']:long,['value']:real)
    kind = adl
partition by (['uniqueid']:string ,['datecreated']:datetime )
pathformat = (['uniqueid']  '/' datetime_pattern("yyyy/MM/dd", ['daterecorded']))
    dataformat = parquet
    (
        h@'abfss://XXXXXX@YYYYYYYY.dfs.core.windows.net/histdata;impersonate'
    )
    with (FileExtension='.parquet')

Query sample:

sensordata
| summarize by uniqueid
user998888
  • 11
  • 3
  • 1
    Can you provide an example of the external table definition and the query itself? If it is set up correctly and the query filters based on the partitioning column it should work as expected. – Avnera Feb 26 '22 at 14:22
  • Edit: added table sample and query sample – user998888 Feb 27 '22 at 14:25

1 Answers1

0

Thanks for your input, @user998888.

We have many optimizations for partitioned external tables, and we invest significant effort in adding more and more optimizations. But we still haven't optimized the type of query like the one you provided. It's on our list.

Slavik N
  • 4,705
  • 17
  • 23