1

I have data in a blob storage account, which is organized in the following structure:

/logs/YYYY/mm/dd/HH

Let's say there is a few years of data and I want to pull data from a specific time period (or time periods) into an Azure Data Warehouse table.

How does the following differ in terms of how Polybase retrieves the data:

1) Create table as Select from External table pointing to the root folder (/logs) with a where clause which takes into account a range of values for a particular time value (e.g Timestamp between x and y).

2) A CTAS from a UNION of external tables which point to the individual folders that make up the specific time periods I want.

I assume 2) above will be quicker, because Polybase doesn't have to look in any unwanted folder? Is this correct?

If so, is there a way to select from an external table which points to a root folder, but in some way only selecting files from a specific time period?

Perhaps there's some documentation that explains this?

Shane
  • 187
  • 2
  • 13

1 Answers1

3

Yes, you are correct.

1 will bring in all the data and then apply where clause in SQL DW.

2 will be quicker as it will only bring the data from the individual folder.

Currently, PolyBase external tables are not folder/file/partition aware to be able to skip based on filter clauses.

Community
  • 1
  • 1