I am trying to understand how to create an external table that supports partition elimination. I can create a view with a column derived using the filepath function, but that can't be used by Spark. I can create an external table using create external table as select, but that gives me a copy of the data. This article from Microsoft implies it can be done.
The native external tables in Synapse pools are able to ignore the files placed in the folders that are not relevant for the queries. If your files are stored in a folder hierarchy (for example - /year=2020/month=03/day=16) and the values for year, month, and day are exposed as the columns, the queries that contain filters like year=2020 will read the files only from the subfolders placed within the year=2020 folder. The files and folders placed in other folders (year=2021 or year=2022) will be ignored in this query. This elimination is known as partition elimination.
The folder partition elimination is available in the native external tables that are synchronized from the Synapse Spark pools. If you have partitioned data set and you would like to leverage the partition elimination with the external tables that you create, use the partitioned views instead of the external tables.
So, how do you expose those partition directories as columns in an external table?