File skipping in Delta tables means not reading a file at all because you know that the value you are looking for cannot exist in it. This is determined by looking at the column statistics. Reading about file pruning, it seems to do a similar job. Are these two terms the same and used interchangeably, or is there a difference between the two?

Alex Ott
Ashwin

1 Answer

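File skipping is a specific technique: statistics are collected per file (for example, the minimum and maximum value of each column) and then used to identify the files that may contain the requested data, so that all other files are never read.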
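To make the idea concrete, here is a minimal sketch in plain Python. It is not how Delta implements it; `FileStats`, `files_to_read`, and the file names are hypothetical, and the only assumption is that each file carries a min and max value for the filtered column.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics for one column."""
    path: str
    min_value: int
    max_value: int


def files_to_read(files, wanted):
    """Keep only files whose [min, max] range could contain `wanted`."""
    return [f.path for f in files if f.min_value <= wanted <= f.max_value]


# Illustrative statistics, not taken from a real table.
stats = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 150, 300),
]

# A filter such as `WHERE col = 180` only needs the files whose range covers 180.
selected = files_to_read(stats, 180)
print(selected)
```

Note that a file that passes the check may still not contain the value. The statistics can only prove that a file is irrelevant, never that it is relevant.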

Dynamic file pruning is a specific Spark optimization for performing joins and other related operations efficiently by using the file skipping statistics, among other information. Before that optimization, the file skipping data was used only for "static" filters (WHERE conditions, etc.), that is, filters whose values are known when the query is planned. You can read more details in the following blog post.
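The difference can be sketched in plain Python as well. This is only an illustration of the idea, not the actual Spark implementation; `FactFile`, `filter_dimension`, `dynamic_prune`, and the sample data are hypothetical. The point is that the filter on the big table is not written in the query: it is derived at run time from the rows that survive the filter on the small table, and only then checked against the per-file statistics.

```python
from dataclasses import dataclass


@dataclass
class FactFile:
    """Hypothetical min/max statistics of the join key in one file of the big table."""
    path: str
    min_key: int
    max_key: int


def filter_dimension(rows, item_id):
    """Apply the static filter to the small table and collect the matching join keys."""
    return {row["item_sk"] for row in rows if row["item_id"] == item_id}


def dynamic_prune(fact_files, join_keys):
    """Keep only the files whose key range covers at least one runtime join key."""
    return [
        f.path
        for f in fact_files
        if any(f.min_key <= key <= f.max_key for key in join_keys)
    ]


# Illustrative data, not taken from a real table.
dimension = [
    {"item_sk": 7, "item_id": "A"},
    {"item_sk": 42, "item_id": "B"},
    {"item_sk": 250, "item_id": "A"},
]
fact_files = [
    FactFile("sales-0.parquet", 1, 50),
    FactFile("sales-1.parquet", 51, 200),
    FactFile("sales-2.parquet", 201, 400),
]

keys = filter_dimension(dimension, "A")   # join keys known only at run time
pruned = dynamic_prune(fact_files, keys)  # files of the big table that must be read
print(pruned)
```

In this sketch the static filter on the small table yields keys 7 and 250, so the middle file of the big table is skipped even though the query never mentions a condition on its join key.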

Alex Ott
  • Are dynamic file pruning and file pruning one and the same? – Ashwin Jul 05 '23 at 09:27
  • There is static file pruning and dynamic file pruning. The differences are described in the linked blog post. – Alex Ott Jul 05 '23 at 09:38
  • @AlexOtt - Sorry for the tag. Is the understanding below correct? As per the blog, for Q2: (1) file skipping uses the available statistics (min and max) to skip the files that do not have i_item_id as 'AAA*'; (2) dynamic file pruning, for the identified files, automatically identifies the relevant i_item_sk values and adds a dynamic filter against ss_item_sk in the bigger table. In other words, file skipping uses collected statistics to skip files that are not in range, and DFP uses file skipping among other information (what other info?) to optimize join performance. – rainingdistros Jul 05 '23 at 10:38
  • @AlexOtt I had read the blog prior to asking the question. There are some places where the term "File pruning" is used (without static or dynamic). Eg: Spark statistics gives a metric "Files Pruned". Is it safe to assume that it is static file pruning being talked about when "static" or "dynamic" is not mentioned explicitly? – Ashwin Jul 05 '23 at 11:26
  • The metric should cover both types. – Alex Ott Jul 05 '23 at 11:35