
I have data files (JSON in this example, but they could also be Avro) written in a directory structure like:

dataroot
+-- year=2015
    +-- month=06
        +-- day=01
            +-- data1.json
            +-- data2.json
            +-- data3.json
        +-- day=02
            +-- data1.json
            +-- data2.json
            +-- data3.json
    +-- month=07
        +-- day=20
            +-- data1.json
            +-- data2.json
            +-- data3.json
        +-- day=21
            +-- data1.json
            +-- data2.json
            +-- data3.json
        +-- day=22
            +-- data1.json
            +-- data2.json

Using spark-sql I create a temporary table:

CREATE TEMPORARY TABLE dataTable
USING org.apache.spark.sql.json
OPTIONS (
  path "dataroot/*"
)

Querying the table works well, but so far I haven't been able to use the directories for pruning.

Is there a way to register the directory structure as partitions (without using Hive) to avoid scanning the whole tree when I query? Say I want to compare data for the first day of every month and only read directories for these days.
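
For concreteness, here is a sketch of the kind of query I'd like to prune (the value column is hypothetical); ideally the day predicate would restrict the scan to the day=01 directories:

SELECT year, month, avg(value)
FROM dataTable
WHERE day = 1
GROUP BY year, month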

With Apache Drill I can use directories as predicates during query time with dir0 etc. Is it possible to do something similar with Spark SQL?
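
For reference, in Drill such a query might look roughly like this (the dfs path is a placeholder; dir0, dir1 and dir2 refer to the year, month and day directory levels, so their values include the whole directory name):

SELECT *
FROM dfs.`/path/to/dataroot`
WHERE dir2 = 'day=01'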

Dev
Lundahl

2 Answers


As far as I know, partition autodiscovery only works for Parquet files in Spark SQL. See http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
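
A minimal sketch of both points, assuming the data were rewritten as Parquet (the year=/month=/day= levels are then discovered as partition columns), plus a JSON workaround that uses a Hadoop glob in the path to list only the day=01 directories (note the year/month/day values are not exposed as columns in that case):

CREATE TEMPORARY TABLE parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "dataroot"
)

CREATE TEMPORARY TABLE firstDays
USING org.apache.spark.sql.json
OPTIONS (
  path "dataroot/year=*/month=*/day=01"
)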

Arnon Rotem-Gal-Oz
  • While Parquet file reading does support the discovery of Hive-like partitions, in my experience the data directories aren't loaded as partitions. I.e., when you query on the partitioned key, you are still scanning the whole tree (i.e., no pruning - not what the asker wants). – Jack Leow Sep 07 '16 at 18:28
  • @Jack it uses the partitions to skip directories and not read them (again, for Parquet files only) – Arnon Rotem-Gal-Oz Sep 08 '16 at 18:38
  • I've actually learned a bit more since that comment. The optimization you gain from partitions is available, but only when using the DataFrame API. If you convert to an RDD (or even when using the Dataset API, as far as I can tell), it will always read all directories, even when it doesn't have to. IOW, what you say is accurate, but only when using the DataFrame API. – Jack Leow Sep 14 '16 at 22:51
  • Is partition pruning available on text-file-based tables in Spark now? – Harikrishnan Ck Jul 06 '17 at 01:26

Use EXPLAIN to see the physical plan, so you can check which folders will be scanned.
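
For example, against the table from the question (the day column assumes partition discovery worked):

EXPLAIN SELECT * FROM dataTable WHERE day = 1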

Also, you can declare the partitions when creating the table so that Spark can use them.
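
A sketch of that route (note it goes through Hive support, which the question wanted to avoid; the column list and storage format here are assumptions):

CREATE EXTERNAL TABLE partitionedTable (value STRING)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 'dataroot'

ALTER TABLE partitionedTable ADD PARTITION (year=2015, month=6, day=1)
LOCATION 'dataroot/year=2015/month=06/day=01'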

I am not sure Spark 1.6 applies partition pruning correctly: with spark.sql.hive.convertMetastoreParquet set to false I can see the pruning, but with it set to true (the default) Spark seems to scan all partitions (although this does not appear to affect performance at all).
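
To experiment with that setting in the spark-sql shell:

SET spark.sql.hive.convertMetastoreParquet=false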

Thomas Decaux
  • Every problem is different; however, on a different problem, flipping that switch has helped. IMHO, it is always worth a try. – Marcel Flygare Jan 23 '20 at 13:00