
In spark-shell, how do I load an existing Hive table, but only one of its partitions?

val df = spark.read.format("orc").load("mytable")

I'm looking for a way to load only one particular partition of this table.

Thanks!

ernest_k
kcode2019

2 Answers


There is no direct way to do this with spark.read.format, but you can use a where condition:

val df = spark.read.format("orc").load("mytable").where($"your_partition_column" === "your_value")

Unless and until you perform an action, nothing is loaded, since load (pointing to your ORC file location) is just a function in DataFrameReader, as shown below; it doesn't load anything until an action is triggered.

See DataFrameReader:

 def load(paths: String*): DataFrame = {
...
}

In the above code (spark.read...), where is just a filter condition; when you specify it, the data still won't be loaded immediately. :-)

When you call df.count, the filter on your partition column will be applied to the data paths of the ORC files, so only the matching partition directory is read (partition pruning).
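As a plain-Scala analogy of this lazy behavior (no Spark involved; the object and variable names here are purely illustrative): with a lazy collection view, attaching a filter reads nothing, and the data is only scanned once a terminal operation, playing the role of an action like count, forces evaluation.

```scala
object LazyLoadSketch {
  def main(args: Array[String]): Unit = {
    var touched = 0
    // "load": build a lazy view over the data; nothing is read yet
    val rows = (1 to 10).view.map { n => touched += 1; n }
    // "where": attach a filter, still without reading anything
    val filtered = rows.filter(_ % 2 == 0)
    assert(touched == 0, "no rows should be touched before an action")
    // "count": the terminal operation that finally forces evaluation
    val count = filtered.size
    assert(count == 5)
    assert(touched == 10, "rows are scanned only once the action runs")
    println(s"count = $count, rows touched = $touched")
  }
}
```

The analogy is loose (Spark additionally prunes partition directories at the source, rather than scanning and then filtering), but the evaluation order is the same: transformations are recorded, not executed, until an action runs.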

Ram Ghadiyaram

There is no function in the Spark API to load only a partition directory, but there is another way around it: a partition directory is nothing but a column in a where clause. You can write a simple SQL query with the partition column in the where clause, and it will read data only from the partition directory. See if that works for you.

val df = spark.sql("SELECT * FROM mytable WHERE <partition_col_name> = <expected_value>")
Ajay Kharade