
Based on the following thread on GitHub (https://github.com/databricks/spark-csv/issues/45), I understand that CREATE TABLE + OPTIONS (like JDBC) creates a Hive external table? Tables of this type don't materialize the data themselves, so no data is lost when the table is dropped via SQL or removed from the Databricks Tables UI. See the sketch below for the kind of statement I mean.
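For concreteness, this is roughly what I have in mind (the table name and path are just placeholders, and I'm assuming a HiveContext-backed sqlContext as in Databricks):

sqlContext.sql("""
  CREATE TABLE people
  USING com.databricks.spark.csv
  OPTIONS (path "/mnt/data/people.csv", header "true")
""")
// My expectation: dropping `people` later only removes the metastore entry,
// not the CSV file at the given path.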

jmdev

2 Answers


You can very well create an EXTERNAL table in Spark, but you have to take care to use HiveContext instead of SQLContext:

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala> val hc = new HiveContext(sc)
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@385ff04e

scala> hc.sql("create external table blah ( name string ) location 'hdfs:///tmp/blah'")
res0: org.apache.spark.sql.DataFrame = [result: string]
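If you want to convince yourself that the table really is external, one way (a rough sketch, reusing the hc and HDFS path from above) is to drop it and check that the directory survives:

scala> hc.sql("DROP TABLE blah")

scala> import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.{FileSystem, Path}

scala> val fs = FileSystem.get(sc.hadoopConfiguration)

scala> fs.exists(new Path("/tmp/blah"))   // expected: true, DROP TABLE on an external table leaves the files in place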
Roberto Congiu

From the Spark docs (behaviour since Spark 2.0): https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#hive-tables

In Spark SQL, CREATE TABLE ... LOCATION is equivalent to CREATE EXTERNAL TABLE ... LOCATION in order to prevent accidentally dropping the existing data in the user-provided locations. That means a Hive table created in Spark SQL with a user-specified location is always a Hive external table. Dropping external tables will not remove the data. Users are not allowed to specify the location for Hive managed tables. Note that this is different from the Hive behavior.
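A rough illustration of that behaviour in Spark 2.x (the table name and location below are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("external-table-example")
  .enableHiveSupport()
  .getOrCreate()

// Because a user-specified LOCATION is given, Spark creates this as an EXTERNAL Hive table.
spark.sql("CREATE TABLE blah (name STRING) LOCATION '/tmp/blah'")

// DESCRIBE EXTENDED should list Type: EXTERNAL, and dropping the table
// should leave the files under /tmp/blah untouched.
spark.sql("DESCRIBE EXTENDED blah").show(truncate = false)
spark.sql("DROP TABLE blah")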

Paul Bendevis