Based on the following GitHub thread (https://github.com/databricks/spark-csv/issues/45), I understand that CREATE TABLE + OPTIONS (as with JDBC) creates a Hive external table. Is that right? Such tables don't materialize the data themselves, so no data is lost when the table is dropped via SQL or removed from the Databricks Tables UI.
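For context, a minimal sketch of the kind of statement meant here, following the SQL usage documented in the spark-csv README (the table name and path are made up for illustration):

scala> sqlContext.sql("""CREATE TABLE cars USING com.databricks.spark.csv OPTIONS (path "cars.csv", header "true")""")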
2 Answers
You can very well create an EXTERNAL table in Spark, but you have to take care to use HiveContext instead of SQLContext:
scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._
scala> val hc = new HiveContext(sc)
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@385ff04e
scala> hc.sql("create external table blah ( name string ) location 'hdfs:///tmp/blah'")
res0: org.apache.spark.sql.DataFrame = [result: string]
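A quick way to check the claim, continuing in the same shell (a sketch; it assumes hdfs:///tmp/blah already holds some files and that HDFS is the default filesystem):

scala> hc.sql("drop table blah")   // for an EXTERNAL table this removes only the metadata
scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> FileSystem.get(sc.hadoopConfiguration).exists(new Path("/tmp/blah"))   // still true: the files survive the drop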

Roberto Congiu
From the Spark docs: https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#hive-tables
In Spark SQL, CREATE TABLE ... LOCATION is equivalent to CREATE EXTERNAL TABLE ... LOCATION, in order to prevent accidentally dropping the existing data in the user-provided location. That means a Hive table created in Spark SQL with a user-specified location is always a Hive external table. Dropping external tables will not remove the data. Users are not allowed to specify the location for Hive managed tables. Note that this is different from the Hive behavior.
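A short sketch of that behavior in a Spark 2.x shell (table name and location are hypothetical; spark is the SparkSession the shell provides, built with Hive support):

scala> spark.sql("CREATE TABLE blah2 (name STRING) LOCATION '/tmp/blah2'")
scala> spark.catalog.getTable("blah2").tableType   // "EXTERNAL", even though the DDL never said so
scala> spark.sql("DROP TABLE blah2")               // removes the metadata only; files under /tmp/blah2 remain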

Paul Bendevis