
The default approach

spark-shell --conf spark.hadoop.metastore.catalog.default=hive
val df: DataFrame = ...
df.write.saveAsTable("db.table") 

fails because it tries to create an internal / managed / transactional table (see How to write a table to hive from spark without using the warehouse connector in HDP 3.1).

How can I tell Spark to create an external table rather than a managed one?

Georg Heiler
  • Did you try to map your DataFrame as an SQL "TempView", then run an SQL command such as `CREATE EXTERNAL TABLE ... AS SELECT ...`? – Samson Scharfrichter Oct 16 '19 at 16:23
  • Is there a way to automate this for a dataframe with hundreds of columns? I am not sure about automatically mapping spark types to hive types. – Georg Heiler Oct 16 '19 at 16:24
  • `SELECT *` has been around since the dawn of SQL... I hope Hive and Spark support such a basic feature. – Samson Scharfrichter Oct 16 '19 at 16:25
  • You can also specify a file format that is not supported by Hive ACID, such as Parquet. – Samson Scharfrichter Oct 16 '19 at 16:27
  • But won't I have to specify a schema first for the CREATE TABLE AS SELECT? And there I am unsure about mapping the types. – Georg Heiler Oct 16 '19 at 16:28
  • Using SQL will give you more guarantees about Hive compatibility than raw Spark API calls. You also have a number of Spark properties for compatibility with "legacy" ORC or Parquet column formats -- alas few of these props are in the documentation, better check the Spark source code – Samson Scharfrichter Oct 16 '19 at 16:31
  • The point of `CREATE AS SELECT` is that you inherit the column names, order and types from the SELECT in your CREATE TABLE. If the SELECT contains expressions, then some implicit conversions are expected, of course. – Samson Scharfrichter Oct 16 '19 at 16:33

1 Answer


For now disabling transactional tables by default looks like the best option to me.

Inside Ambari, simply disabling the option of creating transactional tables by default solves my problem.

Set it to false twice in Ambari, once for Hive on Tez and once for LLAP (Hive Interactive):

hive.strict.managed.tables = false

and re-enable it manually via table properties on each table where a transactional table is actually desired.
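
A minimal sketch of the resulting workflow, assuming HDP 3.1 with Spark 2.3 and the hive metastore catalog from the question; the database, table and column names are illustrative. With hive.strict.managed.tables disabled, a plain saveAsTable should create a managed, non-transactional table (Parquet is used here because the follow-up comment below reports ORC still failing):

// started with: spark-shell --conf spark.hadoop.metastore.catalog.default=hive
import org.apache.spark.sql.DataFrame

val df: DataFrame = spark.range(10).toDF("id")   // placeholder DataFrame

// With strict managed tables disabled, this no longer needs to be transactional.
df.write.format("parquet").saveAsTable("db.table")

// A transactional table can still be opted into per table from a Hive client
// (e.g. beeline), roughly:
//   CREATE TABLE db.acid_table (...) STORED AS ORC
//   TBLPROPERTIES ('transactional'='true');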

As a workaround, a manual CTAS could be an option as well, as sketched below.
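
A minimal sketch of that CTAS route, assuming Spark 2.3 on HDP 3.1; the temp view, database, table and LOCATION path below are illustrative. Registering the DataFrame as a temp view lets the CTAS inherit all column names and types from the SELECT (as discussed in the comments above), and giving the table an explicit LOCATION makes Spark record it as external rather than managed:

import org.apache.spark.sql.DataFrame

val df: DataFrame = spark.range(10).toDF("id")   // placeholder; any wide DataFrame works the same way
df.createOrReplaceTempView("tmp_source")

// Columns and types are inherited from the SELECT, so nothing has to be
// mapped by hand, even for hundreds of columns.
spark.sql("""
  CREATE TABLE db.my_external_table
  USING parquet
  LOCATION '/warehouse/tablespace/external/hive/db.db/my_external_table'
  AS SELECT * FROM tmp_source
""")

// Roughly equivalent DataFrame-API form: specifying an explicit path also
// yields an external (unmanaged) table:
// df.write.format("parquet")
//   .option("path", "/warehouse/tablespace/external/hive/db.db/my_external_table")
//   .saveAsTable("db.my_external_table")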

Georg Heiler
    It turns out this is not an ideal answer: only Parquet files work fine; for ORC I still get a `java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)` – Georg Heiler Oct 16 '19 at 19:14