
In HDP 3.1.0, with HWC (hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar), I cannot append (or overwrite) to an existing table, depending on the database.

I tested on one database called DSN, where it works, and on another database called CLEAN_CRYPT, where it fails. Both databases are encrypted and Kerberos-secured.

import com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession._

// Build an HWC session on top of the existing SparkSession
val hive = HiveWarehouseBuilder.session(spark).build()
hive.execute("show databases").show()

// Switch to the database on which the write fails
hive.setDatabase("clean_crypt")
val df = hive.execute("select * from test")

// Appending to the existing table fails with "table already exists"
df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", "test").mode("append").save()

The error message is "table already exists". I tried overwrite mode without success. If I drop the table first, the write passes!
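For reference, the only sequence that currently succeeds looks roughly like this (a sketch of what I mean; executeUpdate is the HWC call for DDL statements):

// Dropping the target table first lets the write go through.
// Note: df must come from somewhere other than the dropped table
// (e.g. another database), since Spark only reads it at write time.
hive.executeUpdate("drop table if exists test")
df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", "test").mode("append").save()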

Any idea?

Emiliano Martinez
  • As you mention two different outcomes for different tables, please show the queries for both. Also, are the tables similar (types, partitioning)? – Dennis Jaheruddin Jan 28 '20 at 10:27
  • Hi Dennis, both tables have the same structure, just in different databases. It's curious, because a database is just a directory for Hive. – Jeannot77680 Jan 28 '20 at 12:36
  • Are the security permissions (especially update) the same? Both in Ranger and on HDFS – Dennis Jaheruddin Jan 28 '20 at 13:23
  • The same test works in beeline, so it is not Ranger. A statement like hive.executeQuery("insert into clean_crypt.test select * from dsn.test") works too. It only fails when writing through a DataFrame – Jeannot77680 Jan 28 '20 at 13:29
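Based on that last comment, the SQL-based workaround looks roughly like this (a sketch; the statement is the one reported to work, issued through the same HWC session):

// Route the copy through a SQL insert instead of a DataFrame write
hive.executeQuery("insert into clean_crypt.test select * from dsn.test")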

2 Answers


This is probably related to an HWC bug that has been reported by multiple users here.

What I've found is that it only occurs if you use partitionBy when writing, like:

// This write triggers the bug: partitionBy combined with HWC
df.write.partitionBy("part")
  .mode(SaveMode.Overwrite)
  .format(com.hortonworks.hwc.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "`default`.`testout`")
  .save()

On another note: if you remove the partitionBy call, partitioning works as expected (the partition info is already stored in the Hive table). However, if you then use overwrite mode (rather than, say, append), HWC drops and recreates your table, and it does not reapply the partitioning info.
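Putting the two observations together, a write against an already-partitioned table would drop the partitionBy and append instead (a sketch using the same example table):

// No partitionBy and no overwrite: the partitioning already
// defined on the Hive table is honored
df.write
  .mode(SaveMode.Append)
  .format(com.hortonworks.hwc.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "`default`.`testout`")
  .save()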

zoltanctoth

If you want to use the Hortonworks connector and append to a partitioned table, you should not use partitionBy, as it does not seem to work properly with this connector. Instead, you can use the partition option and set the Spark parameters for dynamic partitioning.

Example:

import org.apache.spark.SparkConf
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hive dynamic-partitioning settings needed for the "partition" write option
val sparkConf = new SparkConf()
  .setMaster("yarn")
  .setAppName("My application")
  .set("hive.exec.dynamic.partition", "true")
  .set("hive.exec.dynamic.partition.mode", "nonstrict")
val spark = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()

val hive = HiveWarehouseBuilder.session(spark).build()
val hiveDatabase = "clean_crypt"
hive.setDatabase(hiveDatabase)
val df = hive.execute("select * from test")

// Example values; substitute your own table name and partition column
val table = "test"
val partitionColumn = "part"
df
    .write
    .format(HIVE_WAREHOUSE_CONNECTOR)
    .mode(SaveMode.Append)
    .option("partition", partitionColumn)
    .option("table", table)
    .save()

For the above, hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar was used. If the table does not exist, the connector creates it and stores it in ORC format by default.
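To confirm what the connector created, you can inspect the table definition through the same session (a quick check; the output should list ORC as the storage format):

// Show the DDL of the table the connector just created
hive.execute("show create table test").show(false)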