
When trying to use Spark 2.3 on HDP 3.1 to write to a Hive table without the warehouse connector, directly into Hive's schema, using:

spark-shell --driver-memory 16g --master local[3] --conf spark.hadoop.metastore.catalog.default=hive
val df = Seq(1,2,3,4).toDF
spark.sql("create database foo")
df.write.saveAsTable("foo.my_table_01")

fails with:

Table foo.my_table_01 failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional

but a:

val df = Seq(1,2,3,4).toDF.withColumn("part", col("value"))
df.write.partitionBy("part").option("compression", "zlib").mode(SaveMode.Overwrite).format("orc").saveAsTable("foo.my_table_02")

succeeds, and reading the table back in Spark with spark.sql("select * from foo.my_table_02").show works just fine. Now going to Hive / beeline:

0: jdbc:hive2://hostname:2181/> select * from my_table_02;
Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)

A

 describe extended my_table_02;

returns

 +-----------------------------+----------------------------------------------------+----------+
|          col_name           |                     data_type                      | comment  |
+-----------------------------+----------------------------------------------------+----------+
| value                       | int                                                |          |
| part                        | int                                                |          |
|                             | NULL                                               | NULL     |
| # Partition Information     | NULL                                               | NULL     |
| # col_name                  | data_type                                          | comment  |
| part                        | int                                                |          |
|                             | NULL                                               | NULL     |
| Detailed Table Information  | Table(tableName:my_table_02, dbName:foo, owner:hive/bd-sandbox.t-mobile.at@SANDBOX.MAGENTA.COM, createTime:1571201905, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:value, type:int, comment:null), FieldSchema(name:part, type:int, comment:null)], location:hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{path=hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, compression=zlib, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:part, type:int, comment:null)], parameters:{numRows=0, rawDataSize=0, spark.sql.sources.schema.partCol.0=part, transient_lastDdlTime=1571201906, bucketing_version=2, spark.sql.create.version=2.3.2.3.1.0.0-78, totalSize=740, spark.sql.sources.schema.numPartCols=1, spark.sql.sources.schema.part.0={\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"part\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}, numFiles=4, numPartitions=4, spark.sql.partitionProvider=catalog, spark.sql.sources.schema.numParts=1, spark.sql.sources.provider=orc, transactional=true}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, rewriteEnabled:false, catName:hive, ownerType:USER, writeId:-1) |

How can I use Spark to write to Hive without using the warehouse connector, but still writing to the same metastore which can later on be read by Hive? To the best of my knowledge, external tables should be possible (they are not managed, not ACID, not transactional), but I am not sure how to tell saveAsTable to handle these.
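
To illustrate: my understanding is that supplying an explicit path should make saveAsTable register an unmanaged (external) table, though I am not sure this survives the new HDP defaults. A minimal sketch of what I have in mind (the location is only a placeholder):

    // Sketch only: to my understanding, passing an explicit path makes saveAsTable
    // register an unmanaged (external) table instead of a managed one.
    // The location below is a placeholder, not the real cluster path.
    import org.apache.spark.sql.SaveMode

    val df = Seq(1, 2, 3, 4).toDF
    df.write
      .mode(SaveMode.Overwrite)
      .format("orc")
      .option("path", "/warehouse/tablespace/external/hive/foo.db/my_table_03") // placeholder
      .saveAsTable("foo.my_table_03")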

edit

Related issues:

There might be a workaround like https://github.com/qubole/spark-acid, similar to the HWC described at https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html, but I do not like the idea of using more duct tape where I have not seen any large-scale performance tests just yet. Also, this means changing all existing Spark jobs.
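
For reference, judging from the qubole/spark-acid README the datasource would be used roughly like this (format name and option key as I read them there, not verified on this cluster):

    // Rough sketch based on the qubole/spark-acid README (unverified here):
    // the library exposes Hive 3 ACID tables through a "HiveAcid" data source.
    val acidDf = spark.read
      .format("HiveAcid")
      .option("table", "foo.my_table_01")
      .load()
    acidDf.show()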

In fact, "Can't save table to Hive metastore, HDP 3.0" reports issues with large data frames and the warehouse connector.

edit

I just found https://community.cloudera.com/t5/Support-Questions/Spark-hive-warehouse-connector-not-loading-data-when-using/td-p/243613

And:

execute() vs executeQuery()

ExecuteQuery() will always use the Hiveserver2-interactive/LLAP as it uses the fast ARROW protocol. Using it when the JDBC URL points to the non-LLAP Hiveserver2 will yield an error.

Execute() uses JDBC and does not have this dependency on LLAP, but has a built-in restriction to return at most 1,000 records. But for most queries (INSERT INTO ... SELECT, count, sum, average) that is not a problem.
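
To make the difference concrete, an illustrative sketch of how both calls are used with the HiveWarehouseSession (the same class as further below); this is only meant to show the two entry points:

    // Illustrative only: the two HWC entry points discussed above.
    import com.hortonworks.hwc.HiveWarehouseSession

    val hive = HiveWarehouseSession.session(spark).build()

    // goes through HiveServer2-interactive/LLAP using the ARROW protocol
    hive.executeQuery("SELECT * FROM foo.my_table_02").show()

    // goes through plain JDBC and caps the size of the returned result set
    hive.execute("SELECT count(*) FROM foo.my_table_02").show()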

But doesn't this kill any high-performance interoperability between Hive and Spark? Especially if there are not enough LLAP nodes available for large-scale ETL.

In fact, this is true. This setting can be configured at https://github.com/hortonworks-spark/spark-llap/blob/26d164e62b45cfa1420d5d43cdef13d1d29bb877/src/main/java/com/hortonworks/spark/sql/hive/llap/HWConf.java#L39, though I am not sure of the performance impact of increasing this value.

Georg Heiler
  • Did you try explicitly setting the table storage format to something non-default (i.e. non-ORC) that is not supported by Hive ACID, and hence should not mess with the new ACID-by-default settings? Like Parquet, AVRO, CSV, whatever? – Samson Scharfrichter Oct 16 '19 at 16:38
  • IMHO the best way to deal with that is to disable the new "ACID-by-default" setting in Ambari. If and when you need ACID, make it explicit in the `CREATE TABLE` in Hive -- the way it was in HDP 2.x – Samson Scharfrichter Oct 16 '19 at 16:40
  • That sounds very sensible. Do you know where to change it / the key of this property? – Georg Heiler Oct 16 '19 at 17:05

4 Answers


Did you try

    data.write \
        .mode("append") \
        .insertInto("tableName")
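
For context, insertInto (unlike saveAsTable) writes into a table that must already exist and matches columns by position, so the target table would have to be created beforehand (for example directly in Hive) with the desired non-transactional properties. A minimal Scala sketch with a hypothetical table name:

    // Assumes foo.existing_table was created beforehand (e.g. in beeline) with the
    // desired storage format and table properties; insertInto matches columns by position.
    import org.apache.spark.sql.SaveMode

    val data = Seq(1, 2, 3, 4).toDF("value")
    data.write
      .mode(SaveMode.Append)
      .insertInto("foo.existing_table")
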
Vijesh Kk

Inside Ambari, simply disabling the option of creating transactional tables by default solves my problem.

Set the following to false twice (for Hive on Tez and for Hive LLAP):

hive.strict.managed.tables = false

and enable it manually via table properties if desired (to use a transactional table).
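
As Samson Scharfrichter suggests in the comments below, the same Hive properties can alternatively be injected into the metastore client inside the Spark context via spark.hadoop.* properties. A hedged sketch (the property names are the Hive 3 ones; whether the client-side override actually takes effect depends on the HDP build):

    // Hedged sketch: inject Hive-side defaults into Spark's embedded metastore client.
    // Whether these client-side overrides take effect depends on the HDP build.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("non-transactional-hive-writes")
      .config("spark.hadoop.hive.strict.managed.tables", "false")
      .config("spark.hadoop.hive.create.as.insert.only", "false")
      .config("spark.hadoop.metastore.create.as.acid", "false")
      .enableHiveSupport()
      .getOrCreate()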

Georg Heiler
  • This only works for parquet, not for ORC. In the latter case: `java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)` remains. Can this also be fixed? – Georg Heiler Oct 16 '19 at 19:15
  • Was that a new ORC table, created without the transactional props? Or an existing ORC table _(transactional cannot be reverted, ever...)_? And of course, did you restart the Metastore service? – Samson Scharfrichter Oct 16 '19 at 21:40
  • A new one. I explicitly restarted spark shell and dropped the existing tables. ambari restarted all affected services. – Georg Heiler Oct 17 '19 at 04:31
  • Did you try toggling `spark.sql.orc.impl`? Should be `hive` by default in 2.3 and `native` in 2.4+ >> Also, did you check that the new tables are indeed created "not transactional"? – Samson Scharfrichter Oct 17 '19 at 08:50
  • For branch 2.3 the source code about SparkSQL props (incl. all the un-documented stuff) is https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala – Samson Scharfrichter Oct 17 '19 at 08:51
  • Thanks for the link. Unfortunately, switching to Orc 1.4 / `native` does not solve the issue. I even restarted all the components, but still the property `transactional=true` is found when spark creates the table. How can I change / disable this behaviour of hive? – Georg Heiler Oct 17 '19 at 09:17
  • 1
    When you create the table from Hive itself, is it "transactional" or not? If not, then the trick is to inject the appropriate Hive property into the config used by the Hive-Metastore-client-inside-Spark-Context. Either via a custom `hive-site.xml` in a directory that's in the CLASSPATH _(that's how Hadoop libs search for their config)_ or via a custom `spark.hadoop.*` property passed to Spark and then injected automagically into the Hadoop props, overriding the default config files. – Samson Scharfrichter Oct 17 '19 at 10:04
  • Indeed, when hive is creating the table all by itself the `transactional=true` is already added. Which properties would I need to set to get this trick to work? – Georg Heiler Oct 17 '19 at 10:18
  • I'm trying to do something similar @GeorgHeiler would really appreciate any thoughts you might have on this problem – Topde Aug 04 '21 at 17:57
  • Well, most of my thoughts are already laid out here. In general, there is no ideal way to bridge Hive 3 ACID tables and Spark. All possibilities have certain trade-offs. You could perhaps resort completely to Delta Lake and have Hive read-only tables that at least work well for their ACID properties. – Georg Heiler Aug 04 '21 at 18:00

Creating an external table (as a workaround) seems to be the best option for me. This still involves the HWC, but only to register the column metadata or update the partition information.

Something along these lines:

import org.apache.spark.sql.{DataFrame, SaveMode}
import com.hortonworks.hwc.HiveWarehouseSession

val df: DataFrame = ...
val tableName = "my_db.my_table"
val externalPath = "/warehouse/tablespace/external/hive/my_db.db/my_table"

val hive = HiveWarehouseSession.session(spark).build()

// write the ORC files directly to the external location (no HWC involved here)
df.write.partitionBy("part_col").option("compression", "zlib").mode(SaveMode.Overwrite).orc(externalPath)

// derive the column list for the DDL from the DataFrame schema, minus the partition column
val columns = df.drop("part_col").schema.fields.map(field => s"${field.name} ${field.dataType.simpleString}").mkString(", ")
val ddl =
      s"""
         |CREATE EXTERNAL TABLE $tableName ($columns)
         |PARTITIONED BY (part_col string)
         |STORED AS ORC
         |LOCATION '$externalPath'
       """.stripMargin

// register the table and its partitions through the HWC
hive.execute(ddl)
hive.execute(s"MSCK REPAIR TABLE $tableName SYNC PARTITIONS")

Unfortunately, this throws a:

java.sql.SQLException: The query did not generate a result set!

from HWC
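
A possible way around that exception (untested in this setup) might be hive.executeUpdate, which the HWC exposes for statements that do not return a result set:

    // Untested sketch: executeUpdate is the HWC call meant for DDL/DML statements
    // that return no result set, unlike execute/executeQuery.
    hive.executeUpdate(ddl)
    hive.executeUpdate(s"MSCK REPAIR TABLE $tableName SYNC PARTITIONS")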

Georg Heiler
  • Though, I would prefer (as Samson Scharfrichter suggests) to reconfigure Hive to not set the `transactional` property by default. However, I could not find a way to do this. – Georg Heiler Oct 17 '19 at 13:39

"How can I use spark to write to hive without using the warehouse connector but still writing to the same metastore which can later on be read by hive?"

We are working on the same setup (HDP 3.1 with Spark 2.3). Using the code below, we were getting the same error message as you: "bucketId out of range: -1". The solution was to run `set hive.fetch.task.conversion=none;` in the Hive shell before trying to query the table.

The code to write data into Hive without the HWC:

  import java.io.File
  import org.apache.spark.sql.{SaveMode, SparkSession}

  val warehouseLocation = new File("spark-warehouse").getAbsolutePath

  case class Record(key: Int, value: String)

  val spark = SparkSession.builder()
    .master("yarn")
    .appName("SparkHiveExample")
    .config("spark.sql.warehouse.dir", warehouseLocation)
    .enableHiveSupport()
    .getOrCreate()

  spark.sql("USE databaseName")
  val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
  recordsDF.write.mode(SaveMode.Overwrite).format("orc").saveAsTable("sparkhive_records")

[Example from https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html]

Michael Heil