
Hive tables created through Spark (pyspark) are not accessible from Hive.

df.write.format("orc").mode("overwrite").saveAsTable("db.table")

Error while accessing the table from Hive:

Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)

The table is created successfully, and I am able to read it back in Spark. The table metadata is accessible in Hive, and the data files are present in the table's HDFS directory.

The TBLPROPERTIES of the Hive table are:

  'bucketing_version'='2',                         
  'spark.sql.create.version'='2.3.1.3.0.0.0-1634', 
  'spark.sql.sources.provider'='orc',              
  'spark.sql.sources.schema.numParts'='1',

I also tried creating the table with other workarounds, but I get an error at table creation:

df.write.mode("overwrite").saveAsTable("db.table")

OR

df.createOrReplaceTempView("dfTable")
spark.sql("CREATE TABLE db.table AS SELECT * FROM dfTable")

Error:

AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table default.src failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);'

Stack version details:

  • Spark 2.3
  • Hive 3.1
  • Hortonworks Data Platform (HDP) 3.0


2 Answers


From HDP 3.0, the catalogs for Apache Hive and Apache Spark are separated and mutually exclusive: the Hive catalog can only be accessed by Hive or by this library (the Hive Warehouse Connector), and the Spark catalog can only be accessed by the existing Spark APIs. In other words, features such as ACID tables or Apache Ranger authorization on Hive tables are only available from Spark through this library, and those Hive tables are not directly accessible through the Spark APIs themselves.

  • The article below explains the steps:

Integrating Apache Hive with Apache Spark - Hive Warehouse Connector
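
A minimal pyspark sketch of what that access pattern looks like, assuming the HWC assembly jar and its pyspark zip are passed to the job (via --jars / --py-files) and spark.sql.hive.hiveserver2.jdbc.url is configured; db and table are the placeholder names from the question:

from pyspark_llap import HiveWarehouseSession

# Build an HWC session on top of the existing SparkSession `spark`;
# reads and writes then go through Hive instead of the Spark catalog.
hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("db")

# Read a Hive-managed table into a Spark DataFrame
df = hive.executeQuery("SELECT * FROM table")

# Write the DataFrame back to a Hive-managed table
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
    .option("table", "table") \
    .save()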


I faced the same issue; after setting the following properties, it works fine.

set hive.mapred.mode=nonstrict;
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;
set hive.tez.bucket.pruning=true;
set hive.explain.user=false; 
set hive.fetch.task.conversion=none;
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
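
These are Hive session settings, applied for example in beeline before querying the table (or globally in hive-site.xml); the last two enable the ACID transaction manager that Hive 3 expects for managed tables.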
  • Which Hadoop distribution are you using? And what are the versions of Spark and Hive? – Shantanu Sharma Jun 04 '19 at 08:58
  • I am using HDP3, Hive-3.1.0, and Spark-3.1.0 – Gowtham SB Jun 04 '19 at 09:05
  • I think you mean Spark 2.3.1, right? From HDP 3.0 the way we access Hive tables from Spark has changed: https://community.hortonworks.com/content/kbentry/223626/integrating-apache-hive-with-apache-spark-hive-war.html . Are you accessing Hive tables using these steps? – Shantanu Sharma Jun 04 '19 at 09:31