
I have a Hive Parquet table which I create using the Spark 2.3 API df.saveAsTable. A separate Hive process alters the same Parquet table to add columns (based on requirements). However, the next time I read the table into a Spark dataframe, the new column that was added to the Parquet table with the Hive Alter Table command does not show up in the df.printSchema output.

Based on initial analysis, it seems there may be a conflict, with Spark using its own schema instead of reading from the Hive metastore. Hence, I tried the options below:

Changing the Spark setting: spark.sql.hive.convertMetastoreParquet=false

Refreshing the Spark catalog: spark.catalog.refreshTable("table_name")
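In spark-shell, those two attempts look roughly like this (a sketch only; mydb.my_table is a placeholder table name, and a Hive-enabled SparkSession is assumed):

```scala
// Sketch: assumes spark-shell with Hive support; mydb.my_table is a placeholder.

// Option 1: fall back to the Hive metastore schema for Parquet tables
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// Option 2: invalidate Spark's cached metadata for the table
spark.catalog.refreshTable("mydb.my_table")

// Re-read the table and check whether the new column appears
val df = spark.table("mydb.my_table")
df.printSchema()
```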

However, the above two options are not solving the problem.

Any suggestions or alternatives would be super helpful.


2 Answers


This sounds like a bug described in SPARK-21841. The JIRA description also contains an idea for a possible workaround:

...Interestingly enough it appears that if you create the table differently like:

spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit 1")

Run your alter table on mydb.t1

val t1 = spark.table("mydb.t1")

Then it works properly...
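Put together, that workaround would look something like this in spark-shell (a sketch; the database/table/column names are assumptions, not from the JIRA):

```scala
// Sketch of the SPARK-21841 workaround; mydb.test_table, mydb.t1 and new_col
// are placeholder names, and Hive support is assumed to be enabled.

// 1. Create the table via a CTAS statement instead of saveAsTable
spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit 1")

// 2. Run the ALTER TABLE (here via Spark SQL, or independently from Hive)
spark.sql("alter table mydb.t1 add columns (new_col string)")

// 3. Re-read: the new column should now appear in the schema
val t1 = spark.table("mydb.t1")
t1.printSchema()
```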

  • I see... but there would be a separate Hive process independently issuing Alter Table commands. That might not work out so easily in our case. – user2717470 Jul 01 '19 at 21:16
  • I wonder if there would be an option to dynamically build a create table command in Spark based on the dataframe. That might help, I guess. – user2717470 Jul 01 '19 at 21:16
  • Also, the JIRA mentions that using df.write.format("hive") would solve the problem, but that is not working either, even in Spark 2.3. – user2717470 Jul 01 '19 at 21:18
  • Can you alter your table using SparkSQL instead of Hive? What happens then? – mazaneicha Jul 02 '19 at 20:04
  • 2
    it works ...if i use spark SQL . I used the dataframe to dynamically generate the create table statement and then loaded the data through Spark . After i used the create table statement instead of the Datsource API(saveAsTable) , I was able to read new columns which were added to the hive table using HIve Alter table. – user2717470 Jul 03 '19 at 04:45
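One way to sketch what the commenter describes, generating the CREATE TABLE statement from the dataframe's schema instead of calling saveAsTable (df, mydb.my_table, and the Parquet storage format are assumptions for illustration):

```scala
// Sketch: build a CREATE TABLE statement from df.schema, then insert the data.
// df and mydb.my_table are placeholders; Parquet storage is an assumption.
val ddlColumns = df.schema.fields
  .map(f => s"${f.name} ${f.dataType.catalogString}")
  .mkString(", ")

spark.sql(s"create table if not exists mydb.my_table ($ddlColumns) stored as parquet")
df.write.insertInto("mydb.my_table")

// Columns later added via Hive's ALTER TABLE should then be visible
// after refreshing Spark's cached metadata:
spark.catalog.refreshTable("mydb.my_table")
spark.table("mydb.my_table").printSchema()
```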

To fix this, you have to run the same alter command that was used in Hive from spark-shell as well.

spark.sql("alter table TABLE_NAME add COLUMNS (col_A string)")