
I am trying to save the data of a Spark SQL DataFrame to Hive. The data should be partitioned by one of the columns of the DataFrame, so I wrote the following code.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("Hive partitioning")
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

val df = hiveContext.sql("....   my sql query ....")

df.printSchema()
df.write.mode(SaveMode.Append).partitionBy("<partition column>").saveAsTable("orgs_partitioned")

The DataFrame is getting stored as a table with a single column named col of type array&lt;string&gt;; the structure is shown below (screenshot from Hue).

[Screenshot from Hue]

Any pointers are very helpful. Thanks.

Sai Krishna
  • You are just dumping some Spark data to Parquet files, without defining any Hive-compliant schema first. Try using SparkSQL instead, i.e. "registerTempTable", then plain SQL queries like "CREATE TABLE ..." followed by "INSERT ... SELECT ..." – Samson Scharfrichter Sep 29 '16 at 17:32
  • Thank you @Samson, it's working now. – Sai Krishna Sep 30 '16 at 03:17
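
For reference, a sketch of the approach from the comment above: register the DataFrame as a temp table, create the Hive table with an explicit schema and PARTITIONED BY clause, then do a dynamic-partition INSERT ... SELECT. The column names (org_id, org_name, country) are placeholders, not from the original question — adjust them to match the output of df.printSchema().

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("Hive partitioning")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

val df = hiveContext.sql("....   my sql query ....")

// Make the DataFrame visible to SQL queries
df.registerTempTable("staging")

// Define the target table with a Hive-compliant schema.
// Column names and types here are illustrative placeholders.
hiveContext.sql(
  """CREATE TABLE IF NOT EXISTS orgs_partitioned (
    |  org_id STRING,
    |  org_name STRING
    |) PARTITIONED BY (country STRING)
    |STORED AS PARQUET""".stripMargin)

// Dynamic-partition insert: the partition column must come
// last in the SELECT list so Hive can route rows to partitions.
hiveContext.sql(
  """INSERT INTO TABLE orgs_partitioned PARTITION (country)
    |SELECT org_id, org_name, country FROM staging""".stripMargin)

Because the table is created with an explicit schema before any data is written, Hive sees proper columns instead of the single col array&lt;string&gt; column described above.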

0 Answers