I am trying to create a Hive external table on top of some Avro files which were generated using Spark/Scala. I am using CDH 5.16, which has Hive 1.1 and Spark 1.6.
I created the Hive external table, and the statement ran successfully. But when I query the data I am getting NULL for all the columns.
My problem is similar to this question. Upon some research, I found out it might be a problem with the schema, but I couldn't find a schema file for these Avro files in that location. I am pretty new to the Avro file type, so can someone please help me out here?
Below is the Spark code snippet where I saved the DataFrame as Avro:
df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
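From what I've read, Avro is supposed to be self-describing: each .avro container file carries its writer schema as JSON in the file header metadata, under the key `avro.schema`, so spark-avro should not need a separate .avsc file. To check my own understanding of where the schema lives (this is not Spark code, just a minimal pure-Python sketch of the Avro container header layout, with a made-up two-column schema), I wrote a header by hand and read the embedded schema back out:

```python
import io
import json

def write_varint(buf, n):
    # Avro longs are zig-zag encoded, then written as little-endian base-128 varints.
    n = (n << 1) ^ (n >> 63)  # zig-zag; result is non-negative
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            buf.write(bytes([b | 0x80]))
        else:
            buf.write(bytes([b]))
            return

def read_varint(buf):
    shift = result = 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zig-zag

def write_blob(buf, b):
    write_varint(buf, len(b))
    buf.write(b)

def read_blob(buf):
    return buf.read(read_varint(buf))

# Made-up schema standing in for my real table columns.
SCHEMA = {"type": "record", "name": "prod_order", "fields": [
    {"name": "ProductID", "type": ["null", "string"]},
    {"name": "Quantity",  "type": ["null", "string"]},
]}

# --- write an Avro container header: magic, metadata map, 16-byte sync marker ---
out = io.BytesIO()
out.write(b"Obj\x01")                        # Avro Object Container File magic
write_varint(out, 2)                         # metadata map: block of 2 entries
write_blob(out, b"avro.schema")
write_blob(out, json.dumps(SCHEMA).encode())
write_blob(out, b"avro.codec")
write_blob(out, b"null")
write_varint(out, 0)                         # end of map
out.write(b"\x00" * 16)                      # sync marker (zeros for this sketch)

# --- read the schema back out of the header, the way a reader would ---
inp = io.BytesIO(out.getvalue())
assert inp.read(4) == b"Obj\x01"
meta = {}
count = read_varint(inp)
while count != 0:                            # this writer only emits positive block counts
    for _ in range(count):
        key = read_blob(inp).decode()
        meta[key] = read_blob(inp)
    count = read_varint(inp)
embedded = json.loads(meta["avro.schema"])
print(embedded["name"])                      # the writer schema travels inside the file itself
```

So if I understand correctly, the schema should already be inside the files Spark wrote, even though there is no separate schema file sitting next to them.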
Below is my Hive external table create statement:
create external table prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro';
Below is the query I ran; every column comes back NULL:
select * from prod_order_avro
At the same time, when I read these Avro files as a DataFrame using Spark/Scala and print them, I get the proper result. Below is the Spark code I used to read the data:
val df=hiveContext.read.format("com.databricks.spark.avro").option("header","true").load("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
My questions are:
- While creating these Avro files, do I need to change my Spark code to write the schema to a separate file, or is the schema embedded in the files themselves? If it needs to be separate, how do I achieve that?
- If not, how do I create the Hive table so that the schema is picked up from the files automatically? I read that in later versions Hive takes care of this by itself when the schema is present in the files.
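For the second question, I came across the `avro.schema.url` table property of the Hive AvroSerDe, which apparently lets the SerDe derive the column list from an explicit schema instead of declaring columns by hand, but I'm not sure it's the right approach on Hive 1.1. Something like the sketch below is what I had in mind (the `/user/hive/schemas/prod_order.avsc` path is made up; I don't actually have a standalone .avsc file yet):

```sql
-- A sketch I have not verified: no column list; the AvroSerDe reads the
-- columns from the referenced schema file instead.
CREATE EXTERNAL TABLE prod_order_avro
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro'
TBLPROPERTIES ('avro.schema.url'='/user/hive/schemas/prod_order.avsc');  -- hypothetical path
```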
Kindly help me out here.