I am trying to create a Hive external table on top of some Avro files which were generated using Spark/Scala. I am using CDH 5.16, which has Hive 1.1 and Spark 1.6.
I created the Hive external table, and the statement ran successfully. But when I query the data I am getting NULL for all the columns.
My problem is similar to this question. Upon some research, I found out it might be a problem with the schema, but I couldn't find a schema file for these Avro files in that location. I am pretty new to the Avro file type, so can someone please help me out here?
Below is the Spark code snippet where I saved the DataFrame as Avro:
df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
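From what I've read, Avro is supposed to be self-describing: each .avro container file carries its writer schema as JSON in the file header metadata, under the key `avro.schema`, so spark-avro should not need a separate .avsc file. To check my own understanding of where the schema lives (this is not Spark code, just a minimal pure-Python sketch of the Avro container header layout, with a made-up two-column schema), I wrote a header by hand and read the embedded schema back out:

```python
import io
import json

def write_varint(buf, n):
    # Avro longs are zig-zag encoded, then written as little-endian base-128 varints.
    n = (n << 1) ^ (n >> 63)  # zig-zag; result is non-negative
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            buf.write(bytes([b | 0x80]))
        else:
            buf.write(bytes([b]))
            return

def read_varint(buf):
    shift = result = 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zig-zag

def write_blob(buf, b):
    write_varint(buf, len(b))
    buf.write(b)

def read_blob(buf):
    return buf.read(read_varint(buf))

# Made-up schema standing in for my real table columns.
SCHEMA = {"type": "record", "name": "prod_order", "fields": [
    {"name": "ProductID", "type": ["null", "string"]},
    {"name": "Quantity",  "type": ["null", "string"]},
]}

# --- write an Avro container header: magic, metadata map, 16-byte sync marker ---
out = io.BytesIO()
out.write(b"Obj\x01")                        # Avro Object Container File magic
write_varint(out, 2)                         # metadata map: block of 2 entries
write_blob(out, b"avro.schema")
write_blob(out, json.dumps(SCHEMA).encode())
write_blob(out, b"avro.codec")
write_blob(out, b"null")
write_varint(out, 0)                         # end of map
out.write(b"\x00" * 16)                      # sync marker (zeros for this sketch)

# --- read the schema back out of the header, the way a reader would ---
inp = io.BytesIO(out.getvalue())
assert inp.read(4) == b"Obj\x01"
meta = {}
count = read_varint(inp)
while count != 0:                            # this writer only emits positive block counts
    for _ in range(count):
        key = read_blob(inp).decode()
        meta[key] = read_blob(inp)
    count = read_varint(inp)
embedded = json.loads(meta["avro.schema"])
print(embedded["name"])                      # the writer schema travels inside the file itself
```

So if I understand correctly, the schema should already be inside the files Spark wrote, even though there is no separate schema file sitting next to them.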
Below is my Hive external table create statement:
create external table prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro';
Below is the query I ran; every column comes back NULL:
select * from prod_order_avro
At the same time, when I read these Avro files as a DataFrame using Spark/Scala and print them, I get the proper result. Below is the Spark code I used to read the data:
val df=hiveContext.read.format("com.databricks.spark.avro").option("header","true").load("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
My questions are:
- While creating these Avro files, do I need to change my Spark code to write the schema to a separate file, or is the schema embedded in the files themselves? If it needs to be separate, how do I achieve that?
- If not, how do I create the Hive table so that the schema is picked up from the files automatically? I read that in later versions Hive takes care of this by itself when the schema is present in the files.
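For the second question, I came across the `avro.schema.url` table property of the Hive AvroSerDe, which apparently lets the SerDe derive the column list from an explicit schema instead of declaring columns by hand, but I'm not sure it's the right approach on Hive 1.1. Something like the sketch below is what I had in mind (the `/user/hive/schemas/prod_order.avsc` path is made up; I don't actually have a standalone .avsc file yet):

```sql
-- A sketch I have not verified: no column list; the AvroSerDe reads the
-- columns from the referenced schema file instead.
CREATE EXTERNAL TABLE prod_order_avro
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro'
TBLPROPERTIES ('avro.schema.url'='/user/hive/schemas/prod_order.avsc');  -- hypothetical path
```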
Kindly help me out here.