
I created a Hive table with the following command and the Avro schema I had:

CREATE TABLE table_name
PARTITIONED BY (t string, y string, m string, d string, h string, hh string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='hdfs://location/schema.avsc');

Now I want to dump the data I have in HDFS into the created table.

I have an HDFS location where the data is stored in a directory structure like t/y/m/d/h/hh/data.avro. There are multiple directories, one per partition, because those are my partition columns.

I want to dump all of that data into the created table.

I tried using an external table, but it throws exceptions.

KrunalParmar

4 Answers


If you follow the Hive convention in your HDFS folders and create the Hive table pointing to that location, you can run the MSCK REPAIR TABLE statement to register the partitions.

For example

CREATE TABLE table_name
PARTITIONED BY (t string, y string, m string, d string, h string, hh string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///location/data'
TBLPROPERTIES (
'avro.schema.url'='hdfs://location/schema/schema.avsc');

and place the data files like this:

/location/data/y=2016/m=02/d=03/h=03/hh=12/data.avro
/location/data/y=2016/m=02/d=03/h=03/hh=13/data2.avro

That way you will be able to load the data with the following statement, because Hive will recognize the partitions:

msck repair table table_name;
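
To check that the partitions were actually registered, you can list them afterwards:

-- should show one entry per partition directory found under the table location
SHOW PARTITIONS table_name;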

If you don't want to do this, you can add the partitions one by one, like:

ALTER TABLE table_name ADD PARTITION (y='2016',m='02',d='03',h='03',hh='12') LOCATION '/yourpath/y=2016/m=02/d=03/h=03/hh=12/'; (or any other folder)
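
Note that every partition column declared on the table (including t from the question) has to appear in the partition spec. If there are several directories to register by hand, Hive also lets you add multiple partitions in one statement; a sketch with illustrative values and paths:

ALTER TABLE table_name ADD
PARTITION (t='type1',y='2016',m='02',d='03',h='03',hh='12') LOCATION '/yourpath/t=type1/y=2016/m=02/d=03/h=03/hh=12/'
PARTITION (t='type1',y='2016',m='02',d='03',h='03',hh='13') LOCATION '/yourpath/t=type1/y=2016/m=02/d=03/h=03/hh=13/';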
hlagos

You could add the LOCATION 'path_hdfs' clause to the query that creates an external table, or use the command LOAD DATA INPATH 'path_hdfs' OVERWRITE INTO TABLE table_name; with a managed (physical) table.
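
A rough sketch of the two options (table names, columns, and paths are placeholders; STORED AS AVRO needs Hive 0.14+):

-- Option 1: external table that just points at the existing HDFS directory
CREATE EXTERNAL TABLE my_external_table (id INT, name STRING)
STORED AS AVRO
LOCATION 'hdfs:///path_hdfs';

-- Option 2: managed table, then move the files from HDFS into it
CREATE TABLE my_managed_table (id INT, name STRING)
STORED AS AVRO;
LOAD DATA INPATH 'hdfs:///path_hdfs/data.avro' OVERWRITE INTO TABLE my_managed_table;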

UPDATE:

As asked by the OP, I am adding more info about partitioning. You have to add the clause PARTITIONED BY (day string) (an example for a 'day' column of type 'string') to the creation query. For a full example, refer to the answer provided by lake. Then, if your data already has the following values for 'day':

  • day = 2017-11-02
  • day = 2017-11-03
  • day = 2017-11-04

When you run the command MSCK REPAIR TABLE <table>, partitions for those three values will be added. The next day, assuming you receive data for day = 2017-11-05, running MSCK REPAIR TABLE <table> again will add a new partition for the new value:

  • day = 2017-11-05

Physically, adding a partition organizes the data into different folders on HDFS. You need write permissions on your data in order to create a partition. If your data is already partitioned, just check that each folder is named in the format "day=2017-11-02". That way, when you run the MSCK REPAIR command, the partitions are loaded as metadata automatically. I have always worked with external tables and have used this pipeline flawlessly.
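
A minimal sketch of that flow, assuming a table called events and a base path of /data/events (both placeholders):

-- One folder per day value under the table location, e.g.
--   /data/events/day=2017-11-02/file1.avro
--   /data/events/day=2017-11-03/file1.avro
CREATE EXTERNAL TABLE events (id INT, payload STRING)
PARTITIONED BY (day string)
STORED AS AVRO
LOCATION '/data/events';

-- Run this after new day folders arrive; Hive adds the matching partitions to the metastore
MSCK REPAIR TABLE events;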

Andrea
  • I can't use LOCATION 'path_hdfs' in CREATE EXTERNAL TABLE, because I don't have write permission on disk. – KrunalParmar Jan 25 '17 at 18:47
  • When I use the LOAD DATA command, I get the following exception: FAILED: SemanticException [Error 10062]: Need to specify partition columns because the destination table is partitioned. How can I specify the partition? – KrunalParmar Jan 25 '17 at 18:48
  • You have to declare the partitions in the creation query, and then periodically run MSCK REPAIR to pick up new partitions after you import new data (with new values in the partition columns). For example, if you partition by day, every day you have to run MSCK REPAIR to add the new day to the partitions. I have now updated my answer with the partitioning syntax.
    – Andrea Jan 26 '17 at 07:56

The syntax below should be helpful.

CREATE EXTERNAL TABLE table_name
    PARTITIONED BY (part string)
    ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    WITH SERDEPROPERTIES (
    'avro.schema.url'='hdfs:///path/to/avro/schema/')
    STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION 'path/to/location';

    ALTER TABLE table_name ADD PARTITION (part='part_name') LOCATION 'location';
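
For the table in the question, part string would be replaced by the six partition columns, i.e. PARTITIONED BY (t string, y string, m string, d string, h string, hh string), and one ADD PARTITION statement is needed per directory, roughly like this (values and path are illustrative):

ALTER TABLE table_name ADD PARTITION (t='type1', y='2016', m='02', d='03', h='03', hh='12')
LOCATION 'hdfs:///location/data/t=type1/y=2016/m=02/d=03/h=03/hh=12';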
isudarsan

You can mount the Avro data in Hive using the Avro schema:

CREATE TABLE dummy_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///pathtoschema/schema.avsc');

Coming to your question about loading the data into dummy_table, you can do it with:

LOAD DATA INPATH 'hdfs:///pathtoinputdata/inputdata' OVERWRITE INTO TABLE dummy_table;
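
Note that dummy_table above has no partitions. If the target table is partitioned, as in the question, LOAD DATA needs an explicit PARTITION clause (this is what the SemanticException in the comments above is about); a sketch with illustrative values:

LOAD DATA INPATH 'hdfs:///pathtoinputdata/inputdata' OVERWRITE INTO TABLE table_name
PARTITION (t='type1', y='2016', m='02', d='03', h='03', hh='12');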
Neha Kumari