How to load data to same Hive table if file has different number of columns

Question

I have a main table (Employee) which is having 10 columns and I can load data into it using load data inpath /file1.txt into table Employee

My question is how to handle the same table (Employee) if my file file2.txt has same columns but column 3 and columns 5 are missing. if I directly load data last columns will be NULL NULL. but instead it should load 3rd as NULL and 5th column as NULL.

Suppose I have a table Employee and I want to load the file1.txt and file2.txt to table.

file1.txt
==========
id name sal deptid state coutry  
1  aaa  1000 01   TS   india  
2  bbb  2000 02   AP   india  
3  ccc  3000 03   BGL   india  


file2.txt  

id  name   deptid country  
1  second   001   US  
2  third    002   ENG  
3  forth    003   AUS

In file2.txt we are missing 2 columns i.e. sal and state.

we need to use the same Employee table how to handle it ?

Can you please paste sample data from both file ? – Sandeep Singh Sep 20 '16 at 16:06 — Sandeep Singh, Sep 20 '16 at 16:06

score 0 · Answer 1 · answered Sep 19 '16 at 20:07

I'm not aware of any way to create a table backed by data files with a non-homogenous structure. What you can do however, is to define separate tables for the different column configurations and then define a view that queries both.

I think it's easier if I provide an example. I will use two tables of people, both have a column for name, but one stores height as well, while the other stores weight instead:

> create table table1(name string, height int);
> insert into table1 values ('Alice', 178), ('Charlie', 185);

> create table table2(name string, weight int);
> insert into table2 values ('Bob', 98), ('Denise', 52);

> create view people as
>     select name, height, NULL as weight from table1
>   union all
>     select name, NULL as height, weight from table2;

> select * from people order by name;
+---------+--------+--------+
| name    | height | weight |
+---------+--------+--------+
| Alice   | 178    | NULL   |
| Bob     | NULL   | 98     |
| Charlie | 185    | NULL   |
| Denise  | NULL   | 52     |
+---------+--------+--------+

Or as a closer example to your problem, let's say that one table has name, height and weight, while the other only has name and weight, thereby height is "missing from the middle":

> create table table1(name string, height int, weight int);
> insert into table1 values ('Alice', 178, 55), ('Charlie', 185, 78);

> create table table2(name string, weight int);
> insert into table2 values ('Bob', 98), ('Denise', 52);

> create view people as
>     select name, height, weight from table1
>   union all
>     select name, NULL as height, weight from table2;

> select * from people order by name;
+---------+--------+--------+
| name    | height | weight |
+---------+--------+--------+
| Alice   | 178    | 55     |
| Bob     | NULL   | 98     |
| Charlie | 185    | 78     |
| Denise  | NULL   | 52     |
+---------+--------+--------+

Be sure to use union all and not just union, because the latter tries to remove duplicate rows, which makes it very expensive.

The way I read the question I would say: If you create a table instead of a view it is probably what is requested. I understand what you are doing in the second part, but I don't think that is what the asker needs. (just a guess of course) — Dennis Jaheruddin, Sep 21 '16 at 11:54
thankyou zoltan and jaheruddin for your inputs... i have edited the question please have a look... — Raj, Sep 23 '16 at 16:41
Hi Raj, I still think my answer tells you the best way to go forward. I came up with my own schema in the example because at that time you didn't provide yours, but you can easily apply my solution to your case. In my second example `height` behaves similarly to your `sal` and `state` columns - it is present in one source but not in the other. Simply define two separate tables over your two files and use a view definition to unify them. — Zoltan, Sep 23 '16 at 17:25
Thankyou Zoltan i agree with you, but just checking is with single table it is possible ? — Raj, Sep 23 '16 at 18:06
The only way I can imagine with a single table is if you first define two tables over the files using different schemas, then use `insert into ... select ...` to populate a third table with a query similar to the one above. This combined third table then becomes a single table that contains all data, but if your input files change later on, you have to empty the combined table and populate it again. — Zoltan, Sep 23 '16 at 19:14
A less painful way of repopulating the combined table is to use a temporary table, then drop the combined table and rename the temporary table to be the new combined table, thereby reducing the time window when data is not available for queries. Or you can append only the new data if your files are organized enough so that you can identify which ones are new. — Zoltan, Sep 23 '16 at 19:18

score 0 · Answer 2 · edited May 23 '17 at 12:19

It seems like there is no way to directly load into specified columns.

As such, this is what you probably need to do:

Load data inpath to a (temporary?) table that matches the file
Insert into relevant columns of final table by selecting the contents of the previous table.

The situation is very similar to this question which covers the opposite scenario (you only want to load a few columns).

How to load data to same Hive table if file has different number of columns

2 Answers2

Linked