
The table defined below returns no data when I run a select statement against it:

CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV/2016/1/27/*/part-*';

I need Hive to point to a dynamic folder, so that as a MapReduce job puts a part file in a folder, Hive loads it into the table.

Is there any way the location can be made dynamic, like

/user/data/CSV/*/*/*/*/part-*

or would just /user/data/CSV/* do fine?

(The same code works fine when the table is created as an internal table and loaded with the file path, hence there are no issues due to formatting.)


4 Answers


First off, your table definition is missing columns. Second, an external table's location always points to a folder, not to particular files. Hive will consider all files in the folder to be data for the table.

If you have data that is generated, e.g., on a daily basis by some external process, you should consider partitioning your table by date. Then you add a new partition to the table whenever the data is available, as sketched below.
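A minimal sketch of that approach, keeping the question's semicolon delimiter; the column names and the partition column `dt` are placeholders, not from the question:

    CREATE EXTERNAL TABLE foo (
      -- placeholder columns; replace with your actual schema
      col1 STRING,
      col2 STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\073'
    LINES TERMINATED BY '\n';

    -- register each day's folder as it becomes available
    ALTER TABLE foo ADD PARTITION (dt='2016-01-27')
    LOCATION '/user/data/CSV/2016/1/27';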

– LiMuBei

Hive does not iterate through multiple folders. Hence, for the above scenario, I ran a command that iterates through these folders, cats (prints to the console) all the part files, and then puts the result in a desired location (the one Hive points to):

    hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - <destination folder>
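Note that `hadoop fs -put -` reads from standard input, and the destination should generally name the target file rather than a folder. For example (the merged path here is hypothetical):

    hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - /user/data/merged/all_parts.csv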

You can use an ALTER TABLE statement to change the locations. In the example below, partitions are based on dates, with data stored in time-dependent file locations. If I want to query many days, I have to add an ALTER TABLE statement for each location. This idea may extend to your situation quite well: you create a script that generates the statements below using some other technology, such as Python.

 CREATE EXTERNAL TABLE foo (
   -- your columns here
 )
 PARTITIONED BY (`date` STRING)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\073'
 LINES TERMINATED BY '\n';
 ALTER TABLE foo ADD PARTITION (`date`='20160201') LOCATION '/user/data/CSV/20160201/data';
 ALTER TABLE foo ADD PARTITION (`date`='20160202') LOCATION '/user/data/CSV/20160202/data';
 ALTER TABLE foo ADD PARTITION (`date`='20160203') LOCATION '/user/data/CSV/20160203/data';
 ALTER TABLE foo ADD PARTITION (`date`='20160204') LOCATION '/user/data/CSV/20160204/data';

You can use as many ADD and DROP statements as you need to define your locations (a DROP example is sketched below). Then your table can find data held in many locations in HDFS rather than having all your files in one location.
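For instance, removing a day you no longer want scanned, against the table above:

    ALTER TABLE foo DROP PARTITION (`date`='20160201');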

You may also be able to leverage a

create table like

statement to create a schema like the one you have in another table, then alter the new table to point at the files you want.
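A minimal sketch of that, assuming foo already exists; the new table name and path are hypothetical:

    CREATE EXTERNAL TABLE foo_copy LIKE foo;
    ALTER TABLE foo_copy SET LOCATION '/user/data/CSV/2016/1/27';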

I know this isn't exactly what you want and is more of a workaround. Good luck!

– invoketheshell
  • Thanks invoketheshell, I'll give your idea a shot. Currently I am running an automated script that gets the files from this folder (date "+%Y/%m/%d") and puts them in a separate folder which Hive points to. – Arvind Ravichandran Feb 04 '16 at 17:12

This line

     LOCATION '/user/data/CSV/2016/1/27/*/part-*';

does not look correct; I don't think a table can be created from multiple locations. Have you tried importing from just a single location to confirm this?

It could also be that the delimiter you're using is not correct. If you are importing your data from a CSV file, try delimiting by ','.
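That is, the ROW FORMAT clause from the question would change to:

    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','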

– DPEZ