
The table defined below returns no data when I run a select statement against it:

CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV/2016/1/27/*/part-*';

I need Hive to point to a dynamic folder, so that as a MapReduce job puts a part file in a folder, Hive loads it into the table.

Is there any way the location can be made dynamic, like

/user/data/CSV/*/*/*/*/part-*

or would just /user/data/CSV/* do fine?

(The same code works fine when the table is created as an internal table and loaded with the file path, hence there are no issues due to formatting.)


4 Answers


First off, your table definition is missing columns. Second, an external table's location always points to a folder, not to particular files. Hive will consider all files in the folder to be data for the table.

If you have data that is generated, e.g., on a daily basis by some external process, you should consider partitioning your table by date. Then you add a new partition to the table whenever the data is available, as sketched below.
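A minimal sketch of that approach, keeping the question's semicolon delimiter; the column names and the partition column `dt` are placeholders, not from the question:

    CREATE EXTERNAL TABLE foo (
      -- placeholder columns; replace with your actual schema
      col1 STRING,
      col2 STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\073'
    LINES TERMINATED BY '\n';

    -- register each day's folder as it becomes available
    ALTER TABLE foo ADD PARTITION (dt='2016-01-27')
    LOCATION '/user/data/CSV/2016/1/27';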

– LiMuBei

Hive does not iterate through multiple folders. Hence, for the above scenario, I ran a command that iterates through these folders, cats (prints to the console) all the part files, and then puts the result in a desired location (the one Hive points to):

    hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - <destination folder>
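Note that `hadoop fs -put -` reads from standard input, and the destination should generally name the target file rather than a folder. For example (the merged path here is hypothetical):

    hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - /user/data/merged/all_parts.csv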

You can use an ALTER TABLE statement to change the locations. In the example below, partitions are based on dates, with data stored in time-dependent file locations. If I want to query many days, I have to add an ALTER TABLE statement for each location. This idea may extend to your situation quite well: you create a script that generates the statements below using some other technology, such as Python.

 CREATE EXTERNAL TABLE foo (
   -- your columns here
 )
 PARTITIONED BY (`date` STRING)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\073'
 LINES TERMINATED BY '\n';
 ALTER TABLE foo ADD PARTITION (`date`='20160201') LOCATION '/user/data/CSV/20160201/data';
 ALTER TABLE foo ADD PARTITION (`date`='20160202') LOCATION '/user/data/CSV/20160202/data';
 ALTER TABLE foo ADD PARTITION (`date`='20160203') LOCATION '/user/data/CSV/20160203/data';
 ALTER TABLE foo ADD PARTITION (`date`='20160204') LOCATION '/user/data/CSV/20160204/data';

You can use as many ADD and DROP statements as you need to define your locations (a DROP example is sketched below). Then your table can find data held in many locations in HDFS rather than having all your files in one location.
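For instance, removing a day you no longer want scanned, against the table above:

    ALTER TABLE foo DROP PARTITION (`date`='20160201');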

You may also be able to leverage a

create table like

statement to create a schema like the one you have in another table, then alter the new table to point at the files you want.
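A minimal sketch of that, assuming foo already exists; the new table name and path are hypothetical:

    CREATE EXTERNAL TABLE foo_copy LIKE foo;
    ALTER TABLE foo_copy SET LOCATION '/user/data/CSV/2016/1/27';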

I know this isn't exactly what you want and is more of a workaround. Good luck!

– invoketheshell
  • Thanks invoketheshell, I'll give your idea a shot. Currently I am running an automated script that gets the files from this folder (date "+%Y/%m/%d") and puts them in a separate folder which Hive points to. – Arvind Ravichandran Feb 04 '16 at 17:12

This line

     LOCATION '/user/data/CSV/2016/1/27/*/part-*';

does not look correct; I don't think a table can be created from multiple locations. Have you tried importing from just a single location to confirm this?

It could also be that the delimiter you're using is not correct. If you are importing your data from a CSV file, try delimiting by ','.
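That is, the ROW FORMAT clause from the question would change to:

    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','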

– DPEZ