
I have data organized in directories in a particular format (shown below) and want to add it to a Hive table. I want to load all the data under the 2012 directory. All the names below are directory names, and the innermost directories (third level) hold the actual data files. Is there any way to read the data directly without changing this directory structure? Any pointers are appreciated.

/2012/
|
|---------2012-01
            |---------2012-01-01
            |---------2012-01-02
            |...
            |...
            |---------2012-01-31
|
|---------2012-02
            |---------2012-02-01
            |---------2012-02-02
            |...
            |...
            |---------2012-02-28
|
|---------2012-03
|...
|...
|---------2012-12

Queries tried so far without luck:

CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/path/to/data/2012/*/*'; 

CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
partitioned by (ystr string, ymstr string, ymdstr string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

ALTER TABLE sampledata
ADD 
PARTITION (ystr ='2012') 
LOCATION '/path/to/data/2012/';

SOLUTION: This one parameter fixed my issue. Adding it to the question in case it helps others:

SET mapred.input.dir.recursive=true;
Yash Sharma
  • Which one of the CREATE TABLE statements above worked for you? I am facing the same issue: my data is spread across multiple directories. – pratpor Jun 09 '16 at 05:49
  • Use the first create query over the topmost data dir – Yash Sharma Jun 09 '16 at 06:32
  • Did the wildcard in LOCATION work for you? I can't get it to work. It creates the schema but returns no results when I query the table. – nir Aug 03 '17 at 22:38

4 Answers

11

Answering my own question with the solution that worked in my case:

SET mapred.input.dir.recursive=true;
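For readers who want the whole sequence in one place, here is a minimal sketch of how the setting might be combined with the first CREATE statement from the question. It assumes, based on the comments under the question, that LOCATION points at the topmost 2012 directory without wildcards; the path and table name are the placeholders used in the question, and the final query is only a hypothetical sanity check.

```sql
-- Let the input format descend into nested directories
-- (the setting reported to work in this answer).
SET mapred.input.dir.recursive=true;

-- Assumption: LOCATION is the top-level year directory, with no wildcards.
CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/path/to/data/2012/';

-- Hypothetical check: a non-zero count suggests the nested files are being read.
SELECT COUNT(*) FROM sampledata;
```

The SET applies only to the current session, so it would need to be issued again in each new session that queries the table.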

Yash Sharma
1
ALTER TABLE sampledata
ADD 
PARTITION (ystr ='2012', ymstr='2012-01', ymdstr='2012-01-01') 
LOCATION '/path/to/data/2012/2012-01/2012-01-01';
pensz
  • This will give me data only for the 2012-01-01 directory. I need all the data in the super-parent dir 2012, for all the days for 2012. – Yash Sharma Dec 24 '13 at 07:29
  • When you query `select * from sampledata where ystr = '2012'`, Hive will use all the subdirectories as input. – pensz Dec 24 '13 at 09:12
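The partition approach discussed in this answer can be sketched as follows, assuming the partitioned table from the question and assuming every day directory is registered as its own partition. Only two days are shown; the remaining days of 2012 would need one PARTITION clause each.

```sql
-- Register one partition per day directory (two days shown as an illustration).
ALTER TABLE sampledata ADD IF NOT EXISTS
PARTITION (ystr='2012', ymstr='2012-01', ymdstr='2012-01-01')
LOCATION '/path/to/data/2012/2012-01/2012-01-01'
PARTITION (ystr='2012', ymstr='2012-01', ymdstr='2012-01-02')
LOCATION '/path/to/data/2012/2012-01/2012-01-02';

-- Filtering on the year column reads every registered partition for 2012.
SELECT * FROM sampledata WHERE ystr = '2012';
```

This keeps the directory layout unchanged and allows partition pruning by month or day, at the cost of one partition entry per day directory.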
1
SET hive.mapred.supports.subdirectories=true;
SET mapred.input.dir.recursive=true;
  • Please take the time to read [how-to-answer](http://stackoverflow.com/help/how-to-answer). Posting only code is not the best answer. – thewaywewere May 06 '17 at 19:48
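As context for the two settings in this answer: on Hadoop 2 and later, `mapred.input.dir.recursive` is the deprecated name of `mapreduce.input.fileinputformat.input.dir.recursive`. A sketch of the same configuration using the newer name, on the assumption that the cluster runs Hadoop 2 or later:

```sql
-- Tell Hive that the execution engine can handle subdirectories under a table or partition.
SET hive.mapred.supports.subdirectories=true;

-- Newer name of mapred.input.dir.recursive (assumes Hadoop 2 or later).
SET mapreduce.input.fileinputformat.input.dir.recursive=true;
```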
0

The following worked on Hortonworks:

alter table .... set tblproperties (
    "hive.input.dir.recursive" = "TRUE",
    "hive.mapred.supports.subdirectories" = "TRUE",
    "hive.supports.subdirectories" = "TRUE",
    "mapred.input.dir.recursive" = "TRUE");
enjoy