16

I want to create a Hive table where the input text files are spread across multiple sub-directories in HDFS. For example, I have in HDFS:

    /testdata/user/Jan/part-0001
    /testdata/user/Feb/part-0001
    /testdata/user/Mar/part-0001
and so on...

If I want to create a table user in Hive, can it traverse the sub-directories of user and pick up those files? I tried something like this, but it doesn't work:

    CREATE EXTERNAL TABLE users (id int, name string)
    STORED AS TEXTFILE LOCATION '/testdata/user/*'

I thought adding the wildcard would work, but it doesn't. It also fails without the wildcard. However, if I copy the files into the root directory of user, it works. Is there no way for Hive to traverse into the child directories and grab those files?

– user706794

5 Answers

25

You can create an external table, then add subfolders as partitions.

    CREATE EXTERNAL TABLE test (id BIGINT) PARTITIONED BY (yymmdd STRING);
    ALTER TABLE test ADD PARTITION (yymmdd = '20120921') LOCATION 'loc1';
    ALTER TABLE test ADD PARTITION (yymmdd = '20120922') LOCATION 'loc2';
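Applied to the layout in the question, a minimal sketch might look like this (the users table, its columns, and the month directories are taken from the question; the month partition column name is an assumption):

    CREATE EXTERNAL TABLE users (id INT, name STRING)
    PARTITIONED BY (month STRING)
    STORED AS TEXTFILE
    LOCATION '/testdata/user/';

    ALTER TABLE users ADD PARTITION (month = 'Jan') LOCATION '/testdata/user/Jan';
    ALTER TABLE users ADD PARTITION (month = 'Feb') LOCATION '/testdata/user/Feb';
    ALTER TABLE users ADD PARTITION (month = 'Mar') LOCATION '/testdata/user/Mar';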
– OpenGG
  • This is an awesome response! Yes, you have the hassle of updating the table every time there is a new data directory (once a day in my case), but it works flawlessly. – Leonardo Brambilla May 21 '14 at 02:47
  • This is also one of the cool things about how Hive stores data. With the data stored this way, and represented as partitions, the partition data becomes a virtual column. That means when you query in a way that constrains the results to a subset of dates (in this case), Hive only has to search in a few places rather than scan the whole dataset for your answer (see the sketch after these comments). – agentv Mar 02 '15 at 16:14
  • This should be selected as the accepted answer - it perfectly answers the question. You've helped me out, too, so thanks for answering! – Steven Bakhtiari Nov 30 '15 at 10:22
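To illustrate the pruning described in the comments, a minimal sketch against the test table from this answer:

    -- Only the files under 'loc1' are scanned; the other
    -- partition's directory is never read.
    SELECT COUNT(*) FROM test WHERE yymmdd = '20120921';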
7

I ended up using a shell script like the one below for a use case where the sub-directories are not known a priori.

    #!/bin/bash
    # Create the partitioned table once.
    hive -e "CREATE EXTERNAL TABLE users (id int, name string) PARTITIONED BY (month string) STORED AS TEXTFILE LOCATION '/testdata/user/';"

    hscript=""

    # List the month sub-directories (Jan, Feb, ...) and build one ALTER statement per directory.
    for part in $(hadoop fs -ls /testdata/user/ | grep -v "^Found" | grep -o -P "[a-zA-Z]{3}$")
    do
        echo "$part"
        # Point each partition at its existing sub-directory; without a LOCATION
        # clause, Hive would expect the default layout /testdata/user/month=$part.
        tmp="ALTER TABLE users ADD PARTITION (month='$part') LOCATION '/testdata/user/$part';"
        hscript="$hscript$tmp"
    done

    hive -e "$hscript"
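Once the script has run, you can confirm that every sub-directory was registered with a standard SHOW PARTITIONS statement:

    hive -e "SHOW PARTITIONS users;"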
– Dhanesh
5

Hive uses subdirectories as partitions of the data, so simply:

    CREATE EXTERNAL TABLE users (id int, name string) PARTITIONED BY (month string)
    STORED AS TEXTFILE LOCATION '/testdata/user/'

That should do it for you. Note that declaring the partition column alone does not make the existing sub-directories visible; each one still has to be registered with ALTER TABLE users ADD PARTITION (as shown in the first answer) before its files are read.

– Steve Severance
0

Don't put * after the /testdata/user/ path, because Hive will take all sub-directories automatically.

If you want partitions, lay out the HDFS folders like /testdata/user/year=dynamicyear/month=dynamicmonth/date=dynamicdate

After creating the table with partitions, run MSCK REPAIR TABLE tablename (sketched below).

    CREATE EXTERNAL TABLE users (id int, name string)
    STORED AS TEXTFILE LOCATION '/testdata/user/'
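For the partitioned variant, a sketch assuming the year=/month=/date= layout above (the table name users_partitioned and the string column types are assumptions for illustration):

    CREATE EXTERNAL TABLE users_partitioned (id int, name string)
    PARTITIONED BY (`year` string, `month` string, `date` string)
    STORED AS TEXTFILE LOCATION '/testdata/user/';

    -- Scans the table location and registers every
    -- year=.../month=.../date=... directory as a partition.
    MSCK REPAIR TABLE users_partitioned;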
– AbsoluteSpace
0
    -- Note: LOAD DATA INPATH moves the matched files into the table's
    -- location on HDFS rather than copying them, so the original
    -- /testdata/user/ sub-directories will be emptied.
    CREATE EXTERNAL TABLE users (id int, name string);
    LOAD DATA INPATH "/testdata/user/*/*" INTO TABLE users;
– Kai Zhang