
I have an HDFS directory structure like this:

/home/date_1/A/file.txt
/home/date_1/B/file.txt
/home/date_2/A/file.txt
/home/date_2/B/file.txt
...

I can create an external table

CREATE EXTERNAL TABLE table_name(col1 int, col2 string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/home';

But I don't know how to introduce the folder information 'A' or 'B' into the table. What can I do? Thanks!

C. Luo

2 Answers


In Hive you have virtual columns which you can use to read the underlying file name. INPUT__FILE__NAME gives you the full path of the file each row was read from.

So first create the external table (as you have done). Then, when you query it, you can use the virtual column and split out the parts of the path you need, as below:

SELECT
  col1,
  col2,
  INPUT__FILE__NAME AS full_filepath,
  -- reverse the path, split on '/', take the first two segments
  -- (file name and its parent folder), then reverse each back
  concat_ws('/',
            reverse(split(reverse(INPUT__FILE__NAME), '/')[1]),
            reverse(split(reverse(INPUT__FILE__NAME), '/')[0])) AS split_filepath
FROM table_name;
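
For a row read from /home/date_1/A/file.txt, that query would return something like the following (note that INPUT__FILE__NAME typically includes the filesystem scheme, e.g. an hdfs:// prefix; the namenode part below is a placeholder):

full_filepath    hdfs://<namenode>/home/date_1/A/file.txt
split_filepath   A/file.txt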

More on virtual columns in Hive.
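If you want the folder names as separate columns rather than a combined path, the same reverse/split trick extends naturally. This is a sketch against the directory layout from the question; the column aliases are my own:

SELECT
  col1,
  col2,
  reverse(split(reverse(INPUT__FILE__NAME), '/')[2]) AS date_folder, -- e.g. date_1
  reverse(split(reverse(INPUT__FILE__NAME), '/')[1]) AS subfolder    -- e.g. A or B
FROM table_name;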

Rishu Shrivastava

Are you using MapReduce as the Hive execution engine? If so, you can simply direct the framework to traverse all the subdirectories:

-- allow the input format to recurse into nested directories
SET mapreduce.input.fileinputformat.input.dir.recursive=true;
-- tell Hive that tables may be backed by subdirectories
SET hive.mapred.supports.subdirectories=true;

SELECT COUNT(1) FROM table_name;
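
With those settings in place, Hive reads files such as /home/date_1/A/file.txt even though they sit two levels below the table's LOCATION; without them, queries over nested directories typically return no rows or fail with "not a file" errors. Note that this alone does not expose the folder names as data; combine it with the INPUT__FILE__NAME approach from the other answer if you need the 'A'/'B' information in your results.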
davidemm