25

I have a hive external tables that mapped to some directory. This directory includes a several files.

I want to run query like find file name where there is a user "abc"

 select file_name , usr from usrs_tables where usr = "abc"

But of course the data doesn't includes file name inside.

In MapReduce I can do it by

FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();
System.out.println("File name "+filename);
System.out.println("Directory and File name"+fileSplit.getPath().toString());

How can I do it in Hive?

Julias
  • 5,752
  • 17
  • 59
  • 84

3 Answers3

57

Yes, you can retrieve the file the record was found in using the virtual column named INPUT__FILE__NAME, for example:

select INPUT__FILE__NAME, id, name from users where ...;

yields something like:

hdfs://localhost.localdomain:8020/user/hive/warehouse/users/users1.txt    2    user2
hdfs://localhost.localdomain:8020/user/hive/warehouse/users/users2.txt    42    john.doe

If necessary, use the provided string functions to trim the host and directories from the uri.

You can find the documentation on virtual columns here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns

Nigel Tufnel
  • 11,146
  • 4
  • 35
  • 31
jkovacs
  • 3,470
  • 1
  • 23
  • 24
  • Thanks a lot!. It is very helpful! Especially, if we search something out of huge number of log files created as external hive table, we will get to know the matching file names to get further insights. – Sakthivel Oct 20 '16 at 17:30
  • 1
    @jkovacs is there an equivalent of INPUT__FILE__NAME in Presto? – Swaranga Sarma Feb 07 '17 at 10:00
8

Every table in Hive has two virtual columns. They are

  1. INPUT__FILE__NAME
  2. BLOCK__OFFSET__INSIDE__FILE

INPUT__FILE__NAME gives the name of the file. BLOCK__OFFSET__INSIDE__FILE is the current global file position. Suppose if we want to find the name of the file corresponding to each record in a file. We can use the INPUT__FILE__NAME column. This feature is available from Hive versions above 0.8. A small example is given below.

Query

select INPUT__FILE__NAME, name from customer_data;

This will give us the file name corresponding to each record. If you want to get the file names corresponding to a hive table, the below query will help you.

select distinct(INPUT__FILE__NAME) from customer_data;
Martin Zabel
  • 3,589
  • 3
  • 19
  • 34
Maheswaran M
  • 81
  • 1
  • 1
4

Hive 0.8.0 provides support for two virtual columns:

One is INPUT__FILE__NAME, which is the input file's name for a mapper task.

the other is BLOCK__OFFSET__INSIDE__FILE, which is the current global file position.

For block compressed file, it is the current block's file offset, which is the current block's first byte's file offset.

Since Hive 0.8.0 the following virtual columns have been added:

  • ROW__OFFSET__INSIDE__BLOCK
  • RAW__DATA__SIZE
  • ROW__ID
  • GROUPING__ID

Updated link on virtual columns in hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns

sumitya
  • 2,631
  • 1
  • 19
  • 32
david1977
  • 739
  • 6
  • 6