I have log files whose names contain the date and hour, for example weblogs-20150101-010000.gz. Is there a way to extract the date and hour from the filename and add them as extra columns in Hive?

The method I know of is to append the date and hour to each line using a map-only job, but I am trying to see if there is an easier way via Hadoop streaming.

macha

1 Answer


If query performance is important and you'll be filtering by the date/hour, you could partition the data: place the files in folders whose names encode the date attributes, e.g. /path/to/your/data/year=2015/month=05/day=25/hour=14/, and then add those partitions to the Hive table.
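
A minimal sketch of that setup, assuming a hypothetical table name weblogs_partitioned and a single-column log schema (replace with the real columns):

CREATE EXTERNAL TABLE weblogs_partitioned (
  log_line STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING)
LOCATION '/path/to/your/data';

-- Register each folder as a partition, e.g.:
ALTER TABLE weblogs_partitioned ADD PARTITION (year='2015', month='05', day='25', hour='14')
  LOCATION '/path/to/your/data/year=2015/month=05/day=25/hour=14';

Because the folders follow Hive's key=value naming, MSCK REPAIR TABLE weblogs_partitioned can discover all of them in one pass instead of requiring one ALTER TABLE per folder.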

Another approach would be to use Hive's INPUT__FILE__NAME virtual column and filter using that, e.g.

SELECT * FROM WEBLOGS WHERE INPUT__FILE__NAME LIKE '%20150101-010000.gz'
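
That virtual column can also answer the original question directly: the date and hour can be pulled out as columns at query time. A sketch, assuming a table named weblogs over the files (the regular expression is the only other assumption):

SELECT
  w.*,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-(\\d{8})-\\d{6}\\.gz', 1) AS log_date,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-\\d{8}-(\\d{2})\\d{4}\\.gz', 1) AS log_hour
FROM weblogs w;

This avoids reorganizing the files, but the extraction runs on every query, and a filter on the file name still scans every file, which is why partitioning wins when the date/hour filter is common.
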
Alex Woolford
  • Hey Alex, thank you for your comment. I have close to 10,000 files. Is there an easy way to arrange them into folders? – macha May 25 '15 at 21:20
  • You could write a script to copy them into the appropriate folders. Perhaps someone else has a better solution. – Alex Woolford May 25 '15 at 21:22
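
A hedged alternative to a copy script, assuming the flat weblogs table and the partitioned table sketched above already exist and share the same non-partition columns: Hive's dynamic partitioning can rewrite the data into the year=/month=/day=/hour= folders itself, deriving the partition values from INPUT__FILE__NAME:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- ~10,000 hourly files means roughly as many partitions; the defaults (1000) are too low
SET hive.exec.max.dynamic.partitions=20000;
SET hive.exec.max.dynamic.partitions.pernode=20000;

INSERT OVERWRITE TABLE weblogs_partitioned PARTITION (year, month, day, hour)
SELECT
  w.*,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-(\\d{4})', 1)        AS year,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-\\d{4}(\\d{2})', 1)  AS month,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-\\d{6}(\\d{2})', 1)  AS day,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-\\d{8}-(\\d{2})', 1) AS hour
FROM weblogs w;

Note that this rewrites the data through a MapReduce job rather than moving the original .gz files; if the goal is only to relocate the files untouched, a short loop over hdfs dfs -mv achieves the same folder layout.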