I have log files whose names contain the date and hour, for example weblogs-20150101-010000.gz. Is there a way to extract the date and hour from the filename and add them as extra columns in Hive?

The method I know of is to append the date and hour to each line using a map-only job, but I am trying to see if there is an easier way via Hadoop streaming.

macha

1 Answer


If query performance is important and you'll be filtering by the date/hour, you could partition the data: place the files in folders whose names encode the date attributes, e.g. /path/to/your/data/year=2015/month=05/day=25/hour=14/, and then add those partitions to the Hive table.
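
A minimal sketch of that setup, assuming a hypothetical table name weblogs_partitioned and a single-column log schema (replace with the real columns):

CREATE EXTERNAL TABLE weblogs_partitioned (
  log_line STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING)
LOCATION '/path/to/your/data';

-- Register each folder as a partition, e.g.:
ALTER TABLE weblogs_partitioned ADD PARTITION (year='2015', month='05', day='25', hour='14')
  LOCATION '/path/to/your/data/year=2015/month=05/day=25/hour=14';

Because the folders follow Hive's key=value naming, MSCK REPAIR TABLE weblogs_partitioned can discover all of them in one pass instead of requiring one ALTER TABLE per folder.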

Another approach would be to use Hive's INPUT__FILE__NAME virtual column and filter using that, e.g.

SELECT * FROM WEBLOGS WHERE INPUT__FILE__NAME LIKE '%20150101-010000.gz'
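
That virtual column can also answer the original question directly: the date and hour can be pulled out as columns at query time. A sketch, assuming a table named weblogs over the files (the regular expression is the only other assumption):

SELECT
  w.*,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-(\\d{8})-\\d{6}\\.gz', 1) AS log_date,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-\\d{8}-(\\d{2})\\d{4}\\.gz', 1) AS log_hour
FROM weblogs w;

This avoids reorganizing the files, but the extraction runs on every query, and a filter on the file name still scans every file, which is why partitioning wins when the date/hour filter is common.
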
Alex Woolford
  • Hey Alex, thank you for your comment. I have close to 10,000 files. Is there an easy way to arrange them into folders? – macha May 25 '15 at 21:20
  • You could write a script to copy them into the appropriate folders. Perhaps someone else has a better solution. – Alex Woolford May 25 '15 at 21:22
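
A hedged alternative to a copy script, assuming the flat weblogs table and the partitioned table sketched above already exist and share the same non-partition columns: Hive's dynamic partitioning can rewrite the data into the year=/month=/day=/hour= folders itself, deriving the partition values from INPUT__FILE__NAME:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- ~10,000 hourly files means roughly as many partitions; the defaults (1000) are too low
SET hive.exec.max.dynamic.partitions=20000;
SET hive.exec.max.dynamic.partitions.pernode=20000;

INSERT OVERWRITE TABLE weblogs_partitioned PARTITION (year, month, day, hour)
SELECT
  w.*,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-(\\d{4})', 1)        AS year,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-\\d{4}(\\d{2})', 1)  AS month,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-\\d{6}(\\d{2})', 1)  AS day,
  regexp_extract(INPUT__FILE__NAME, 'weblogs-\\d{8}-(\\d{2})', 1) AS hour
FROM weblogs w;

Note that this rewrites the data through a MapReduce job rather than moving the original .gz files; if the goal is only to relocate the files untouched, a short loop over hdfs dfs -mv achieves the same folder layout.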