
Background:

I have some gzip files in an HDFS directory. These files are named in the format yyyy-mm-dd-000001.gz, yyyy-mm-dd-000002.gz, and so on.

Aim:

I want to build a Hive script which produces a table with two columns: Column 1 - date (yyyy-mm-dd), Column 2 - total file size.

To be specific, I would like to sum up the sizes of all of the gzip files for a particular date. That sum becomes the value in Column 2, and the date goes in Column 1.

Is this possible? Are there any built-in functions or UDFs that could help with my use case?

Thanks in advance!

activelearner

1 Answer


A MapReduce job for this doesn't seem efficient, since you don't actually have to load any data: the sizes you need are already in the filesystem metadata. Plus, doing this in Hive seems kind of awkward.

Can you write a Bash or Python script, or something like that, to parse the output of hadoop fs -ls? I'd imagine something like this:

$ hadoop fs -ls mydir/*gz | python datecount.py | hadoop fs -put - counts.txt
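For illustration, here is a minimal sketch of what datecount.py could look like. The script name comes from the pipeline above; the eight-field hadoop fs -ls line layout (perms, replication, owner, group, size, date, time, path) and the filename pattern are assumptions based on the question, so adjust the field indices if your Hadoop version prints the listing differently:

#!/usr/bin/env python
# datecount.py - sum up gzip file sizes per date from `hadoop fs -ls` output.
import re
import sys
from collections import defaultdict

# Matches the yyyy-mm-dd prefix of names like 2015-06-01-000001.gz.
DATE_RE = re.compile(r'(\d{4}-\d{2}-\d{2})-\d+\.gz$')

totals = defaultdict(int)
for line in sys.stdin:
    fields = line.split()
    # A typical `hadoop fs -ls` line has 8 fields:
    #   perms replication owner group size date time path
    # Skip the "Found N items" header and anything else that is too short.
    if len(fields) < 8:
        continue
    size, path = fields[4], fields[7]
    match = DATE_RE.search(path)
    if match:
        totals[match.group(1)] += int(size)

# Emit one tab-separated line per date: "yyyy-mm-dd<TAB>total_bytes".
for date in sorted(totals):
    sys.stdout.write('%s\t%d\n' % (date, totals[date]))

The trailing hadoop fs -put - counts.txt reads the script's output from stdin and writes it to counts.txt in HDFS. If you still want to query the result with Hive, you could then define an external table over that tab-delimited file.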
Donald Miner