
Background:

I have some gzip files in an HDFS directory. These files are named in the format yyyy-mm-dd-000001.gz, yyyy-mm-dd-000002.gz, and so on.

Aim:

I want to build a Hive script which produces a table with two columns: Column 1 - date (yyyy-mm-dd), Column 2 - total file size.

To be specific, I would like to sum up the sizes of all of the gzip files for a particular date. That sum becomes the value in Column 2, and the date goes in Column 1.

Is this possible? Are there any built-in functions or UDFs that could help with my use case?

Thanks in advance!

activelearner

1 Answer


A MapReduce job for this doesn't seem efficient, since you don't actually have to load any data: the sizes you need are already in the filesystem metadata. Plus, doing this in Hive seems kind of awkward.

Can you write a Bash or Python script, or something like that, to parse the output of hadoop fs -ls? I'd imagine something like this:

$ hadoop fs -ls mydir/*gz | python datecount.py | hadoop fs -put - counts.txt
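For illustration, here is a minimal sketch of what datecount.py could look like. The script name comes from the pipeline above; the eight-field hadoop fs -ls line layout (perms, replication, owner, group, size, date, time, path) and the filename pattern are assumptions based on the question, so adjust the field indices if your Hadoop version prints the listing differently:

#!/usr/bin/env python
# datecount.py - sum up gzip file sizes per date from `hadoop fs -ls` output.
import re
import sys
from collections import defaultdict

# Matches the yyyy-mm-dd prefix of names like 2015-06-01-000001.gz.
DATE_RE = re.compile(r'(\d{4}-\d{2}-\d{2})-\d+\.gz$')

totals = defaultdict(int)
for line in sys.stdin:
    fields = line.split()
    # A typical `hadoop fs -ls` line has 8 fields:
    #   perms replication owner group size date time path
    # Skip the "Found N items" header and anything else that is too short.
    if len(fields) < 8:
        continue
    size, path = fields[4], fields[7]
    match = DATE_RE.search(path)
    if match:
        totals[match.group(1)] += int(size)

# Emit one tab-separated line per date: "yyyy-mm-dd<TAB>total_bytes".
for date in sorted(totals):
    sys.stdout.write('%s\t%d\n' % (date, totals[date]))

The trailing hadoop fs -put - counts.txt reads the script's output from stdin and writes it to counts.txt in HDFS. If you still want to query the result with Hive, you could then define an external table over that tab-delimited file.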
Donald Miner