
I have a Hive table to which data gets added every day, roughly 5 files per day, and we have now ended up with about 800 part files under this table.

The issue is that joining or otherwise using this table triggers 800 mappers, since the number of mappers is proportional to the number of files.

But my jobs do need to read the entire table.

Is there a way to use the entire table without triggering so many mappers?

The files look like this:

-rw-rw-r--   3 XXXX hdfs     106610 2015-12-15 05:39 /apps/hive/warehouse/prod.db/TABLE1/000000_0_copy_1.deflate
-rw-rw-r--   3 XXXX hdfs     106602 2015-12-23 12:31 /apps/hive/warehouse/prod.db/TABLE1/000000_0_copy_10.deflate
-rw-rw-r--   3 XXXX hdfs     157686 2016-03-06 05:20 /apps/hive/warehouse/prod.db/TABLE1/000000_0_copy_100.deflate
-rw-rw-r--   3 XXXX hdfs     163580 2016-03-07 05:22 /apps/hive/warehouse/prod.db/TABLE1/000000_0_copy_101.deflate
sushma

1 Answer


I would prefer to partition the table so that the data is stored in partition directories. When a query filters on the partition columns, only the files under the matching partitions are read, and only that many mappers are triggered.
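As a rough sketch of that approach (the table name TABLE1_PART, the load_dt partition column, and the other column names are all assumptions, not taken from your schema):

```sql
-- Hypothetical example: repartition the data by load date.
CREATE TABLE prod.TABLE1_PART (
  id     BIGINT,
  value  STRING
)
PARTITIONED BY (load_dt STRING);

-- Allow dynamic partitioning for the one-off migration.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Copy the existing data into the partitioned layout.
INSERT OVERWRITE TABLE prod.TABLE1_PART PARTITION (load_dt)
SELECT id, value, load_dt FROM prod.TABLE1;

-- A query filtering on the partition column now reads only
-- the directories for that date, not all 800 files.
SELECT COUNT(*) FROM prod.TABLE1_PART WHERE load_dt = '2016-03-07';
```

Note this only helps when queries filter on the partition column; a full-table scan still reads every partition.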

Another option is to bucket the table, using the CLUSTER BY clause on insert to distribute the data into a fixed number of bucket files, which reduces the number of files, and hence mappers, involved when querying.
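A minimal sketch of the bucketing approach, again with assumed table and column names; the bucket count (32 here) is arbitrary and caps the number of files written per insert:

```sql
-- Hypothetical example: 32 buckets keyed on the join column.
CREATE TABLE prod.TABLE1_BUCKETED (
  id     BIGINT,
  value  STRING
)
CLUSTERED BY (id) INTO 32 BUCKETS;

-- Older Hive versions need this so inserts honor the bucket definition.
SET hive.enforce.bucketing = true;

-- Rewrite the data; CLUSTER BY distributes rows into the buckets.
INSERT OVERWRITE TABLE prod.TABLE1_BUCKETED
SELECT id, value FROM prod.TABLE1
CLUSTER BY id;
```

Bucketing on the join key has the side benefit that Hive can use bucket map joins when both sides are bucketed the same way.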

SrinR