I am using Spark to write data to Alluxio, with S3 as the UFS, into a partitioned Hive Parquet table. To make the write efficient, I call repartition on the Hive partition columns, which produces a single file in Alluxio, i.e. a single object in S3, per partition combination. Although Alluxio can read byte ranges from S3 using an offset, my understanding is that it eventually caches the whole file/object from S3. If the file size grows to TBs, this becomes a memory overhead for Alluxio. Please suggest how the file size can be controlled.
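
For reference, a minimal sketch of the write described above; the source table, the partition columns dt and country, and the Alluxio master address are hypothetical placeholders:

```scala
// Sketch of the described write path: repartitioning on the Hive partition
// columns sends all rows of a (dt, country) combination to one task, which
// yields a single Parquet file (one S3 object) per Hive partition.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("alluxio-partitioned-write")
  .getOrCreate()

val df = spark.table("staging.events")        // hypothetical source table

df.repartition(col("dt"), col("country"))
  .write
  .mode("overwrite")
  .partitionBy("dt", "country")
  .parquet("alluxio://master:19998/warehouse/events")  // Alluxio path backed by the S3 UFS
```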
1 Answer
"Although Alluxio can read byte ranges from S3 using an offset, my understanding is that it eventually caches the whole file/object from S3"
This statement is incorrect. Although an S3 object can be TBs in size, Alluxio caches objects at the granularity of Alluxio blocks (512 MB each by default). As a result, if your application only touches some bytes of an object, Alluxio caches just the blocks containing those bytes rather than all blocks of the object.
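
For illustration, here is a minimal sketch of a ranged read through the alluxio:// filesystem using the standard Hadoop FileSystem API; the path, offsets, and master address are hypothetical. Under block-granularity caching, only the blocks covering the requested range should end up cached:

```scala
// Read a 4 MB slice out of a much larger object via alluxio://.
// Only the Alluxio blocks covering this byte range should be cached,
// not the whole object.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("alluxio://master:19998/warehouse/events/part-00000.parquet")
val fs   = path.getFileSystem(new Configuration())

val in  = fs.open(path)
val buf = new Array[Byte](4 * 1024 * 1024)    // 4 MB buffer
in.seek(10L * 1024 * 1024 * 1024)             // start 10 GB into the object
in.readFully(buf)
in.close()
```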

apc999