I have a folder (around 2 TB in size) in HDFS, which was created using the save method from Apache Spark. Its blocks are almost evenly distributed across the nodes (I checked this using hdfs fsck).
When I distcp this folder (intra-cluster) and run hdfs fsck on the destination folder, the destination turns out to be highly skewed: a few nodes hold a lot of blocks, whereas a few other nodes hold very few. This skewness in HDFS is causing performance issues.
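To put a number on the skewness described above, here is a minimal sketch of how the fsck output could be summarized per datanode. It assumes the report was produced with hdfs fsck <path> -files -blocks -locations and that each block line lists its replica locations as IP:port pairs inside square brackets; the function names replicas_per_datanode and skew_ratio are hypothetical helpers of mine, not part of Hadoop, and the exact line layout may differ between Hadoop versions.

```python
import re
import statistics
from collections import Counter

# Assumption: replica locations appear as "IP:port" inside the bracketed
# list at the end of each block line of the fsck report.
REPLICA_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):\d+")


def replicas_per_datanode(fsck_output):
    """Count block replicas per datanode IP in an fsck report."""
    counts = Counter()
    for line in fsck_output.splitlines():
        # Only block lines carry locations; skip file headers and summaries.
        if "blk_" not in line or "[" not in line:
            continue
        # Look only at the bracketed location list, so the IP embedded in
        # the block pool ID (BP-...) is never counted.
        locations = line.split("[", 1)[1]
        counts.update(REPLICA_RE.findall(locations))
    return counts


def skew_ratio(counts):
    """Return max replicas on one node divided by the mean across nodes."""
    if not counts:
        return 0.0
    return max(counts.values()) / statistics.mean(counts.values())
```

Running this on the fsck report of the source folder and again on the destination gives two ratios to compare; a value close to 1.0 means an even spread. Datanodes that hold no replicas at all do not appear in the report, so they are not included in the mean.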
We also tried moving the data from source to destination (intra-cluster) using mv, and this time the destination was not skewed; that is, the data was evenly distributed.
Is there any way to reduce the skewness in HDFS when using distcp?