
I am trying to get the size of a file from HDFS using Python 3.5 and the hdfs library.

https://pypi.python.org/pypi/hdfs/

from hdfs.client import Client
if __name__ == '__main__':

    cl = Client("http://hostName:50070")

    print (cl.content("/path/to/file/fileName.txt",False))

I get:

{'spaceQuota': -1, 'directoryCount': 0, 'spaceConsumed': 103566, 'length': 34522, 'quota': -1, 'fileCount': 1}

So, as per this output, the file size is 103 KB.

But when I look at http://hostName:50070/explorer.html#/path/to/file/

I see that the file size is 33.71 KB! How is this possible? Is there another way to get the proper size of a file in HDFS? And how about the size of a directory?

AbtPst

2 Answers


What you are seeing is correct.

Note the length parameter, which shows a value close to the 33.71 KB you expect to see. Length is defined in the Hadoop docs as the number of bytes in the file; spaceConsumed is how much disk space the file takes up.

These don't necessarily agree, because of things like block size and overhead in the filesystem (I'm not familiar enough with Hadoop to know the precise reason in your case).
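To illustrate the distinction, here is a minimal sketch using the values copied from the question's output (no cluster needed); length is the logical size shown by the web UI, while spaceConsumed is the on-disk footprint:

    # Dict as returned by hdfs Client.content(), copied from the question
    content = {'spaceQuota': -1, 'directoryCount': 0, 'spaceConsumed': 103566,
               'length': 34522, 'quota': -1, 'fileCount': 1}

    file_size_bytes = content['length']        # actual bytes in the file
    disk_bytes = content['spaceConsumed']      # bytes consumed on disk

    print(round(file_size_bytes / 1024, 2))    # 33.71, matching the web UI
    print(round(disk_bytes / 1024, 2))         # 101.14, the "103 KB" figure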

mbrig

The actual file size is 33.71 KB, and the size on HDFS is 103 KB. The HDFS replication factor is 3, which means the size on HDFS becomes 3 × actual_file_size.

Vikas Ranjan