
I am trying to get the size of a file from HDFS using Python 3.5 and the hdfs library.

https://pypi.python.org/pypi/hdfs/

from hdfs.client import Client
if __name__ == '__main__':

    cl = Client("http://hostName:50070")

    print (cl.content("/path/to/file/fileName.txt",False))

I get:

{'spaceQuota': -1, 'directoryCount': 0, 'spaceConsumed': 103566, 'length': 34522, 'quota': -1, 'fileCount': 1}

So, as per this output, the file size is 103 KB.

But when I look at http://hostName:50070/explorer.html#/path/to/file/

I see that the file size is 33.71 KB! How is this possible? Is there another way to get the proper size of a file in HDFS? And how about the size of a directory?

AbtPst

2 Answers


What you are seeing is correct.

Note the length parameter, which shows a value close to the 33.71 KB you expect to see. Length is defined in the Hadoop docs as the number of bytes in the file; spaceConsumed is how much disk space the file takes up.

These don't necessarily agree, because of things like block size and overhead in the filesystem (I'm not familiar enough with Hadoop to know the precise reason in your case).
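To illustrate the distinction, here is a minimal sketch using the values copied from the question's output (no cluster needed); length is the logical size shown by the web UI, while spaceConsumed is the on-disk footprint:

    # Dict as returned by hdfs Client.content(), copied from the question
    content = {'spaceQuota': -1, 'directoryCount': 0, 'spaceConsumed': 103566,
               'length': 34522, 'quota': -1, 'fileCount': 1}

    file_size_bytes = content['length']        # actual bytes in the file
    disk_bytes = content['spaceConsumed']      # bytes consumed on disk

    print(round(file_size_bytes / 1024, 2))    # 33.71, matching the web UI
    print(round(disk_bytes / 1024, 2))         # 101.14, the "103 KB" figure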

mbrig

The actual file size is 33.71 KB, and the size on HDFS is 103 KB. The HDFS replication factor is 3, which means the size on HDFS becomes 3 × actual_file_size.

Vikas Ranjan