1

In the Cloudera blog or in the Hortonworks forum I read:

"Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory"

BUT:

10,000,000 * 150 = 1,500,000,000 bytes = 1.5 GB.

It looks like reaching 3 GB would require about 300 bytes per file. I don't understand why 300 bytes are used for each file instead of 150. It's just the NameNode; there should not be any replication factor involved.

Thanks

grep

1 Answer

2

For every small file, the namenode needs to store two objects in memory: a per-file object and a per-block object. This results in approximately 300 bytes per file.
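A quick back-of-the-envelope sketch of that estimate in Python (the 150 bytes per object is just the rule-of-thumb figure from the quote, not an exact measurement):

```python
# Rule-of-thumb estimate of NameNode heap usage.
# Assumption: each file object and each block object costs roughly 150 bytes.
BYTES_PER_OBJECT = 150

def namenode_memory_bytes(num_files, num_blocks):
    """One in-memory object per file plus one per block."""
    return (num_files + num_blocks) * BYTES_PER_OBJECT

# 10 million small files, each fitting in a single block:
files = 10_000_000
blocks = 10_000_000
print(namenode_memory_bytes(files, blocks) / 10**9)  # ~3.0 GB
```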

gudok
  • What's the difference between the file object and the block object? Where can I read more about this? – grep Jun 29 '19 at 05:32
  • 2
    There is no such concept as an "object" in Hadoop. It is just a rough explanation that can be used for size estimation. For every file, the namenode needs to store its filename, access rights, list of its block IDs, etc. -- combined, on average this occupies 150 bytes. And for every block, the namenode stores its size, status and locations in the cluster. Again, combined, all this information requires approximately 150 bytes. Hence a file of 10 blocks occupies ~(1+10)*150 = 1650 bytes of namenode memory, while 10 files of 1 block each occupy (1+1)*10*150 = 3000 bytes. This is much worse. – gudok Jun 29 '19 at 05:53
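To make the comparison in that comment concrete, a small sketch under the same ~150-byte-per-object assumption:

```python
def estimate_bytes(num_files, blocks_per_file, bytes_per_object=150):
    """(file objects + block objects) * ~150 bytes each."""
    return (num_files + num_files * blocks_per_file) * bytes_per_object

print(estimate_bytes(1, 10))   # one file of 10 blocks:  (1 + 10) * 150 = 1650
print(estimate_bytes(10, 1))   # ten 1-block files:     (10 + 10) * 150 = 3000
```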