
Every time I use `hadoop fs -ls /path_to_directory` or `hadoop fs -ls -h /path_to_directory`, the result looks like the following:

drwxr-xr-x   - hadoop supergroup          0 2016-08-05 00:22 /user/hive-0.13.1/warehouse/t_b_city
drwxr-xr-x   - hadoop supergroup          0 2016-06-15 16:28 /user/hive-0.13.1/warehouse/t_b_mobile

The size of a directory inside HDFS is always shown as 0, no matter whether there are files within it or not.

Browsing from the web UI gives the same result:

drwxr-xr-x  hadoop  supergroup  0 B 0   0 B t_b_city
drwxr-xr-x  hadoop  supergroup  0 B 0   0 B t_b_mobile

However, there are actually files within those directories. When using the command `hadoop fs -du -h /user/hive-0.13.1/warehouse/`, the directory sizes are shown correctly:

385.5 K   /user/hive-0.13.1/warehouse/t_b_city
1.1 M     /user/hive-0.13.1/warehouse/t_b_mobile

Why do the `hadoop fs -ls` command and the web UI always show 0 for a directory?

Also, the `hadoop fs -ls` command usually finishes immediately, while `hadoop fs -du` takes some time to execute. It seems that `hadoop fs -ls` doesn't actually spend any time calculating the total size of a directory.
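The same distinction exists on a local Linux filesystem, which makes the behavior easier to see: `ls -ld` reports only the directory entry's own size, while `du` walks the tree and sums the files. A sketch (the temp directory and `data.bin` file below are made up for illustration):

```shell
# Create a scratch directory holding one 100 KB file.
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/data.bin" bs=1024 count=100 2>/dev/null

ls -ld "$tmp"   # shows the directory entry's own size (small, e.g. 4096)
du -sk "$tmp"   # walks the tree and sums file sizes (at least 100 KB)

rm -rf "$tmp"
```

Listing is cheap because it reads one metadata entry; summing is not, because it must visit every descendant — the same reason `hadoop fs -du` takes longer than `hadoop fs -ls`.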

Heyang Wang
  • When you run a `ls -l` command on Linux, the "size" displayed for directories is not related to the size of the files inside. So why did you expect HDFS to work differently??? – Samson Scharfrichter Aug 16 '16 at 15:53
  • BTW, the NameNode stores the whole filesystem information in RAM and not on disk, therefore a directory entry requires zero bytes on disk. On the other hand Linux filesystems require a few disk segments to persist each directory *(list of `inodes`, permissions etc)* – Samson Scharfrichter Aug 16 '16 at 15:57
  • Thanks. It seems my understanding of the `ls` command has long been wrong. I took it for granted that `ls` would show sizes for both files and directories. – Heyang Wang Aug 16 '16 at 16:25
  • Again, the size of a directory is the size of the **directory object**. Just like the size of a file is the size of a file. Full stop. – Samson Scharfrichter Aug 16 '16 at 17:08

2 Answers


It is working as designed. Hadoop is built for big files, and one should not expect it to compute the size of every directory each time the `hadoop fs -ls` command is run. If it worked the way you want, consider it from the point of view of another user who just wants to check whether a directory exists, but ends up waiting a long time because Hadoop is busy calculating the folder's total size; not so good.

abhiieor
  • Your explanation makes sense, and I rechecked the description of the `-ls` command at [link](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#ls). The description only says the command will give the size of a file, not of a directory. – Heyang Wang Aug 15 '16 at 13:23

Try using a wildcard with the `du` option so that all the files under a database are listed with their sizes. The only catch is that you need multiple levels of wildcard pattern matching so that every level under the parent directory is covered.

hadoop fs -du -h /hive_warehouse/db/*/* > /home/list_du.txt
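Once the sizes are captured, the list can be piped through `sort` to surface the largest directories first. A sketch (the `printf` lines below are made-up stand-ins for real cluster output; against a live cluster you would pipe `hadoop fs -du /hive_warehouse/db/*` directly):

```shell
# Sample `hadoop fs -du` output lines (size in bytes, then path), sorted
# largest-first; `head` keeps the top entry.
printf '394752 /hive_warehouse/db/t_b_city\n1153434 /hive_warehouse/db/t_b_mobile\n' \
  | sort -rn | head -n 1
```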