
I have searched everywhere but can't find any thread that matches my issue. I have a datanode in the Hadoop framework that recently went bad because all the drives on that box got unmounted for some unknown reason. These drives are mounted on directories that reside directly under "/". Since the Hadoop processes were still running, they kept writing to these directories after the drives were unmounted, so the data landed on the root filesystem instead of on the separate drives. Root filled up, and the Hadoop-related services stopped for lack of space. I have now mounted all the drives back and cleaned all the old data on them, but my root is still showing 100%. Here is the situation:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       3.6T  3.4T  140M 100% /
tmpfs            24G     0   24G   0% /dev/shm
/dev/sda1       239M   60M  167M  27% /boot
/dev/sdb1       3.6T   15G  3.4T   1% /data-1
/dev/sdc1       3.6T   16G  3.4T   1% /data-2
/dev/sdd1       3.6T   16G  3.4T   1% /data-3
/dev/sde1       3.6T   16G  3.4T   1% /data-4
/dev/sdf1       3.6T   15G  3.4T   1% /data-5
/dev/sdg1       3.6T   15G  3.4T   1% /data-6
/dev/sdh1       3.6T   16G  3.4T   1% /data-7
/dev/sdi1       3.6T   15G  3.4T   1% /data-8
/dev/sdj1       3.6T   15G  3.4T   1% /data-9
/dev/sdk1       3.6T   15G  3.4T   1% /data-10
/dev/sdl1       3.6T   16G  3.4T   1% /data-11
cm_processes     24G  512K   24G   1% /var/run/cloudera-scm-agent/process 

I have read all the threads about a process still writing to the old directory, but that does not apply in my case.

[root@server /]# du -sh ./*
7.7M    ./bin
58M     ./boot
15G     ./data-1
15G     ./data-10
16G     ./data-11
16G     ./data-2
16G     ./data-3
15G     ./data-4
15G     ./data-5
15G     ./data-6
16G     ./data-7
15G     ./data-8
15G     ./data-9
264K    ./dev
30M     ./etc
18M     ./files
132K    ./home
260M    ./lib
23M     ./lib64
16K     ./lost+found
4.0K    ./media
4.0K    ./mnt
3.7G    ./opt
du: cannot access `./proc/19763/task/19763/fd/4': No such file or directory
du: cannot access `./proc/19763/task/19763/fdinfo/4': No such file or directory
du: cannot access `./proc/19763/fd/4': No such file or directory
du: cannot access `./proc/19763/fdinfo/4': No such file or directory
0       ./proc
112K    ./root
14M     ./sbin
4.0K    ./selinux
4.0K    ./srv
0       ./sys
176K    ./tmp
2.2G    ./usr
808M    ./var
[root@server /]# lsof | grep 'deleted'

This command returns nothing. I also recycled the server, but it had no effect. Thanks for your help.
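
One thing worth noting about the `du -sh ./*` listing above: it descends into the mounted `/data-*` drives, so its figures mix the root disk with the data disks. A minimal sketch, assuming GNU coreutils, of how the used space reported by `df` could be compared with what `du -x` (stay on one filesystem) can actually see. The helper name `fs_gap_kb` is made up for illustration and is not from this thread.

```shell
#!/bin/sh
# Sketch only: compare "used" as reported by df with what du can see
# on the same filesystem. fs_gap_kb is a hypothetical helper name.

fs_gap_kb() {
    dir=${1:-/}
    # df -Pk: POSIX output format in 1K blocks; column 3 of the data row is "Used".
    used=$(df -Pk "$dir" | awk 'NR==2 {print $3}')
    # du -x: do not cross into other mounted filesystems; -s: one total; -k: 1K units.
    # Errors from vanished or unreadable entries are discarded; the total is still printed.
    seen=$(du -xsk "$dir" 2>/dev/null | awk '{print $1}')
    echo "df_used_kb=$used du_visible_kb=$seen gap_kb=$((used - seen))"
}
```

Running `fs_gap_kb /` as root would print the three numbers on one line. A large positive gap would suggest space that is allocated on the root filesystem but not reachable through the visible directory tree.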

MadHatter
  • You might try unmounting the `/data-*` FSes, and look under the mount points: it's quite possible that someone wrote a bunch of stuff to one of those directories while the volume wasn't mounted, in which case you wouldn't currently be able to see it. – MadHatter Nov 13 '15 at 16:25
  • Thanks MadHatter, I did even that but couldn't find anything except the directories that are supposed to be there. – S.Ahmad Nov 13 '15 at 16:41
  • This doesn't directly solve your problem; however, to mitigate this in the future, consider setting `chattr +i` on the mount point folders. Do this *before* mounting, i.e. on the folders that actually live on "/". Any processes that want to write to these paths will then be unable to do so, including root processes, because the "i" attribute makes the paths immutable. Such processes will then likely terminate (which is good). Upon successful mounting, the properties of the mounted file system overrule any previous +i attributes on the mount point folder and the entire path becomes writeable again. – parkamark Nov 13 '15 at 16:55
  • Any chance of a huge dotdir in the root dir? Could you add the results of `ls -al /`? – MadHatter Nov 13 '15 at 17:06
  • @parkamark: I like your idea and will look into it, since I have to stand up a 40-node cluster in the near future. – S.Ahmad Nov 13 '15 at 20:15
  • @MadHatter: this command did not reveal anything. – S.Ahmad Nov 13 '15 at 20:23
  • Could we see the output, please? – MadHatter Nov 14 '15 at 12:17
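
The suggestion in the comments to look under the mount points can also be tried without unmounting anything, by bind-mounting "/" somewhere else: a plain (non-recursive) bind mount does not carry the submounts along, so the directories on the root disk show through. This is only a rough sketch, assuming Linux with util-linux `mount` and GNU `du`, run as root. The names `peek_under_mounts`, `run`, `DRY_RUN`, and the scratch path `/mnt/rootview` are invented for illustration.

```shell
#!/bin/sh
# Sketch only: inspect what lives on the root disk underneath /data-*.
# peek_under_mounts, run, DRY_RUN and /mnt/rootview are hypothetical names.

# Run a command, or just print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

peek_under_mounts() {
    view=${1:-/mnt/rootview}
    run mkdir -p "$view" || return 1
    # Non-recursive bind mount: submounts such as /data-1 are NOT included,
    # so $view/data-1 is the underlying directory on the root filesystem.
    run mount --bind / "$view" || return 1
    # -x keeps du on the bind-mounted root filesystem only.
    run du -xsh "$view"/data-*
    run umount "$view"
}
```

Calling `DRY_RUN=1 peek_under_mounts` first only prints the commands, which makes it easy to review them before running anything for real as root.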

2 Answers


Linux does not really delete a file while a process keeps it open. If you can, reboot your machine and all lost space should be reclaimed.

shodanshok
  • The OP already said (s)he did that (at least, I'm assuming that's what "*recycled the server*" means). – MadHatter Nov 14 '15 at 12:17
  • recycled = rebooted. I tried to copy in the output of the above command, but the editor does not allow it, saying it is too big. By the way, I am running RHEL 6.5 as my OS. – S.Ahmad Nov 16 '15 at 15:08

[root@server /]# ls -al /
total 158
dr-xr-xr-x. 34 root root 4096 Nov 13 12:00 .
dr-xr-xr-x. 34 root root 4096 Nov 13 12:00 ..
-rw-r--r-- 1 root root 0 Nov 13 12:00 .autofsck
-rw-r--r-- 1 root root 0 May 29 10:53 .autorelabel
dr-xr-xr-x. 2 root root 4096 Nov 2 03:48 bin
dr-xr-xr-x. 5 root root 1024 Nov 12 14:11 boot
drwxr-xr-x. 6 root root 4096 Nov 12 14:12 data-1
drwxr-xr-x. 6 root root 4096 Nov 12 14:12 data-10
drwxr-xr-x. 6 root root 4096 Nov 13 11:31 data-11
drwxr-xr-x. 6 root root 4096 Nov 12 14:12 data-2
drwxr-xr-x. 6 root root 4096 Nov 12 14:12 data-3
drwxr-xr-x. 6 root root 4096 Nov 12 14:12 data-4
drwxr-xr-x. 6 root root 4096 Nov 12 14:12 data-5
drwxr-xr-x. 6 root root 4096 Nov 12 14:12 data-6
drwxr-xr-x. 6 root root 4096 Nov 12 14:12 data-7
drwxr-xr-x. 6 root root 4096 Nov 12 14:12 data-8
drwxr-xr-x. 6 root root 4096 Nov 12 14:12 data-9
drwxr-xr-x 17 root root 4220 Nov 13 12:00 dev
drwxr-xr-x. 105 root root 12288 Nov 13 12:00 etc
drwxr-xr-x 2 root root 4096 Nov 12 14:40 files
drwxr-xr-x. 10 root root 4096 Sep 2 13:32 home
dr-xr-xr-x. 11 root root 4096 Nov 1 11:27 lib
dr-xr-xr-x. 9 root root 12288 Nov 2 03:48 lib64
drwx------. 2 root root 16384 May 29 10:43 lost+found
drwxr-xr-x. 2 root root 4096 Jun 28 2011 media
drwxr-xr-x. 2 root root 4096 Jun 28 2011 mnt
drwxr-xr-x. 5 root root 4096 Sep 26 2011 opt
dr-xr-xr-x 438 root root 0 Nov 13 07:00 proc
dr-xr-x---. 4 root root 4096 Nov 6 15:24 root
dr-xr-xr-x. 2 root root 12288 Jun 24 03:32 sbin
drwxr-xr-x. 2 root root 4096 May 29 10:45 selinux
drwxr-xr-x. 2 root root 4096 Jun 28 2011 srv
drwxr-xr-x 13 root root 0 Nov 13 07:00 sys
drwxrwxrwt. 5 root root 4096 Nov 13 15:19 tmp
drwxr-xr-x. 14 root root 4096 Jun 15 14:48 usr
drwxr-xr-x. 20 root root 4096 Jun 15 14:50 var