We have an IBM HPC cluster with two admin nodes (RHEL 6.4 x64). Both are attached to an NFS server for shared objects such as home directories.

A few days ago, we had an incident and now we have a strange problem on one of the admin servers.

The problem: when I log into the affected admin server as a normal user (not root), /etc/profile, /etc/profile.d/*.sh, /etc/bashrc, .bashrc and .bash_profile are not executed. I end up with a limited shell with no customized PS1 (just -bash-4.1$), a short PATH (/usr/local/bin:/bin:/usr/bin), and the env command shows only a few variables:

-bash-4.1$ env
TERM=xterm
SHELL=/bin/bash
SSH_CLIENT=10.81.234.8 42548 22
SSH_TTY=/dev/pts/0
USER=testuser4
MAIL=/var/mail/testuser4
PATH=/usr/local/bin:/bin:/usr/bin
PWD=/home/testuser4
LANG=fr_FR.UTF-8
SHLVL=1
HOME=/home/testuser4
LOGNAME=testuser4
SSH_CONNECTION=10.81.234.8 42548 172.16.33.201 22
_=/bin/env

As root, there is no problem. And if I source /etc/profile as a normal user on the affected server, it works and I get the whole environment back.

On the second admin server everything is fine, for root and for normal users:

[testuser4@hpcadmin2 ~]$ echo $PATH
/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/share/xcat/tools:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/pcm/bin:/opt/pcm/sbin:/opt/pcm/web-portal/gui/3.0/bin:/opt/pcm/web-portal/perf/1.2/bin:/usr/bin:/bin:/usr/local/bin:/local/bin:/sbin:/usr/sbin:/usr/ucb:/usr/sbin:/usr/bsd:/shared/ibm/platform_lsf/9.1/linux2.6-glibc2.3-x86_64/etc:/shared/ibm/platform_lsf/9.1/linux2.6-glibc2.3-x86_64/bin:/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/share/xcat/tools:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin::/home/testuser4/bin

No error messages appear in /var/log/messages. I am stuck: I did not find any useful solution on the net, and I don't understand why only normal users are affected.

I verified the access rights and the sizes of these files, and they are the same on both servers.
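
For anyone repeating that check, here is a minimal sketch of one way to list the mode, owner, and size of the startup files so the output can be diffed between the two servers. The function name report_files is hypothetical, and the sketch assumes GNU stat (the one shipped with RHEL).

```shell
#!/bin/bash
# Hypothetical helper: print "mode owner:group size name" for each file given,
# or "missing <name>" when the file does not exist.
report_files() {
    local f
    for f in "$@"; do
        if [ -e "$f" ]; then
            # %a = octal mode, %U:%G = owner and group, %s = size in bytes, %n = name
            stat -c '%a %U:%G %s %n' "$f"
        else
            echo "missing $f"
        fi
    done
}
```

Running `report_files /etc/profile /etc/profile.d/*.sh /etc/bashrc ~/.bashrc ~/.bash_profile` on each server and comparing the two outputs shows any difference in these files, though it says nothing about the binaries they call.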

Regards.

Wodel

1 Answer

The problem is resolved; it was a file permissions problem.

While searching for the cause, I launched the login command (login -p) as a normal user on both servers. On the first one it died with errors; on the second it prompted me to log in. That was the clue: I compared the file permissions on both servers and discovered the problem.

Almost all files in /bin had broken permissions. I compared them with the second node, wrote a small shell script to correct them, and voilà, problem solved.
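
The script itself is not reproduced here. As a rough sketch of the same idea (not the exact script I used), the modes can be recorded on the healthy node and re-applied on the broken one. The function names and the /tmp/bin.modes path are hypothetical, and GNU find is assumed. Only permission bits are handled; ownership would need a separate check.

```shell
#!/bin/bash
# Print "octal-mode path" for every non-symlink entry directly inside a directory.
# Symlinks are skipped because chmod would follow them and change the target.
snapshot_modes() {
    find "$1" -mindepth 1 -maxdepth 1 ! -type l -printf '%m %p\n' | LC_ALL=C sort -k2
}

# Re-apply modes from a snapshot read on stdin (lines of "octal-mode path").
# Paths that no longer exist are ignored; file names containing newlines are not supported.
restore_modes() {
    local mode path
    while read -r mode path; do
        if [ -e "$path" ]; then
            chmod "$mode" "$path"
        fi
    done
}

# On the healthy node:
#   snapshot_modes /bin > /tmp/bin.modes
# On the affected node, review the differences first, then apply as root:
#   diff <(snapshot_modes /bin) /tmp/bin.modes
#   restore_modes < /tmp/bin.modes
```

Reviewing the diff before restoring matters, because the two nodes may legitimately differ in a few entries.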

Regards.

Wodel