
I am using Liberty 16.0.0.2 on Ubuntu x64. When I use REST to deploy remote Liberty Docker containers on another host, everything works; however, when I stop those remote containers manually (using the docker stop xxx command), the Admin Center still shows those containers as running, even after restarting the Collective Controller.

I have defined autoscaling for my Docker containers, and some of them are being stopped because of the policy, but some containers that are actually running are shown in the Admin Center as stopped. Here is the list of running containers:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
fb59f94cd25b        liberty_img         "/opt/ibm/wlp/bin/ser"   41 minutes ago      Up 41 minutes                           liberty_container11
5fd2d5858f60        liberty_img         "/opt/ibm/wlp/bin/ser"   42 minutes ago      Up 42 minutes                           liberty_container10
98117dac4f69        liberty_img         "/opt/ibm/wlp/bin/ser"   42 minutes ago      Up 42 minutes                           liberty_container9
cdce71905081        liberty_img         "/opt/ibm/wlp/bin/ser"   8 hours ago         Up 3 hours                              liberty_container6

And here is what the Admin Center shows me (note containers 5 and 11):

[Screenshot: Admin Center server list showing the mismatched container states]

How can this be fixed so that the controller discovers the proper state of my Docker containers?

The messages.log file is attached, but I do not see anything interesting there.

2 Answers


Since you're executing the docker stop command directly, the collective member is essentially being killed, so it never gets a chance to report to the controller that it is stopping. As a result, the controller keeps reporting the last known state of that server (the state the server itself last published; members push information to the controller). The same behavior applies to non-Docker Liberty members whose process is killed instead of being shut down properly.

If the Docker members are instead stopped through the serverCommands MBean (via Admin Center, Swagger, Java, jconsole, the JMX REST connector, etc.) or through the wlp/bin/server stop {memberName} command, you should not see this issue, because the member first reports to the controller that it is stopping. Since it pertains to your environment, I'll also note that if you want to stop, start, or restart an autoscaled server through Admin Center, you first need to either remove the autoScaling feature from that member or place that member into maintenance mode.
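
As a minimal illustration, a clean stop of one of the containers from the docker ps output above could look like the following; this assumes the Liberty server inside the container is named defaultServer (the usual default in Liberty Docker images; adjust the name if yours differs):

# tell the member to stop itself, so it reports "stopping" to the controller first
$ docker exec liberty_container11 /opt/ibm/wlp/bin/server stop defaultServer
# then stop the container; it may already have exited once the server stopped
$ docker stop liberty_container11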

To get your collective back to the correct state, you should only have to wait until the heartbeat timeout expires three times for the affected members (the default heartbeat interval is 60 seconds, so 3 minutes total), after which the controller should mark them as stopped, since it hasn't heard from them within the agreed amount of time. Alternatively, you can start the members back up and then stop them properly. Through Admin Center this means placing the stopped members into maintenance mode (since they're autoscaled), selecting 'restart' to start each member back up, then 'stop' to stop it, and finally removing maintenance mode.
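
If you prefer the command line to the Admin Center flow described above, a rough equivalent is sketched below; it assumes liberty_container5 is one of the members that is actually stopped but still shown as running, and again assumes a server named defaultServer inside the container:

# bring the killed member back so it re-registers its state with the controller
$ docker start liberty_container5
# give it time to start and publish a heartbeat (the default interval is 60 seconds)
$ sleep 90
# now stop it cleanly so the controller is told that it is stopping
$ docker exec liberty_container5 /opt/ibm/wlp/bin/server stop defaultServer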

M. Broz
  • I understand this logic, but I believe that after some time the server has to be marked as stopped, which does not happen in my case, even after several hours of waiting. Furthermore, one of the containers that was started via Liberty REST and is happily up and running is shown as stopped - see my picture above. These are two similar, but separate, issues. – Roman Kharkovski Jul 12 '16 at 21:26
  • The one that's running but reported as stopped could indicate a communication error between that member and the controller (the logs/messages would have more info). For the ones that are reported as running but are actually stopped, have you tried starting them again and then stopping them through the Liberty command/MBean? – M. Broz Jul 13 '16 at 01:44
  • You can verify that the issue is due to the wrong member state being stored in the controller's collective repository by dumping it. One way to do that is to connect jconsole to the controller and use Websphere/CollectiveRepository/collectiveController/CollectiveRepository/Operations/dump (a REST-connector sketch of the same call appears after these comments). The dump will show the state of each resource in the collective (this should match what you're seeing in Admin Center). – M. Broz Jul 13 '16 at 01:52
  • I did a dump of my Controller state. It does have a list of all of the collective members, but it does not reflect their real state. This time I did not do a manual stop on the members; I just deployed them using REST as Docker images, but from the very beginning the Controller shows only three out of the total of five Docker images. The Admin Center does not show the two other images at all, as if they do not even exist, yet I can see them in the Controller dump. The Controller dump shows those images as stopped, yet they are running. – Roman Kharkovski Aug 12 '16 at 14:42
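
For reference, the same dump can be triggered without jconsole through the Liberty JMX REST connector on the controller. This is only a sketch, not a verified recipe: the object name below is the one commonly used for the collective repository MBean, the credentials, host, and port are placeholders, and it assumes the dump operation takes no arguments, so check the MBean's actual signature in jconsole before relying on it.

$ curl -k -u adminUser:adminPassword -X POST \
    -H "Content-Type: application/json" \
    -d '{"params":[],"signature":[]}' \
    "https://controllerhost:9443/IBMJMXConnectorREST/mbeans/WebSphere%3Afeature%3DcollectiveController%2Ctype%3DCollectiveRepository%2Cname%3DCollectiveRepository/operations/dump"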

It is important to note that the controller reflects the last state it has learned from its members. So, if a member joins but is then unable to communicate with the controller, it will show as stopped, because the controller never received information from the member to the contrary. As far as the controller is concerned, it is stopped.

Regarding the members that don't appear in Admin Center but are in the repository, I suspect that what is in the repository is incomplete. I believe there was another reported issue in which reusing the same container name led to behavior like this: the join of the container to the controller actually failed because of the pre-existing data. Is this possible in your case?
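
If you want to rule out the name-reuse scenario, one quick check before redeploying is to ask the Docker host whether a container with that name already exists in any state (the name below is only an example). Note that this only covers the Docker side; stale entries in the collective repository itself would show up in the repository dump discussed in the comments above.

$ docker ps -a --filter "name=liberty_container5" --format "{{.Names}}: {{.Status}}"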

Steve Clay
  • To your first question: intuitively, I would expect the controller to try to reach back to the container after some time interval to see if it is back alive. That does not seem to be happening. As for the hidden members: I get a clean error message when I try to add containers with a duplicate name, but when something is slightly wrong with server.xml in the new container, it gets started but is not shown in the Admin GUI, yet it shows up in the controller dump. I think the GUI should show everything, perhaps with an error message, instead of hiding it entirely. – Roman Kharkovski Aug 15 '16 at 13:26
  • The controller does not reach down to the members as an automated process. Users can direct this type of action, but collectives are designed to be loosely coupled: members opt in, join the controller, and heartbeat to show their liveness. Your situation seems to be unique to your environment, and the debugging steps (dumping the repository) that you were working on previously should help reveal what state these hidden members are in. Thanks. – Steve Clay Aug 16 '16 at 16:12