4

On VMware ESXi 5.0.0 (vSphere Hypervisor, the free version) I have three server images, all running CentOS 6 Linux. All are configured to run the apcupsd ( http://www.apcupsd.org/ ) daemon for controlling APC UPSes.

One of the servers (master) is connected by USB cable to an APC CS 350 UPS. Its apcupsd is configured to make the netserver available on port 3551.

The two other (also virtualized) servers have apcupsd configured to retrieve the UPS status from master.

It works, but I see lots of warnings coming from apcupsd on the two slaves. In a terminal window I see entries saying

Broadcast message from root@slavehostname (Thu Nov 1 19:55:10 2012):

Warning communications lost with UPS masterhostname

Broadcast message from root@slavehostname (Thu Nov 1 19:55:47 2012):

Communications restored with UPS masterhostname

On the same day I see about 200 sets of lost/restored messages. They are a lot more frequent during the day than during the night.

I don't get any warnings on the master.

These servers have plenty of memory and CPU available to them, and practically no swapping takes place. I don't think they are starved, and they generally don't do very much work.

These are the master configuration settings (leaving out the EEPROM settings):

UPSCABLE usb
UPSTYPE usb
DEVICE
POLLTIME 10
LOCKFILE /var/lock
SCRIPTDIR /etc/apcupsd
PWRFAILDIR /etc/apcupsd
NOLOGINDIR /etc
ONBATTERYDELAY 6
BATTERYLEVEL 5
MINUTES 3
TIMEOUT 0
ANNOY 300
ANNOYDELAY 60
NOLOGON disable
KILLDELAY 0
NETSERVER on
NISIP 0.0.0.0
NISPORT 3551
EVENTSFILE /var/log/apcupsd.events
EVENTSFILEMAX 10
UPSCLASS standalone
UPSMODE disable
STATTIME 0
STATFILE /var/log/apcupsd.status
LOGSTATS off
DATATIME 0

And these are the slave settings:

UPSCABLE ether
UPSTYPE net
DEVICE 192.168.0.59:3551
POLLTIME 10
LOCKFILE /var/lock
SCRIPTDIR /etc/apcupsd
PWRFAILDIR /etc/apcupsd
NOLOGINDIR /etc
ONBATTERYDELAY 12
BATTERYLEVEL 10
MINUTES 7
TIMEOUT 0
ANNOY 300
ANNOYDELAY 60
NOLOGON disable
KILLDELAY 0
NETSERVER on
NISIP 0.0.0.0
NISPORT 3551
EVENTSFILE /var/log/apcupsd.events
EVENTSFILEMAX 10
UPSCLASS standalone
UPSMODE disable
STATTIME 20
STATFILE /var/log/apcupsd.status
LOGSTATS off
DATATIME 0

I would like to ask for help on how to move on from here. How do I debug this? Any suggestions on how I might have configured my servers in a way that could cause this?

Jbruntt

5 Answers

3

This doesn't fix the underlying problem, but it helps clean up the console a bit:

The script that outputs these messages is called apccontrol, and in my Ubuntu 12.04.02 LTS boxen it lives in /etc/apcupsd. It uses wall for all the messages.

But it also calls other scripts, if they exist in that directory, to do secondary handling, like emailing root every time there's a comms failure. You can turn that off by moving the script or changing it.

Also: if the other script exits with status code 99, then apccontrol will not call the default action, and you won't get spam on your wall.

I've just used it to push all the comms loss alerts into syslog instead of wall, and now it doesn't clutter up all my terminals that I'm trying to use. And I can put the polltime back down to the default of 60 so my slave box will still notice if the UPS kicks in.
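As a rough sketch of that approach (not the exact script I use): the helper below writes an event script that logs to syslog via `logger` and exits with status 99, so apccontrol skips its default wall broadcast. The function name `install_quiet_hook` and its arguments are made up for illustration; the only things taken from apcupsd itself are the script directory, the event script names, and the UPS name arriving as `$1`.

```shell
#!/bin/sh
# install_quiet_hook DIR EVENT MESSAGE   (hypothetical helper, not part of apcupsd)
# Writes DIR/EVENT, an event script that logs MESSAGE plus the UPS name
# (passed by apccontrol as $1) to syslog and then exits with status 99,
# which tells apccontrol not to run its default action for that event.
install_quiet_hook() {
    hook_dir=$1
    hook_event=$2
    hook_msg=$3
    # Unquoted heredoc: $hook_msg is expanded now, \$1 stays literal in the script.
    cat > "$hook_dir/$hook_event" <<EOF
#!/bin/sh
logger -t apcupsd "$hook_msg \$1"
exit 99
EOF
    chmod +x "$hook_dir/$hook_event"
}
```

Run as root, something like `install_quiet_hook /etc/apcupsd commfailure "Communications lost with UPS"` and `install_quiet_hook /etc/apcupsd commok "Communications restored with UPS"` would cover both messages. Note that this overwrites any existing script of the same name, so back up the originals first if you still want the emails.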

jpwarren
  • Specifically, in `apccontrol` I replaced the line `WALL=wall` with `WALL=logger` and now the messages appear in `/var/log/messages` (thankfully). – Atafar Nov 30 '16 at 14:46
1

I know that this is an old post but my experience may be of some use...

I originally powered my server through an APC BackUPS 650CS. This always worked well.

I upgraded to an APC BX1100CI-MS. This setup gave many problems: 'Communications lost' messages on the slave machine, and apcaccess often took five seconds or more to produce its output. Another oddity was that apcupsd reported 'power lost/power restored' status about three times a second, for several seconds, when the power went off. Worst of all, this setup demanded a battery change every two to three months. APC swapped the complete unit three times before surrendering and giving me a BackUPS Pro BR1200 in exchange.

This new setup has not produced a single 'Communications lost' message, only generates a single 'power lost' message, and apcaccess produces instantaneous output. I wait to see how the battery lasts.

My suspicion is that later APC models have changed the control protocol slightly and apcupsd doesn't cope.

Peter Bell
1

I have a similar problem on a slave which has a bad network connection to the master, so we get a lot of "Communication with UPS lost/restored" emails.

Since the underlying network problem cannot be fixed quickly in our case, I modified the scripts which send the emails, so that they don't send them if the interruption is short enough to be ignored.

The 2 scripts I changed are /etc/apcupsd/commfailure and /etc/apcupsd/commok.

This is /etc/apcupsd/commfailure :

#!/bin/sh
#
# This shell script if placed in /etc/apcupsd
# will be called by /etc/apcupsd/apccontrol when apcupsd
# loses contact with the UPS (i.e. the serial connection is not responding).
# We send an email message to root to notify him.
#

HOSTNAME=`hostname`
MSG="$HOSTNAME Communications with UPS $1 lost"

wait=12

# Wait $wait seconds, and only send email
# if still not online
sleep $wait
status=$(/sbin/apcaccess status)

if echo "$status" | grep -q '^STATUS.*COMMLOST' ; then
    logger -t apcupsd "commfailure for over $wait seconds. Sending mail"
    (
        echo "$MSG for over $wait seconds"
        echo "$status"
    ) | $APCUPSD_MAIL -s "$MSG" $SYSADMIN
    # ensure mail will also be sent on restore
    rm -f /run/apcupsd.nomail
else
    touch /run/apcupsd.nomail
    logger -t apcupsd "commfailure for less than $wait seconds. Created nomail file, and not sending any email."
fi

exit 0

And this is /etc/apcupsd/commok :

#!/bin/sh
#
# This shell script if placed in /etc/apcupsd
# will be called by /etc/apcupsd/apccontrol when apcupsd
# restores contact with the UPS (i.e. the serial connection is restored).
# We send an email message to root to notify him.
#

# wait at least as long as the wait defined in commfailure
# so that we know if we need to send mail or not

wait=13

sleep $wait

if [ -f /run/apcupsd.nomail ]; then
    logger -t apcupsd "Skip sending mail on restore (found nomail set by commfailure)"
    exit
fi

HOSTNAME=`hostname`
MSG="$HOSTNAME Communications with UPS $1 restored"
#
(
   echo "$MSG"
   echo " "
   /sbin/apcaccess status
) | $APCUPSD_MAIL -s "$MSG" $SYSADMIN
exit 0
mivk
0

I'm experiencing the same thing. It looks like a bug in apcupsd. Try increasing POLLTIME on the slave; this dramatically decreases the error rate.
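For illustration, the change in the slave's apcupsd.conf would look something like this. The value 60 is only an example (the question uses 10); pick whatever interval is acceptable for noticing a power failure, and restart apcupsd afterwards.

```
# Slave apcupsd.conf: poll the master less often (example value)
POLLTIME 60
```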

Andrey Regentov
0

In the config file of the server connected to the UPS

"UPSCLASS standalone"

should probably be

"UPSCLASS netmaster"