
I have created a new DataStax Enterprise cluster that is managed using OpsCenter. All versions used are the latest available from the package repository. The agents have been installed and everything is working perfectly, including RAM Usage, CPU Load, etc. I have added over 90 GB of data to this cluster without a problem, and the hosts can support a lot more.

It is clearly an OpsCenter / DataStax Agent issue from what I can see. I do not see a relevant line in the log files of either OpsCenter or the DataStax Agent. Other clusters in the same OpsCenter instance work without a problem.

Any idea on what might be the problem?

Storage Capacity not working

Update #1: The df(1) output on one host is:

Filesystem     Type     1K-blocks     Used Available Use% Mounted on
udev           devtmpfs  16440732        4  16440728   1% /dev
tmpfs          tmpfs      3290304      652   3289652   1% /run
/dev/sda6      ext4     921095148 33460384 840822760   4% /
none           tmpfs            4        0         4   0% /sys/fs/cgroup
none           tmpfs         5120        0      5120   0% /run/lock
none           tmpfs     16451516        0  16451516   0% /run/shm
none           tmpfs       102400        0    102400   0% /run/user
/dev/sda1      ext2        240972    67121    161410  30% /boot

and on another host:

Filesystem     Type     1K-blocks     Used Available Use% Mounted on
udev           devtmpfs  16367904        4  16367900   1% /dev
tmpfs          tmpfs      3275852      728   3275124   1% /run
/dev/md1       ext4     958985688 92799452 817449468  11% /
none           tmpfs            4        0         4   0% /sys/fs/cgroup
none           tmpfs         5120        0      5120   0% /run/lock
none           tmpfs     16379256        0  16379256   0% /run/shm
none           tmpfs       102400        0    102400   0% /run/user
/dev/md0       ext3       1014680   105884    856420  12% /boot

Output of https://<host>:<port>/<Cluster-Name>/storage-capacity:

{"free_gb": 0, "used_gb": 0, "reporting_nodes": 3}
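For anyone checking the same endpoint, a minimal sketch that extracts the two counters from the JSON response; the host, port, and cluster name are placeholders, and the literal response below is the one from this question:

```shell
# Sketch: pull free_gb/used_gb out of the storage-capacity response.
# On a live cluster replace the literal with:
#   resp=$(curl -s "http://<host>:<port>/<Cluster-Name>/storage-capacity")
resp='{"free_gb": 0, "used_gb": 0, "reporting_nodes": 3}'
free=$(printf '%s' "$resp" | sed -n 's/.*"free_gb": \([0-9]*\).*/\1/p')
used=$(printf '%s' "$resp" | sed -n 's/.*"used_gb": \([0-9]*\).*/\1/p')
if [ "$free" -eq 0 ] && [ "$used" -eq 0 ]; then
    echo "storage-capacity reports zero despite nodes reporting"
fi
```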
DaKnOb
  • Sorry, can you provide the output of `df --print-type --no-sync --local`? – Chris Lohfink Mar 16 '16 at 15:11
  • What does navigating to `http://<host>:<port>/<Cluster-Name>/storage-capacity` show in a browser or curl? An example from my configuration: `http://localhost:8888/Test_Cluster/storage-capacity`, which outputs: `{free_gb: 398, used_gb: 66, reporting_nodes: 1}` – Joel Quiles Mar 25 '16 at 20:26
  • @quilesbaker Post edited. It shows 0, so it's not a UI issue then. – DaKnOb Mar 27 '16 at 14:05
  • Exactly. There's no exception on the backend either (at least at that level); that would show reporting_nodes as 0. Will definitely keep you posted as soon as I'm able to reproduce on my setup. – Joel Quiles Mar 28 '16 at 13:35
  • @quilesbaker Thanks a lot. Maybe try with software RAID-0 or weird RAID controllers? Maybe try with multiple partitions (/boot, /, /test). It should fail at some point. – DaKnOb Mar 29 '16 at 19:09
  • There's a bug in OpsCenter @DaKnOb. If you run `df <file>`, you should get a different filesystem than if you run `df --print-type --no-sync --local`. That's what I believe causes the bug. In my case, where I'm able to replicate, `df /home/<user>/random-folder` yields `/dev/disk/by-uuid/<uuid>` under the filesystem/mounted-on column. – Joel Quiles Apr 05 '16 at 20:13
  • For a temporary fix, while we fix this for the next release, make sure you mount (in GRUB?) the drive used for the data using a label instead of a UUID. That is, if your issue is caused by this, of course. Both `df` outputs must match (for now). – Joel Quiles Apr 05 '16 at 22:36
  • The disks are all mounted `by-uuid`. You are right; I never thought this could cause problems with OpsCenter. Feel free to post this as an answer so I can accept it. :-) – DaKnOb Apr 06 '16 at 09:13

2 Answers


The Data Size metric is the value returned as the node's load (the same value shown under "Load:" when running `nodetool info`).

Storage capacity actually checks disk usage, on Linux by running `df` (this probably doesn't work at all on some versions of Windows, so if you are using Windows, that's your issue). There have been a number of issues with this, but the most recent versions include some fixes, so make sure you're on a new version. Check the agent's log (`/var/log/datastax-agent/agent.log`) for something along the lines of "Process failed", which may give more details.
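A quick way to scan for that message (the log path is the one above; the sample log line here is illustrative, not from a real agent):

```shell
# Scan the agent log for failed process invocations (case-insensitive).
log=/var/log/datastax-agent/agent.log
# Stand-in log so the snippet runs anywhere; on a real node delete the
# next two lines and keep the path above.
log=$(mktemp)
printf 'INFO GET /storage-capacity\nERROR Process failed: df\n' > "$log"
grep -ic "process failed" "$log"   # prints the number of matching lines: 1
```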

Chris Lohfink
  • Thanks for the reply! The `nodetool info` command does indeed return a load, which can be seen in its output as well as in OpsCenter. Running `df` shows the disk output normally. The cluster is running Linux. The log is fine (basically GETs) and doesn't contain anything with the word "process" in it. OpsCenter and DSE are on the latest version. – DaKnOb Mar 16 '16 at 09:58
  • Can you include your `df` output in the question? Sometimes things like FUSE or network shares can mess it up. – Chris Lohfink Mar 16 '16 at 13:57

There's a bug in the agent. If you run `df <file>`, you should get a different filesystem than if you run `df --print-type --no-sync --local`. In my case, where I'm able to replicate, `df /home/<user>/random-folder` yields `/dev/disk/by-uuid/<uuid>` under the filesystem column.

This is caused by mounting your drive (at boot, via GRUB/LILO) by UUID instead of by a label. Both `df` outputs must report the same filesystem name.
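Assuming this diagnosis, you can check whether a node is affected by looking at the filesystem name `df` reports for the data directory (the path below is a stand-in; use your actual data path):

```shell
# Which filesystem does df report for the data directory? A by-uuid
# device name is the trigger described above.
data_dir=/            # typically /var/lib/cassandra on a DSE node
fs=$(df "$data_dir" | awk 'NR==2 {print $1}')
case "$fs" in
    /dev/disk/by-uuid/*) echo "mounted by UUID: likely affected" ;;
    *)                   echo "df reports $fs" ;;
esac
```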

It will be fixed in the next release.

For a temporary fix, while we fix this for the next release, make sure you mount the drive used for the data using a label instead of a UUID, and verify that the two `df` outputs match.
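As a sketch of that workaround (the device name, label, and fstab line here are examples, not taken from the question): label the filesystem, switch the fstab entry from UUID to label, then confirm the two `df` invocations agree.

```shell
# 1. Give the data filesystem a label (ext2/3/4; run as root; device is an example):
#      e2label /dev/sda6 dse-data
# 2. In /etc/fstab, mount by that label instead of UUID=... :
#      LABEL=dse-data  /  ext4  errors=remount-ro  0  1
# 3. After rebooting/remounting, confirm both df views now show the same name:
#      df /var/lib/cassandra
#      df --print-type --no-sync --local
```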

Joel Quiles