I am trying to figure out how best to monitor usage of our HPC resources. Specifically, I want to track CPU usage, disk space consumed, and number of jobs run, each broken down by group.
PBS lets a job script declare the group it belongs to with the "-W group_list" option. I want to use this to monitor cluster usage, but I can't find documentation on how to track it over time.
Ganglia's gmond and gmetric offer some of this functionality: I can see the metrics I'm interested in, but I can't figure out how to break them down by the -W group_list value, by user, or by any other key.
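To make the goal concrete, here is a rough sketch of the per-group tally I have in mind. It assumes Torque/PBS-style accounting records of the form "date;E;jobid;key=value ..." that carry "group=" and "resources_used.cput=" fields, and that the group recorded there reflects the -W group_list value; I have not confirmed either on our cluster. The sample records and function names are made up for illustration, and disk space is not covered.

```python
from collections import defaultdict


def cput_to_seconds(cput):
    """Convert an HH:MM:SS CPU-time string to seconds."""
    hours, minutes, seconds = (int(part) for part in cput.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def summarize_by_group(lines):
    """Tally job count and CPU seconds per group from 'E' (job end) records."""
    totals = defaultdict(lambda: {"jobs": 0, "cpu_seconds": 0})
    for line in lines:
        # Assumed layout: timestamp;record_type;job_id;space-separated key=value pairs
        parts = line.strip().split(";", 3)
        if len(parts) != 4 or parts[1] != "E":
            continue  # skip malformed lines and non-end records
        fields = dict(
            item.split("=", 1) for item in parts[3].split() if "=" in item
        )
        group = fields.get("group", "unknown")
        totals[group]["jobs"] += 1
        totals[group]["cpu_seconds"] += cput_to_seconds(
            fields.get("resources_used.cput", "0:0:0")
        )
    return dict(totals)


# Hypothetical sample records, not taken from a real cluster.
SAMPLE = [
    "03/01/2013 10:00:00;E;101.head;user=alice group=physics resources_used.cput=01:00:00",
    "03/01/2013 11:00:00;E;102.head;user=bob group=physics resources_used.cput=00:30:00",
    "03/01/2013 11:05:00;S;103.head;user=carol group=chem",
    "03/01/2013 12:00:00;E;103.head;user=carol group=chem resources_used.cput=00:00:45",
]

if __name__ == "__main__":
    for group, stats in sorted(summarize_by_group(SAMPLE).items()):
        print(group, stats["jobs"], stats["cpu_seconds"])
```

Something like this run periodically would give jobs and CPU time per group, but I would prefer a tool that already does this and keeps the history.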
Any advice?