
I have an idea to build my own metric-collection agent for Linux systems, with various customised features and controls. I would like to know the best practice for collecting metrics continuously from a Linux system.

  1. Is it best to use an infinite while loop with a sleep inside for the required data-collection interval? Or is there a better method for repeated data collection that does not waste much system memory?

  2. If I want to collect multiple metrics - CPU utilisation, memory utilisation, disk utilisation, etc. - what is the best way to execute all the commands in parallel? Is it a good idea to use & to run them in the background, collect all the process ids and verify they all completed? Or is there a better way for this purpose?

Thanks in advance.

Jithin C V
  • Sooo, why not roll existing solutions? Zabbix, Nagios? `Is it best to` What is used to measure "best"ness? Most probably not, as `sleep` will sleep a little bit more than required. It could be more accurate to use OS-specific tools for executing tasks at specific intervals - I mean `timer_create()`. But that depends on what is considered "best". `What is the best way` What is used to measure "best"ness? There is no "best" or "worse", it all depends. As such, I believe your question is too broad. Kindly see [ask] and I recommend https://meta.stackoverflow.com/q/260648/9072753 – KamilCuk Nov 11 '21 at 10:02
  • Thanks for the response. 1. CPU, memory etc. were only examples to explain my use case. In the actual scenario the collected metrics may vary, and may not be available in existing solutions like Nagios. 2. By "best" I mean the best method to follow. The reason behind the question is that this agent runs on the OS indefinitely, so I need it to consume as little CPU and memory as possible. If a while loop kept adding data to RAM, I would not use while loops. That is what I actually meant by "best": lower compute consumption. – Jithin C V Nov 11 '21 at 10:21
  • `which may not available on existing solutions like Nagios` All have "custom metrics" or something like that. `Best means, the best method to follow` Does not answer the question. How to _measure_ "best"? How _measure_ what is best to follow? The best __in my opinion__, is not reinvent the wheel. If you _really_ need custom semantics, use zabbix-agent2 source code and modify it to your needs. If you just need custom metric, I see no value in rolling custom solution, as it will be costly and eat significant amount of workhours with no value. It would be more valuable to use existing solutions. – KamilCuk Nov 11 '21 at 10:30

1 Answer


Lower compute consuming

Use the C programming language, or write in assembly. Generally, the answer is: the lower-level the better, the less you do the better. I assume the C programming language in the answer below.

Is it best to use an infinite while loop with a sleep inside for the required data-collection interval?

Use an OS-specific interface to execute an action periodically: timer_create(). Calling nanosleep() in a loop would require computing the time difference to stay accurate, which would require getting the current time, which is costly. Depending on the kernel is better.

In the signal handler, only set a single sig_atomic_t flag - nothing else. Then wait asynchronously for events in a loop, for example with pause() or pselect().

What is the best way to execute all commands in parallel?

To minimize "compute consuming", do not call fork() and do not create threads. Use one thread with one big loop around a poll() call that waits for all the events, if the events are asynchronous. Such an approach quickly results in spaghetti code - take special care to properly structure and modularize your code.

open() all the /proc/** and /sys/** interfaces that you need to monitor once, then periodically lseek() back to the start and read() them again whenever you need to send data.

So overall, in very pseudocode:

volatile sig_atomic_t flag = 0;

void timer_callback(int) {
    flag = 1; // only set a flag - handlers must stay async-signal-safe
}
int main() {
    metrics_read = 0; // keep count of asynchronous reads

    timer_create(); // periodic timer delivering a signal
    foreach(metric) {
        metric.fd = open("/proc/stat", O_NONBLOCK); // for example
    }

    while (1) {
        r = pselect(...); // the mask unblocks the timer signal only here

        if (r == -1 && errno == EINTR && flag) { // the timer fired
            flag = 0;
            foreach(metric) {
                lseek(metric.fd, 0, SEEK_SET);
                read(metric.fd, metric.buffer); // non-blocking, each read with its own buffer, to let the kernel do the job
            }
            continue;
        }

        if (FD_ISSET(socket_to_send_data)) {
            // take action, e.g. in case the socket was closed or similar
        }
        foreach(metric) { // FD_ISSET etc. for each metric
            if (FD_ISSET(metric.fd)) {
                parse(metric.buffer); // parse events as they come
                metrics_read++;
            }
        }

        if (metrics_read == ALL_METRICS_CNT) {
            send_metrics(); // also asynchronous: socket() and write() with O_NONBLOCK
            metrics_read = 0;
        }
    }
}

Do not write any logs. Logging causes I/O operations, which are "compute consuming". Do not output anything. Also, special care needs to be taken with pselect()'s signal mask to guarantee that the flag is always checked without a race against signal delivery.

Is it a good idea to use & to run them in the background, collect all the process ids and verify they all completed?

Definitely not - fork() is a very "compute consuming" function; spawning processes is costly. Best is not to leave anything "in the background" and to execute everything in a single-threaded, single process.

Or is there a better way for this purpose?

The lowest "compute consuming" option would of course be writing a kernel module that does the job. Then you can manage kernel resources directly to achieve the lowest possible overhead while keeping your system Linux-compatible.

KamilCuk