
Goal

Central storage and a way to analyze performance numbers:

  • CPU load
  • RAM usage
  • ...

Current strategy

I would like to implement a setup like this:

  1. collectd
  2. logstash
  3. elasticsearch
  4. kibana

As explained here: https://mtalavera.wordpress.com/2015/02/16/monitoring-with-collectd-and-kibana/

Problem: remote-host can't push data

Constraints:

  • We only have ssh from the central server to the remote-host.
  • ssh from remote-hosts to central server does not work because of the network setup (something I unfortunately can't change).
  • The network traffic crosses several non-public networks. Twice every month a host can't be reached because an admin plays with the firewall rules. I don't want to lose a single line. That's why I want to store the logs on the remote-host and fetch the (compressed) data.

Solution?

How can I fetch the data every hour?
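
To make it concrete, something like this pull sketch is what I imagine (hostnames and paths are made up):

    #!/bin/sh
    # fetch-remote-logs.sh - run hourly from the central server's cron, e.g.:
    #   0 * * * * /usr/local/bin/fetch-remote-logs.sh
    set -eu
    for host in remote-host1 remote-host2; do
        # ssh only works central -> remote, which is exactly what rsync needs here
        rsync -az --partial "$host:/var/log/collectd-buffer/" \
            "/srv/monitoring/incoming/$host/"
    done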

guettli
  • Questions for you - what do you mean by `ssh` "does not work"? Does it not work because you are not authorized to access it? :) Does it not work because it is a figment of your imagination? – tacos_tacos_tacos Apr 24 '16 at 07:32
  • @tacos_tacos_tacos I updated the question: " * `ssh` from remote-hosts to central server does not work because of the network setup (something I unfortunately can't change)." – guettli Apr 24 '16 at 07:36
  • I figured as much but I guess perhaps in the future if you are not going to share details of something or do not know why something is the case, just say that instead so that people do not think you are being intentionally vague – tacos_tacos_tacos Apr 24 '16 at 07:51
  • So the question you are asking is very broad. You include a link to a tutorial, so that's a start. I guess your only question is... "How do I get the data from the host I am monitoring to logstash?" – tacos_tacos_tacos Apr 24 '16 at 07:52
  • @tacos_tacos_tacos yes the question is broad because I want to avoid a xy-problem. That's why I describe the overall goal and the missing part in my current strategy. I am willing to switch strategy, since I want to solve the bigger problem. Do you understand this? If not, please ask! – guettli Apr 24 '16 at 08:22
  • I am going to add a section or two on how to do your own TCP input, but this should be plenty to get you in the right direction – tacos_tacos_tacos Apr 24 '16 at 08:33
  • Hey buddy, no problem. Question: what kind of thing are you reporting on? Do you know? For example syslogd, event logs in Windows, cat, jdbc, ... – tacos_tacos_tacos Apr 24 '16 at 08:35
  • Any luck? Make sure you check out the first link now – tacos_tacos_tacos Apr 24 '16 at 09:04
  • Why not just initiate a persistent SSH connection from the `central-server` out to each `remote-host` with [autossh](http://www.harding.motd.ca/autossh/) and use port redirection so that the collectd instances on `remote-host` can push data directly into LS? – GregL Apr 25 '16 at 14:27
  • @GregL I updated the question: "the network traffic crosses several non-public networks. Twice every month a host can't be reached because an admin plays with the firewall rules. I don't want to lose a single line. That's why I want to store the logs on the remote-host and fetch the (compressed) data." What does collectd do if the ssh-tunnel is dead? – guettli Apr 25 '16 at 14:38
  • It would likely fail and drop the stats, but I'm not really sure since I've never used it. The theory still works, but you just need to add buffering on the `remote-host` end. That doesn't complicate things *that* much. How long would end-to-end connectivity be lost; seconds, minutes, hours? – GregL Apr 25 '16 at 14:46
  • @GregL sometimes days :-( – guettli Apr 25 '16 at 15:11
  • Goodness, you've not made this easy have you... – GregL Apr 25 '16 at 15:14
  • Is each `remote-host` by itself, or are there more than one at a given site? – GregL Apr 25 '16 at 17:04
  • @GregL each remote-host is on its own in a faraway location. Sometimes there are two, but I don't want to introduce a special case for this. Up to now we have pets, not cattle :-) – guettli Apr 26 '16 at 08:30

2 Answers


With the problems you list above, you'll need to buffer the stats at the remote end so that nothing is lost.

There are a number of ways to do this, none of them overly simple, and all will take lots of testing to make sure they're viable. They all involve writing collectd's output locally, then using some method to get it to the Central Server.

I haven't tested any of the below, so some might not work at all.

In no particular order of ease or complication:

  1. Socket/Network Output to Script
    Write Collectd's output to a socket or IP/port, where a PHP/Perl/Python/Bash script is listening to write the commands to a file.

    Those files can then be pushed to/pulled by the central server and ingested by Logstash (a rough sketch of the listener side follows below).

    Pros: Simple script to capture the output; standard Linux commands used
    Cons: Not scalable if you're pulling lots of stats; need to maintain the script; not sure if LS will handle collectd's plain-text protocol

  2. Redis/AMQP/Kafka/MongoDB
    Write Collectd's output to one of these possible "buffers". They each work a little differently, and have different deployment options, so I'll leave it to you to figure out which is best, since that's out of scope for this question. That said, any of them should work.

    You'd then need a method to get the data from your buffer solution back to the Central Server. Application native Replication/Mirroring/Clustering or a script that runs every X interval to ship the data (run at either end) are two possibilities.

    Pros: Very flexible deployment options; should scale very well; uses well known tools/programs
    Cons: Buffer program might need lots of resources, or many packages installed

  3. Socket/Network Output to Logstash
    This is almost the same as option 1, but instead of having collectd output to a script/program, you have it write to a local Logstash instance on each Remote-Host.

    Logstash would then write to CSV/JSON locally, and you can use any means to get those files back to the Central Server, including LS itself.

    Pros: Single set of tools for the whole solution; provides a way to transform data at the edge, then just ingest centrally; very few moving parts
    Cons: Need Java/LS on all remote hosts

In addition to each option's pros/cons, the single common downside to all of them is that you'd need to find a way to maintain consistent configs on all the servers. If you have lots of remote nodes (or just lots of nodes in general), you might already have a Configuration Management System in place and this will be trivial.
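
As a rough sketch of option 1's listener side (the port, paths, and use of socat are my placeholders; note that collectd's network plugin speaks a binary protocol, so in practice you may prefer to have collectd write plain text locally instead; this only shows the buffering mechanics):

    # Receive collectd's datagrams on UDP 9999 and append them to a local
    # spool file; compress/rotate the file and let the central server pull it.
    socat -u UDP-RECV:9999 STDOUT >> "/var/log/collectd-buffer/$(date +%F).raw"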

GregL
  • We have custom scripts which work on log files. In the past I loved files, since it is easy to debug with command-line tools (grep, cut, python, ...). But it has drawbacks. I think it is time to use a message bus. At the moment I think AMQP could fit ... but I am unsure. It looks big ... and scary :-) – guettli Apr 28 '16 at 13:00
  • A little bit of Google-Fu tells me that RabbitMQ (AMQP) will do federation, whereby the `Central Server` could federate all the `remote-hosts` and see all their messages, thereby giving you really good visibility. The Federation plugin [docs](https://www.rabbitmq.com/federation.html) even say *The federation plugin uses AMQP 0-9-1 to communicate between brokers, and is designed to tolerate intermittent connectivity.* – GregL Apr 28 '16 at 17:41

edit: Achtung! Warning!

Please use this docker-compose setup instead of the one I linked (it does require Docker and Compose, and maybe Machine, but it has done more for you and you will have to struggle less).

CELK: https://github.com/codenamekt/celk-docker-compose/blob/master/logstash/logstash.conf

Also, start here to get a nice overview of a working system. They've done some of the work for you, so you only have to worry about your question, which pertains to configuration and deployment.

Even if you don't end up using Docker, this will still get you on the track to success, with the added benefit of showing you how it all fits together.

Get Vagrant first and build the Vagrant image with vagrant up

If you don't know what Vagrant is, it's wonderful. It's a program that allows people to share an entire set of virtual machines and provisioners, so that you define only a VM and its configuration rather than sharing the whole VM image, and it "just works." It feels magical, but it is really just solid systems work.

You will need to install Vagrant to use this. Just do it! Then you don't have to install Docker, because everything will run inside a Vagrant VM.

You have four choices of how you want to use it, but first, get Vagrant ready with the following command:

vagrant up
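
For example, assuming you start from the CELK repo linked above and that it ships a Vagrantfile (check the repo first; this is a sketch):

    # Fetch the compose setup and bring the VM up
    git clone https://github.com/codenamekt/celk-docker-compose.git
    cd celk-docker-compose
    vagrant up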


Decide which programs you need to run

Your options are:

  • Full ELK suite (Elasticsearch, Logstash, Kibana)
  • Agent only (Logstash collector)
  • Kibana only

There are other configurations available but for testing only.


Showtime

Now it is time to configure Logstash, which is really the only part that has complex behavior.

Logstash configuration files are plain-text files that end in .conf and can optionally be bundled together as a tar or gzipped (.tar.gz) archive.
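
For example, bundling a directory of configs could look like this (names are made up):

    # Bundle all .conf files into one archive that Logstash-in-Docker can fetch
    tar -czf myconfig.tar.gz conf.d/*.conf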

You get the config files in one of two ways:

  • you download them from the Internet, using the environment variable LOGSTASH_CONFIG_URL to point to the URL of your config; if you get the URL wrong or there is some problem fetching it, the image falls back to a known default config; or else
  • read them from the disk - kind of; since this is Docker, you will actually create a volume once (now) and mount that volume each time you run the container (see the volume example below).

Here is what it looks like when you run using a config from the Internet:

$ docker run -d \
  -e LOGSTASH_CONFIG_URL=https://secretlogstashesstash.com/myconfig.tar.gz \
  -p 9292:9292 \
  -p 9200:9200 \
  pblittle/docker-logstash
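
And here is roughly the volume variant; the container-side path is an assumption on my part, so check the image's README for where it actually expects configs:

$ docker run -d \
  -v /srv/logstash/conf.d:/opt/logstash/conf.d \
  -p 9292:9292 \
  -p 9200:9200 \
  pblittle/docker-logstash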

The author of the Docker image warns you:

The default logstash.conf only listens on stdin and file inputs. If you wish to configure tcp and/or udp input, use your own logstash configuration files and expose the ports yourself. See logstash documentation for config syntax and more information.
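
If you do go the udp route, exposing the port is just another flag; for example, for collectd's default port (a sketch, reusing the URL from above):

$ docker run -d \
  -e LOGSTASH_CONFIG_URL=https://secretlogstashesstash.com/myconfig.tar.gz \
  -p 25826:25826/udp \
  pblittle/docker-logstash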

Note: What is the default logstash conf?

Recall that it is the file you get when you don't put in a correct URL for the required environment variable LOGSTASH_CONFIG_URL.

This is the input section:

# As the author warned, this is all you get: stdin and syslog file inputs.

input {
  stdin {
    type => "stdin-type"
  }

  file {
    type => "syslog"
    path => [ "/var/log/*.log", "/var/log/messages", "/var/log/syslog" ]
  }

  file {
    type => "logstash"
    path => [ "/var/log/logstash/logstash.log" ]
    start_position => "beginning"
  }
}

Beyond default

Read more about logstash on the website.

Logstash has input plugins that feed data into the pipeline. The plugins vary exactly as you would expect; here are a few:

  • s3 (reads from an Amazon S3 bucket)
  • stdin (the default; reads the stdin buffer)
  • http (your guess: receives events over HTTP)
  • ...etc...

Example: UDP sockets

UDP is a fast, connectionless protocol that operates at L4 (transport). It supports multiplexing via port numbers, but delivery is best-effort (it does not handle failures); that low overhead is exactly why it is a common choice for shipping logging and metrics data.

You pick the port that you want; other options depend on what you are doing. TCP works the same way.

udp {
  port => 9999
  buffer_size => 1452
  codec => json
}
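
And a minimal sketch of the tcp equivalent (same arbitrary port; the codec is up to you):

tcp {
  port => 9999
  codec => json
}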

Example 2: UDP sockets from collectd filtered and output

This is stolen from https://github.com/codenamekt/celk-docker-compose/blob/master/logstash/logstash.conf

input {
  udp {
    port => 25826         # 25826 matches port specified in collectd.conf
    buffer_size => 1452   # 1452 is the default buffer size for Collectd
    codec => collectd { } # specific Collectd codec to invoke
    type => collectd
  }
}
output {
  elasticsearch {
    host => elasticsearch
    cluster  => logstash
    protocol => http
  }
}

And the filter section is a great example: it is really long, but what it does is normalize collectd's fields. It renames the plugin fields, converts the values to floats, and clones events so each metric value gets its own document (for Kibana 3).

filter {
  # TEST implementation of parse for collectd
  if [type] == "collectd" {
    if [plugin] {
      mutate {
        rename => { "plugin" => "collectd_plugin" }
      }
    }
    if [plugin_instance] {
      mutate {
        rename => { "plugin_instance" => "collectd_plugin_instance" }
      }
    }
    if [type_instance] {
      mutate {
        rename => { "type_instance" => "collectd_type_instance" }
      }
    }
    if [value] {
      mutate {
        rename => { "value" => "collectd_value" }
      }
      mutate {
        convert => { "collectd_value" => "float" }
      }
    }
    if [collectd_plugin] == "interface" {
      mutate {
        add_field => {
          "collectd_value_instance" => "rx"
          "collectd_value" => "%{rx}"
        }
      }
      mutate {
        convert => {
          "tx" => "float"
          "collectd_value" => "float"
        }
      }
      # force clone for kibana3
      clone {
        clones => [ "tx" ]
      }
      ##### BUG EXISTS : AFTER clone 'if [type] == "foo"' NOT WORKING : ruby code is working #####
      ruby {
        code => "
          if event['type'] == 'tx'
            event['collectd_value_instance'] = 'tx'
            event['collectd_value'] = event['tx']
          end
        "
      }
      mutate {
        replace => { "_type" => "collectd" }
        replace => { "type" => "collectd" }
        remove_field => [ "rx", "tx" ]
      }
    }
    if [collectd_plugin] == "disk" {
      mutate {
        add_field => {
          "collectd_value_instance" => "read"
          "collectd_value" => "%{read}"
        }
      }
      mutate {
        convert => {
          "write" => "float"
          "collectd_value" => "float"
        }
      }
      # force clone for kibana3
      clone {
        clones => [ "write" ]
      }
      ##### BUG EXISTS : AFTER clone 'if [type] == "foo"' NOT WORKING : ruby code is working #####
      ruby {
        code => "
          if event['type'] == 'write'
             event['collectd_value_instance'] = 'write'
             event['collectd_value'] = event['write']
          end
        "
      }
      mutate {
        replace => { "_type" => "collectd" }
        replace => { "type" => "collectd" }
        remove_field => [ "read", "write" ]
      }
    }
    if [collectd_plugin] == "df" {
      mutate {
        add_field => {
          "collectd_value_instance" => "free"
          "collectd_value" => "%{free}"
        }
      }
      mutate {
        convert => {
          "used" => "float"
          "collectd_value" => "float"
        }
      }
      # force clone for kibana3
      clone {
        clones => [ "used" ]
      }
      ##### BUG EXISTS : AFTER clone 'if [type] == "foo"' NOT WORKING : ruby code is working  #####
      ruby {
        code => "
          if event['type'] == 'used'
            event['collectd_value_instance'] = 'used'
            event['collectd_value'] = event['used']
          end
        "
      }
      mutate {
        replace => { "_type" => "collectd" }
        replace => { "type" => "collectd" }
        remove_field => [ "used", "free" ]
      }
    }
    if [collectd_plugin] == "load" {
      mutate {
        add_field => {
          "collectd_value_instance" => "shortterm"
          "collectd_value" => "%{shortterm}"
        }
      }
      mutate {
        convert => {
          "longterm" => "float"
          "midterm" => "float"
          "collectd_value" => "float"
        }
      }
      # force clone for kibana3
      clone {
        clones => [ "longterm", "midterm" ]
      }
      ##### BUG EXISTS : AFTER clone 'if [type] == "foo"' NOT WORKING : ruby code is working #####
      ruby {
        code => "
          if event['type'] != 'collectd'
            event['collectd_value_instance'] = event['type']
            event['collectd_value'] = event[event['type']]
          end
        "
      }
      mutate {
        replace => { "_type" => "collectd" }
        replace => { "type" => "collectd" }
        remove_field => [ "longterm", "midterm", "shortterm" ]
      }
    }
  }
}

edit 3: I probably shouldn't be doing your work for you, but that's ok.

collectd, like any good software, ENCAPSULATES certain aspects that are ugly or difficult for users to deal with, and tries to make it easy for you, in that it looks like you are sending data (a tuple in this case) instead of fooling with serialization.

Your example:

(date_time, current_cpu_load), for example ('2016-04-24 11:09:12', 12.3)

I'm not going to spend my time figuring out how you are forming that. If you are able to get that data using the CPU plugin, great. I'm going to copy and paste one I found online to make it easy for me.

That said, think about it ... just a little bit, it won't hurt.
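
As an aside, if you want to see that tuple on disk with your own eyes, collectd's csv plugin writes epoch,value rows to local files; a minimal snippet (the directory is an arbitrary choice):

    LoadPlugin csv
    <Plugin csv>
            DataDir "/var/lib/collectd/csv"
            StoreRates true
    </Plugin>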

You see the CPU plugin is loaded below.

Note that collectd's interface in the conf file is coarse-grained: you load plugins, but you can't select individual fields.

So if you just do this, it will work, but you will get much, much more data than just CPU load.

That's where you can use a filter. But you can also do that in Kibana, I think. So I'd rather not waste time writing a filter you a) don't need and b) could easily write if you spent some time.

    ## In collectd.conf:
    # For each instance where collectd is running, we define a
    # hostname proper to that instance. When metrics from
    # multiple instances are aggregated, hostname will tell
    # us where they came from.
    Hostname "YOUR_HOSTNAME"

    # Fully qualified domain name, false for our little lab
    FQDNLookup false

    # Plugins we are going to use, with their configurations
    # if needed
    LoadPlugin cpu

    LoadPlugin df
    <Plugin df>
            Device "/dev/sda1"
            MountPoint "/"
            FSType "ext4"
            ReportReserved "true"
    </Plugin>

    LoadPlugin interface
    <Plugin interface>
            Interface "eth0"
            IgnoreSelected false
    </Plugin>

    LoadPlugin network
    <Plugin network>
            Server "YOUR.HOST.IP.ADDR" "PORTNUMBER"
    </Plugin>

    LoadPlugin memory

    LoadPlugin syslog
    <Plugin syslog>
            LogLevel info
    </Plugin>

    LoadPlugin swap

    <Include "/etc/collectd/collectd.conf.d">
            Filter "*.conf"
    </Include>

Your logstash config

    input {
      udp {
        port => PORTNUMBER    # 25826 matches the port specified in collectd.conf
        buffer_size => 1452   # 1452 is the default buffer size for collectd
        codec => collectd { } # specific collectd codec to invoke
        type => collectd
      }
    }
    output {
      elasticsearch {
        cluster  => ELASTICSEARCH_CLUSTER_NAME # this matches our elasticsearch cluster.name
        protocol => http
      }
    }
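
To test that config by hand, you can point Logstash straight at the file (the path is hypothetical; use wherever you saved it):

    bin/logstash -f /etc/logstash/conf.d/collectd.conf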

An update from Aaron on the collectd codec:

In Logstash 1.3.x, we introduced the collectd input plugin. It was awesome! We could process metrics in Logstash, store them in Elasticsearch and view them with Kibana. The only downside was that you could only get around 3100 events per second through the plugin. With Logstash 1.4.0 we introduced a newly revamped UDP input plugin which was multi-threaded and had a queue. I refactored the collectd input plugin to be a codec (with some help from my co-workers and the community) to take advantage of this huge performance increase. Now with only 3 threads on my dual-core Macbook Air I can get over 45,000 events per second through the collectd codec!

So, I wanted to provide some quick examples you could use to change your plugin configuration to use the codec instead.

The old way:

    input {
      collectd {}
    }

The new way:

    input {
      udp {
        port => 25826         # Must be specified. 25826 is the default for collectd
        buffer_size => 1452   # Should be specified. 1452 is the default for recent versions of collectd
        codec => collectd { } # This will invoke the default options for the codec
        type => "collectd"
      }
    }

This new configuration will use 2 threads and a queue size of 2000 by default for the UDP input plugin. With this you should easily be able to break 30,000 events per second!

I have provided a gist with some other configuration examples. For more information, please check out the Logstash documentation for the collectd codec.

Happy Logstashing!

tacos_tacos_tacos
  • Wow, a lot of text ... but I don't get it. Imagine collectd creates this tuple (date_time, current_cpu_load), for example ('2016-04-24 11:09:12', 12.3). How does this data travel - step by step - to the central Elasticsearch server? – guettli Apr 24 '16 at 09:10
  • That's what logstash is for :) It collects logs with `input`, filters them with `filter` and sends them to WHEREVER, INCLUDING `es`, with `output` – tacos_tacos_tacos Apr 24 '16 at 10:51
  • Did you run the example I sent you? It is literally answering your question without doing it for you, because I don't have whatever source of logs you're trying to send – tacos_tacos_tacos Apr 24 '16 at 10:52
  • BTW the tuple you sent me - OK, I'll answer that in the edit – tacos_tacos_tacos Apr 24 '16 at 10:52
  • I take it that you did not run the docker compose I sent you, or you are still running it. That will answer your question better than anyone could, because you will see it working live. I posted an edited answer right above. I don't know the syntax for collectd, so you will need to add a filter if you don't want the other data coming in. – tacos_tacos_tacos Apr 24 '16 at 11:11
  • I prefer to solve things like this: first understand it, then do it. I have not run your docker compose because I have not understood it. I don't want the data flow to be encapsulated - I want to understand it. Where does logstash run in your solution? On the remote-host side or on the central server side? I guess there needs to be a cron-job on the central side which gets the data from remote every N minutes. How does this data fetching work? – guettli Apr 24 '16 at 14:26
  • With respect, you ought to try reading instead of asking. You may not want the data flow to be encapsulated, but most of us who do this for a living understand that encapsulation is your friend. If you insist on knowing: collectd uses something called `protocol buffers` that it relabels as its "binary protocol" https://github.com/logstash-plugins/logstash-codec-collectd/blob/master/lib/logstash/codecs/collectd.rb – tacos_tacos_tacos Apr 24 '16 at 14:40
  • That's how the data gets serialized. As for cron, NO. This is not 1995. – tacos_tacos_tacos Apr 24 '16 at 14:40
  • Sorry to be rude, but if you are using cron that widely you probably should be embracing more encapsulation in your life :) Polling is inefficient, we all know that instinctively; instead there is an agent on the collectd side, namely ... `collectd` – tacos_tacos_tacos Apr 24 '16 at 14:41
  • collectd sends a udp stream to the logstash collector in a binary format as described above, and it's encrypted as well – tacos_tacos_tacos Apr 24 '16 at 14:42
  • Finally, whether the logstash is remote or colocated with ES is irrelevant. 95% of the time it will be with ES. But who cares? That's why networks exist. If the latency is reasonable and the bandwidth is sufficient, it can be in Timbuktu – tacos_tacos_tacos Apr 24 '16 at 14:48
  • If you have any further questions please open them in chat and invite me. Otherwise good luck. – tacos_tacos_tacos Apr 24 '16 at 14:49
  • @guettli is right, this *is* a lot of text, and while I'm certain it works, it's not exactly clear or concise. I also don't think it fully addresses the OP's question about how to do it if `remote-host` can't connect to `central server`. – GregL Apr 25 '16 at 14:29
  • @GregL then this should not have the words `logstash` or `collectd` in it, and should instead be: "How can I connect to a remote host behind a firewall whose rules are set to disallow the connection?" :/ – tacos_tacos_tacos Apr 25 '16 at 20:35
  • @tacos_tacos_tacos, I mostly agree, but since `logstash` and `collectd` offer so many ways to implement, it's probably worth saying what he's trying to do, and with what. The question probably would have been put on hold/closed had the OP not mentioned them. – GregL Apr 25 '16 at 20:55
  • Fair enough. It's hard to accurately detect someone's tone on the information superhighway, but I wasn't liking whatever it was. That combined with the lack of provided information or extensive effort probably clouded my ability to read clearly what he was asking – tacos_tacos_tacos Apr 25 '16 at 22:59