
I currently have the following setup:

syslog-ng servers --> Logstash --> ElasticSearch

The syslog-ng servers are load balanced and write to a SAN location, where Logstash just tails the files and sends them to ES. I'm currently receiving around 1,300 events/sec to the syslog cluster for the networking logs. The issue I'm running into is a gradual delay before the logs actually become searchable in ES. When I started the cluster (4 nodes), it was dead on. Then it fell a few minutes behind, and now, after 4 days, it's ~35 min behind. I can confirm the logs are being written in real time on the syslog-ng servers, and I can also confirm that my 4 other indexes, which use the same concept but a different Logstash instance, are staying up to date. However, their volume is significantly lower (~500 events/sec).

It appears the Logstash instance that is reading the flat files is not able to keep up. I've already split these files out once and spawned 2 Logstash instances to help, but I'm still falling behind.

Any help would be greatly appreciated.

--

Typical inputs are ASA logs, mainly denies and VPN connections:

Jan  7 00:00:00 firewall1.domain.com Jan 06 2016 23:00:00 firewall1 : %ASA-1-106023: Deny udp src outside:192.168.1.1/22245 dst DMZ_1:10.5.1.1/33434 by access-group "acl_out" [0x0, 0x0]
Jan  7 00:00:00 firewall2.domain.com %ASA-1-106023: Deny udp src console_1:10.1.1.2/28134 dst CUSTOMER_094:2.2.2.2/514 by access-group "acl_2569" [0x0, 0x0]

Here is my Logstash config.

input {
    file {
      type => "network-syslog"
      exclude => ["*.gz"]
      start_position => "end"
      path => [ "/location1/*.log","/location2/*.log","/location2/*.log" ]
      sincedb_path => "/etc/logstash/.sincedb-network"
    }
}

filter {
    grok {
      overwrite => [ "message", "host" ]
      patterns_dir => "/etc/logstash/logstash-2.1.1/vendor/bundle/jruby/1.9/gems/logstash-patterns-core-2.0.2/patterns"
      match => [
        "message", "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:host} %%{CISCOTAG:ciscotag}: %{GREEDYDATA:message}",
        "message", "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:host} %{GREEDYDATA:message}"
      ]
    }
    grok {
      match => [
        "message", "%{CISCOFW106001}",
        "message", "%{CISCOFW106006_106007_106010}",
        "message", "%{CISCOFW106014}",
        "message", "%{CISCOFW106015}",
        "message", "%{CISCOFW106021}",
        "message", "%{CISCOFW106023}",
        "message", "%{CISCOFW106100}",
        "message", "%{CISCOFW110002}",
        "message", "%{CISCOFW302010}",
        "message", "%{CISCOFW302013_302014_302015_302016}",
        "message", "%{CISCOFW302020_302021}",
        "message", "%{CISCOFW305011}",
        "message", "%{CISCOFW313001_313004_313008}",
        "message", "%{CISCOFW313005}",
        "message", "%{CISCOFW402117}",
        "message", "%{CISCOFW402119}",
        "message", "%{CISCOFW419001}",
        "message", "%{CISCOFW419002}",
        "message", "%{CISCOFW500004}",
        "message", "%{CISCOFW602303_602304}",
        "message", "%{CISCOFW710001_710002_710003_710005_710006}",
        "message", "%{CISCOFW713172}",
        "message", "%{CISCOFW733100}",
        "message", "%{GREEDYDATA}"
      ]
    }
    syslog_pri { }
    date {
      match => [ "syslog_timestamp", "MMM  d HH:mm:ss",
                 "MMM dd HH:mm:ss" ]
      target => "@timestamp"
    }
    mutate {
      remove_field => [ "syslog_facility", "syslog_facility_code", "syslog_severity", "syslog_severity_code"]
    }
}

output {
    elasticsearch {
      hosts => ["server1","server2","server3"]
      index => "network-%{+YYYY.MM.dd}"
      template => "/etc/logstash/logstash-2.1.1/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.2.0-java/lib/logstash/outputs/elasticsearch/elasticsearch-network.json"
      template_name => "network"
    }
}
  • Can you post a sanitized copy of your logstash server's input, filter and output configs? – TheFiddlerWins Jan 06 '16 at 19:54
  • @TheFiddlerWins - I've updated the initial question with additional information. Thanks. – Eric Jan 07 '16 at 17:48
  • How are you starting LS? More to the point, how many workers are you starting? – GregL Jan 07 '16 at 17:56
  • Nothing special. I just have an init script that spawns the Logstash bin file and then points it towards my configuration file. I have ~3 Logstash instances running on one server and 3 more on another, looking at various files to split it up some. I assumed I'd need to add more memory or worker nodes or something to this specific one, but wasn't sure the best way to do that. – Eric Jan 07 '16 at 18:19
  • 2
    It's possible to tell LS to start more workers per instance with the `-w N` command-line option, where N is a number. That should increase your event throughput. – GregL Jan 07 '16 at 21:35
  • It looks like it's almost caught up from being ~45 min behind to only ~4 this morning. I went from not defining -w to doing -w 3, so it definitely appears to have done the trick. I'm glad and sorry it was such a simple solution. :) I was expecting to have to deal more with the Java settings on Logstash. If you want to put your comment as an official answer I'd be glad to accept it. Thanks! – Eric Jan 08 '16 at 14:32
  • 2
    Damn you GregL! That was the road I was going to go down but I wanted to make sure he was not using multiline first. Glad you are working Eric – TheFiddlerWins Jan 08 '16 at 19:25

1 Answer


It's possible to tell LS to start more workers per instance with the -w N command-line option, where N is a number.

That should increase your event throughput substantially.

I don't know your exact server layout, but it's probably safe to start with half as many workers as you have cores on your LS boxes; adjust that based on what other functions they're performing.
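
As a rough illustration (the binary and config paths below are placeholders, not taken from the actual setup), a startup line for a single Logstash 2.x instance with extra filter workers might look like this:

# Hypothetical example: one Logstash 2.x instance with 4 filter workers.
# Paths are placeholders; point them at your own binary and config file.
/opt/logstash/bin/logstash -f /etc/logstash/conf.d/network.conf -w 4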

  • Thanks for your help. I'm definitely glad it was as simple as the -w workers option. I'll do some additional reading up on how many are recommended for my setup. – Eric Jan 11 '16 at 14:05
  • Bumping it from no -w option to -w 3 seemed to fix it initially, but now I'm lagging behind by over an hour. I just bumped it up to 4 to see if that helps at all, but are there any other factors you would recommend looking into? It's a fairly basic RHEL VM, and top doesn't show the system bogging down at all. – Eric Jan 13 '16 at 03:43
  • What version of LS? – GregL Jan 14 '16 at 00:49
  • I just updated a few weeks ago, so I'm on 2.1.1. Last night, with traffic lower, we caught back up to about 5 min behind, but hit a max of 1 hour behind during peak hours. I bumped the -w up to 5 earlier, but for some reason that made it worse; 3 is what has worked best so far. – Eric Jan 14 '16 at 14:12
  • I'll have to check my config, but I think I increased my LS heap settings since I had the RAM on the box. Maybe that's something you can look at next. – GregL Jan 14 '16 at 14:34
  • I was just going to check to see if you had any other settings configured. I'm consistently staying a few hours behind even after bumping my workers up to 8, after I increased the number of cores on the VM. – Eric Jan 21 '16 at 00:10
  • Is LS using all it's got, CPU-wise, when you've got 8 workers running? I'm actually away from the office for a few weeks, but I'll try to log in and check what else I changed on my instances. – GregL Jan 21 '16 at 01:23
  • The load on the server is around 1.9 overall. The process running that Logstash instance shows 333.2% CPU and 3.6% mem in top. – Eric Jan 21 '16 at 20:15
  • I just checked my configs and it looks like the default heap size for LS is 1GB. I have 3 instances (as outlined [here](http://serverfault.com/a/744556/266218)) on this box, and it turns out I had *decreased* the heap size for the main instance (parser) from 1GB to 512MB. Not sure now why I did that, or if it will help you, but I suggest playing with LS's heap settings and letting it run 24 hours between changes to see how they affect your ingestion rate. I think the key number to check for your LS instances is whether or not they're using all their heap, by looking at the `RES` column in `top`. – GregL Jan 22 '16 at 15:59
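
As a rough sketch of the heap suggestion in the comment above (the file path and value here are illustrative, not from the actual configs), Logstash 2.x reads its heap size from the LS_HEAP_SIZE environment variable, which the packaged init scripts typically pick up from /etc/sysconfig/logstash on RHEL or /etc/default/logstash on Debian:

# Hypothetical example: raise a Logstash 2.x instance's heap to 2GB.
# Add to /etc/sysconfig/logstash (or /etc/default/logstash) and restart the instance.
LS_HEAP_SIZE="2g"

Whether more heap helps depends on whether the instance is actually exhausting what it has, which is why checking the RES column in top is a reasonable first step.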