
We are collecting network traffic from switches using Zeek in the form of 'connection logs'. The connection logs are then stored in Elasticsearch indices via Filebeat. Each connection log is a tuple with the following fields: (source_ip, destination_ip, port, protocol, network_bytes, duration). There are more fields, but for simplicity let's consider only these for now. We get 200 million such logs every hour for internal traffic (Zeek lets us identify internal traffic through a field), and we have about 200,000 active IP addresses.

What we want to do is digest all these logs and create a graph where each node is an IP address and each directed edge (source → destination) represents traffic between two IP addresses. There will be one unique edge for each distinct (port, protocol) tuple. Each edge will have these properties: average duration, average bytes transferred, and a histogram of log counts by hour of day.
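
To make the target structure concrete, here is a minimal sketch of the kind of graph we are after, using NetworkX (field names and the hour-of-day derivation are simplified assumptions, not our actual schema):

```python
import networkx as nx

# Minimal sketch of the intended graph: nodes are IPs, one parallel edge
# per (port, protocol) between a source/destination pair, with running
# totals from which the averages and the hourly histogram are derived.
G = nx.MultiDiGraph()

def add_log(g, log):
    """Fold a single conn log (a dict with the fields above) into the graph.
    The 'hour_of_day' field is assumed to be derived from the log timestamp."""
    key = (log["port"], log["protocol"])
    src, dst = log["source_ip"], log["destination_ip"]
    if not g.has_edge(src, dst, key=key):
        g.add_edge(src, dst, key=key,
                   count=0, total_bytes=0, total_duration=0.0,
                   hourly=[0] * 24)
    e = g[src][dst][key]
    e["count"] += 1
    e["total_bytes"] += log["network_bytes"]
    e["total_duration"] += log["duration"]
    e["hourly"][log["hour_of_day"]] += 1
    # avg_bytes = total_bytes / count, avg_duration = total_duration / count
```

A naive in-memory build like this clearly does not scale to 200M logs per hour and ~200k nodes, hence the attempts to aggregate inside Elasticsearch described below.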
I have tried using Elasticsearch's aggregations and also the newer Transform feature. While both work in theory, and I have tested them successfully on a very small subset of IP addresses, the processes simply cannot keep up with our entire internal traffic. For example, digesting 1 hour of logs (about 200M) using a Transform takes about 3 hours.
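
For reference, a Transform pivot of roughly this shape (a sketch with assumed index and field names, not the exact configuration we run) produces one output document per (source_ip, destination_ip, port, protocol); note that the hour-of-day histogram is not directly expressible as a pivot aggregation:

```python
# Sketch of a Transform pivot body (index and field names are assumptions).
# It would be submitted via PUT _transform/<id>, or via the Python client's
# transform.put_transform(...).
transform_body = {
    "source": {"index": "zeek-conn-*"},      # assumed index pattern
    "dest": {"index": "ip-graph-edges"},     # assumed destination index
    "pivot": {
        "group_by": {
            "source_ip": {"terms": {"field": "source_ip"}},
            "destination_ip": {"terms": {"field": "destination_ip"}},
            "port": {"terms": {"field": "port"}},
            "protocol": {"terms": {"field": "protocol"}},
        },
        "aggregations": {
            "avg_duration": {"avg": {"field": "duration"}},
            "avg_bytes": {"avg": {"field": "network_bytes"}},
            "log_count": {"value_count": {"field": "source_ip"}},
        },
    },
}
```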

My question is: Is post-processing Elasticsearch data the right approach to building this graph? Or is there some product that we can use upstream to do this job? Someone suggested looking into ntopng, but I did not find this specific use case in their product description. (Not sure if it is relevant, but we use ntop's PF_RING product as a front end for Zeek.) Are there other products that do the job out of the box? Thanks.

Ned_the_Dolphin

1 Answer


What problems or root causes are you attempting to elicit with a graph of Zeek east-west traffic?

It seems that a more tailored use case, such as a specific type of authentication, or even a larger problem set such as endpoint access expansion, might be a better use of storage, compute, memory, and your other valuable time and resources, no?

Even if you did want to correlate or group on Zeek data, try to normalize it to OSSEM, and there would be no reason to, say, collect the full tuple when you can collect a community-id instead. You could correlate Zeek in the large with Suricata in the small. Perhaps a better data architecture would be VAST.

Kibana, in its latest iterations, does have Graph, and even older versions can leverage the third-party kbn_network plugin. I could see you hitting a wall with 200k active IP addresses and Elasticsearch aggregations, or even summary indexes.

Many orgs will build data architectures beyond the simple serving layer provided by Elasticsearch. What I have heard of is a Kappa architecture streaming into a graph database such as Dgraph directly, with perhaps just those edges of the graph exposed through a serving layer.
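
As a rough illustration only (the thread does not mention Kafka; the topic name, field names, and the flush target below are all assumptions), a Kappa-style consumer would fold each conn log into an edge aggregate as it streams past, so the graph store only ever receives pre-reduced edges:

```python
import json
from collections import defaultdict

from kafka import KafkaConsumer  # kafka-python; the Kafka topic itself is an assumption

# Kappa-style sketch: fold each conn log into an edge aggregate as it
# streams past, and flush pre-reduced edges to the graph store (Dgraph or
# otherwise) instead of re-aggregating 200M raw documents per hour.
consumer = KafkaConsumer(
    "zeek-conn",                                  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m),
)

edges = defaultdict(lambda: {"count": 0, "bytes": 0, "duration": 0.0,
                             "hourly": [0] * 24})

def flush(batch):
    """Placeholder: upsert the accumulated edges into the graph database."""
    ...

for msg in consumer:
    log = msg.value
    key = (log["source_ip"], log["destination_ip"], log["port"], log["protocol"])
    e = edges[key]
    e["count"] += 1
    e["bytes"] += log["network_bytes"]
    e["duration"] += log["duration"]
    e["hourly"][log["hour_of_day"]] += 1
    if len(edges) >= 100_000:   # arbitrary flush threshold
        flush(edges)
        edges.clear()
```

The graph database then only has to upsert aggregated edges, a far smaller write volume than the raw log rate.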

There are other ways of asking questions of IP address data, such as the ML options in AWS SageMaker IP Insights or the Apache Spot project.

Additionally, I'm a huge fan of getting the right data only as the situation arises, although in an automated way, so that the puzzle pieces bubble up for me and I can simply lock them into place. If I were working with Zeek data especially, I could leverage a platform such as Security Onion and its orchestrated Playbook engine to kick off other tasks for me, such as querying out with one of the Velocidex tools, or even cross-correlating using the built-in Sigma sources.

atdre
  • Author of VAST here. Nice and exhaustive answer. We built VAST for this kind of use case. The actual analysis can and should be done downstream, e.g., in Python, Spark, R, or your analysis tool of choice. VAST has native support for Zeek TSV and Corelight-style JSON logs. Via Apache Arrow, you can effectively use any analytics tool with it. We built such a graph using NetworkX in a short PoC. – mavam Apr 20 '20 at 18:41
  • This is a great start. The use case here is: "Tell me when an IP address starts behaving differently." Behaving differently can mean that the IP address is talking to a new source or destination, talking to the same hosts but using a different port or protocol, or sending an anomalous number of bytes between them. The usefulness of such knowledge is not beyond doubt; however, for the scope of this question we assume that we want to do it nonetheless. – Ned_the_Dolphin Apr 20 '20 at 19:37
  • Sure, community-id is perhaps a more economical way of storing the data. But my current challenge is not so much with storing the data as it is with processing the raw logs. – Ned_the_Dolphin Apr 20 '20 at 19:37
  • We do use Kibana's Graph for visualization, Jupyter notebook playbooks, and Sigma rules for various narrow and specific use cases, but those are not really solutions to the problem at hand. Note that we plan to store the behavior of IP addresses over several months, more than the data that is currently in Elasticsearch. (We only keep about a week's worth of data in Elasticsearch and store older logs in Hadoop.) – Ned_the_Dolphin Apr 20 '20 at 19:43
  • VAST seems very interesting. @mavam Where does VAST fit architecturally? On Zeek's worker or manager nodes? I agree with doing the analysis downstream. I also used networkx to build and query the graph for a subset of the IP addresses (~3000). – Ned_the_Dolphin Apr 20 '20 at 19:50