
As part of a security product I have a high-scale cloud service (an Azure worker role) that reads events from an event hub, batches them into groups of ~2000, and stores them in blob storage. Each event has a MachineId (the machine that sent it). Events arrive from the event hub in random order and I store them in random order in blob storage. The throughput is up to 125K events/sec and each event is ~2K, so we have up to 250MB/sec of traffic. We have ~1M machines...

Later, another cloud service downloads the blobs and runs some detection logic on the events. It groups the events by MachineId and tries to understand something from the machine's timeline.

The problem is that today events from the same machine end up in different blobs. If I could somehow group the events by their MachineId and make sure that some time window of a machine's events lands in the same blob, this would increase the detections I could do in the cloud.

We also write the events to another MapReduce system, where we run much more complex detections, but those of course have high latency. If I could group the events better in the cloud, I could catch more in real time.

Is there any technology that might help me with that?

Thanks in advance

Sreeram Garlapati
Zorik
  • Aren't events partitioned by machine id in event hubs? – Mikhail Shilkov Jan 20 '18 at 10:44
  • Right now they are not. Part of the thinking was to use another event hub to partition the events into groups. As far as I know the event hub partition limit is 1K, so it will split my data (~1M machines) into groups of ~1K machines each. But this is only a partial solution - I'm looking to see if I can do better. – Zorik Jan 20 '18 at 14:53
  • How did you manage to achieve 125K event/sec and 250MB/sec? Do you use more than 20 TU? – Kuba Wyrostek May 17 '18 at 21:51
  • Kuba Wyrostek - yes we do – Zorik Jun 02 '18 at 14:49

1 Answer


tl;dr: Introducing another EventHub - in between the original eventhub and the blob storage - which re-partitions data per MachineID - is the best way to go.

In general, have one INGESTING EVENTHUB - which is just an entry point to your monitoring system. Use the EventHubClient.send(eventData_without_partitionKey) approach to send to this INGESTING EVENTHUB. This will let you send with very low latency and high availability - events go to whichever partition is currently taking less load and is available.

 --------------                     -----------                 ----------
|              |    =========      |           |    ====       |          |
|  INGESTING   |    RE-PARTITION > |  INTERIM  |    BLOB \     |   BLOB   |
|  EVENTHUB    |    =========      |  EVENTHUB |    PUMP /     |          |
|              |                   |           |    ====        ----------
 --------------                     -----------

Most importantly, refrain from partitioning data directly on the INGESTING EVENTHUB, for these reasons:

  1. Highly available ingestion pipeline - not associating events with a partition keeps your ingestion pipeline highly available. Behind the scenes, we host each of your EventHubs partitions on a container. When you provide a PartitionKey on your EventData, that PartitionKey is hashed to a specific partition. The Send operation's latency is then tied to that single partition's availability - events like a Windows OS upgrade or our service upgrade could impact it. If you instead stick to EventHubClient.send(without_PartitionKey), we route the EventData as soon as possible to an available partition - so your ingestion pipeline is guaranteed to stay highly available.
  2. Flexible data design - in distributed systems you will quite often, sooner or later, need to re-partition data on a different key. Be sure to measure the probability of this in your system :).
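To make the availability trade-off above concrete, here is a minimal, self-contained sketch - a simplified model of the routing, not the actual Event Hubs service implementation. A keyed send pins a MachineID to one partition via a hash, while a keyless send is free to route around an unavailable partition:

```java
// Simplified model of keyed vs. keyless routing (illustrative only,
// not the real Event Hubs service code).
public class RoutingSketch {

    // Keyed send: the PartitionKey is hashed to exactly one partition,
    // so the send's availability is tied to that single partition.
    static int keyedRoute(String partitionKey, int partitionCount) {
        return Math.abs(partitionKey.hashCode() % partitionCount);
    }

    // Keyless send: the service may pick any currently-available
    // partition, e.g. the least loaded one.
    static int keylessRoute(boolean[] partitionAvailable, int[] partitionLoad) {
        int best = -1;
        for (int p = 0; p < partitionAvailable.length; p++) {
            if (partitionAvailable[p] && (best < 0 || partitionLoad[p] < partitionLoad[best])) {
                best = p;
            }
        }
        return best; // -1 only if every partition is down
    }

    public static void main(String[] args) {
        int partitions = 4;

        // Keyed: the same MachineID always lands on the same partition...
        int p1 = keyedRoute("machine-42", partitions);
        int p2 = keyedRoute("machine-42", partitions);
        System.out.println(p1 == p2); // true - stable mapping

        // ...so if that partition is briefly unavailable (e.g. during an
        // upgrade), keyed sends stall, while keyless sends route around it.
        boolean[] available = {true, true, true, true};
        int[] load = {10, 3, 7, 5};
        available[p1] = false; // simulate that node going down
        int chosen = keylessRoute(available, load);
        System.out.println(chosen != p1); // true - keyless avoids the down partition
    }
}
```

The stable mapping in `keyedRoute` is exactly what you want on the INTERIM EVENTHUB - and exactly what hurts availability on the ingesting one.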

Use the INTERIM EVENTHUB as the place to partition data. i.e., in the RE-PARTITION module you are simply replaying the original stream to the INTERIM EVENTHUB, setting the PartitionKey on each EventData - which was originally empty - to its MachineID.

// pseudo-code: RE-PARTITION EVENTS
foreach (eventData in receivedFromIngestingEventHub)
{
    var newEventData = clone(eventData);
    // promote the machineId property to the PartitionKey
    eventHubClient.send(newEventData, eventData.Properties.get("machineId"));
}
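The pseudo-code can be fleshed out into a self-contained sketch. `Event` and `PartitionedSender` below are hypothetical stand-ins for the Event Hubs client types, used only to show the PartitionKey swap:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Self-contained sketch of the RE-PARTITION step. Event and
// PartitionedSender are hypothetical stand-ins for the real
// Event Hubs client classes.
public class RepartitionSketch {

    static class Event {
        final Map<String, String> properties = new HashMap<>();
        String partitionKey; // empty on the ingesting hub
        final byte[] body;

        Event(byte[] body) { this.body = body; }

        Event cloneEvent() {
            Event copy = new Event(body);
            copy.properties.putAll(properties);
            return copy;
        }
    }

    // Stand-in for an EventHubClient pointed at the INTERIM EVENTHUB.
    static class PartitionedSender {
        final List<Event> sent = new ArrayList<>();
        void send(Event e, String partitionKey) {
            e.partitionKey = partitionKey; // the service hashes this to one partition
            sent.add(e);
        }
    }

    // Replay the ingested stream, promoting the machineId property
    // to the PartitionKey so all of a machine's events co-locate.
    static void repartition(List<Event> ingested, PartitionedSender interimHub) {
        for (Event eventData : ingested) {
            Event newEventData = eventData.cloneEvent();
            interimHub.send(newEventData, eventData.properties.get("machineId"));
        }
    }

    public static void main(String[] args) {
        Event e = new Event("sample".getBytes());
        e.properties.put("machineId", "machine-7");

        PartitionedSender interim = new PartitionedSender();
        repartition(List.of(e), interim);
        System.out.println(interim.sent.get(0).partitionKey); // machine-7
    }
}
```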

What this ensures is that all EventData with a specific MachineID land on one and only one EventHubs partition. You do not need to create 1M EventHubs partitions - each partition can hold a practically unlimited number of PartitionKeys. You could use EventProcessorHost to host this per-partition logic, or an Azure Stream Analytics job.

Also, this is your chance to filter and produce an optimal stream - which is consumable by the down-stream processing pipeline.

In the BLOB PUMP module (your down-stream processing pipeline), when you consume events from a specific INTERIM EVENTHUB partition, you are now guaranteed to have all events from a specific MachineID on that partition. Aggregate the events to your required batch size (~2000) per PartitionKey (MachineID). You will not receive all of a machine's events contiguously, so you will need to build in-memory aggregation logic for this (using EventProcessorHost or an Azure Stream Analytics job).
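A minimal sketch of that in-memory aggregation, assuming the ~2000-event batch size from the question. The class name and the `flushToBlob` callback are illustrative, not Event Hubs API; a real EventProcessorHost implementation would also flush on a timer and checkpoint only after a successful blob write:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Buffer events per machineId and flush a blob-sized batch once it
// reaches the configured size.
public class BlobPumpAggregator {
    private final int batchSize;
    private final Map<String, List<String>> buffers = new HashMap<>();
    private final BiConsumer<String, List<String>> flushToBlob;

    BlobPumpAggregator(int batchSize, BiConsumer<String, List<String>> flushToBlob) {
        this.batchSize = batchSize;
        this.flushToBlob = flushToBlob;
    }

    // Called for every event received from this INTERIM EVENTHUB partition.
    void onEvent(String machineId, String eventBody) {
        List<String> buffer = buffers.computeIfAbsent(machineId, k -> new ArrayList<>());
        buffer.add(eventBody);
        if (buffer.size() >= batchSize) {
            flushToBlob.accept(machineId, new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        BlobPumpAggregator agg = new BlobPumpAggregator(2000,
            (machineId, batch) -> System.out.println(machineId + ": flushing " + batch.size() + " events"));
        for (int i = 0; i < 2000; i++) {
            agg.onEvent("machine-1", "event-" + i);
        }
        // prints: machine-1: flushing 2000 events
    }
}
```

Since all events for a given MachineID are confined to one partition, this map never has to be shared across BLOB PUMP instances - each instance only holds buffers for the machines hashed to its partitions.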

Sreeram Garlapati
  • First of all thanks a lot Sreeram for the detailed answer! I` – Zorik Jan 25 '18 at 22:22
  • (1) What do you think is the optimal number of partitions? 100? On the one hand I want to have as many as I can to partition the data better, but on the other hand I don't want the partitioner to work in front of too many partitions (higher network load). – Zorik Jan 25 '18 at 22:40
  • (2) Also I guess the partitioner will have to batch somehow too, to send batches to INTERIM EVENTHUB, also to reduce network load. do you think with my scale the partitioner will be able to keep up? – Zorik Jan 25 '18 at 22:41
  • (3) One more concern is scalability of the BLOB PUMP. We will have one role reading from each INTERIM EVENTHUB partition - so we cannot scale out, and the partition count cannot be changed dynamically, so I have nothing to do if I have an unexpected increase in traffic that I can't handle (only scale up maybe, but unlike scale out this is limited). – Zorik Jan 25 '18 at 22:41
  • 1
    1) looking at the load 250mbps 125k events - 250 partitions should be good. if you foresee load increase - account for it in partition count. 2) whether batch or not - n/w load can be same :) - as long as you keep the connections alive (re-using EventHubClient keeps connections alive). 3) good point. when you use partitionId - "skewed partitions" is a problem you will run into. However, for infinitely large number of partitions (like you have 1M machines) - this has fairly Rare chance. The only working solution I know of is to - split - by introducing another layer of RE-PARTITION EVENTHUB. – Sreeram Garlapati Jan 26 '18 at 00:51
  • Again thanks :) (1) Just making sure that I don't miss anything - 250 partitions means 250 VM instances of BLOB PUMP? For now we process the traffic with ~10 instances. Of course we know that we will have to increase that significantly, but do you think we can get along with fewer than 250? (2) Are you saying batching the events for the event hub does not provide any performance value? – Zorik Jan 26 '18 at 16:04
  • 1
    1) by 250 partitions - I meant 250 EventHubs partitions. You can map 1 Partition to 1 process where EPH is running - which is specific to your implementation :). 2) I never said that! Batching events for eventhubs doesn't reduce any network load (bytes transferred wouldn't be very different). Performance - gain will be see seen from EventHubs service implementation perspective. – Sreeram Garlapati Jan 26 '18 at 17:51
  • (1) Got it, thanks! (2) Will the performance gain be seen on the sender who sends those bytes? In our case it is the repartition logic. I'm a bit concerned about the repartition logic performance: each instance there consumes ungrouped data and will send to 250 different partitions. So I thought sending 1-by-1 would put high load on each VM in the repartition logic. If batching will not help I guess I prefer to keep it simple. Do you recommend sending 1-by-1? Also how many VM instances of repartition logic and how many partitions of the ingesting event hub do you recommend to start with for my traffic? – Zorik Jan 26 '18 at 22:18
  • (4) Also can you please share your thoughts about the trade-offs between the 2 in-memory aggregation logic implementations (the BLOB PUMP) you suggested? My instinct is to use EventProcessorHost and implement it myself, but this is mostly because I never used Stream Analytics before. What are the main advantages and disadvantages of Stream Analytics for this task? What would you recommend? Thanks again! – Zorik Jan 26 '18 at 22:40