tl;dr:
Introducing another EventHub - in between the original EventHub and the blob storage - which re-partitions the data per MachineID - is the best way to go.
In general, have one INGESTING EVENTHUB - which is just an entry point to your monitoring system. Use the EventHubClient.send(eventData_without_partitionKey) approach to send to this INGESTING EVENTHUB. This will let you send with very low latency and high availability - the event goes to whichever partition is currently available and taking less load.
--------------                     -------------            ----------
|            |   ==============    |           |    ======  |        |
| INGESTING  |   RE-PARTITION >    |  INTERIM  |    BLOB \  |  BLOB  |
| EVENTHUB   |   ==============    | EVENTHUB  |    PUMP /  |        |
|            |                     |           |    ======  ----------
--------------                     -------------
Most importantly, refrain from partitioning data directly on the ingesting EventHub, for these reasons:
- Highly available ingestion pipeline - not associating events with a partition will keep your ingestion pipeline highly available. Behind the scenes, we host each of your EventHubs partitions on a container. When you provide a PartitionKey on your EventData, that PartitionKey will be hashed to a specific partition. Now, the latency of the Send operation will be tied to that single partition's availability - events like a Windows OS upgrade or our service upgrade etc. could impact it. Instead, if you stick to EventHubClient.send(without_PartitionKey), we will route the EventData as soon as possible to an available partition - so your ingestion pipeline is guaranteed to stay highly available.
- Flexible data design - in distributed systems you will quite often need to re-partition data based on a different key. Be sure to measure the probability of this in your system :).
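To see why a PartitionKey pins sends to a single partition while keyless sends stay available, here is a toy sketch. This is not the actual Event Hubs hashing or routing algorithm - `partitionForKey`, the 16-partition count, and the round-robin stand-in are illustrative assumptions only:

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionRouting {
    static final int PARTITION_COUNT = 16;

    // Toy stand-in for the service-side PartitionKey hash: every send
    // carrying the same key resolves to the same partition, so those
    // sends stall whenever that one partition is briefly unavailable.
    static int partitionForKey(String partitionKey) {
        return Math.floorMod(partitionKey.hashCode(), PARTITION_COUNT);
    }

    public static void main(String[] args) {
        int p1 = partitionForKey("machine-42");
        int p2 = partitionForKey("machine-42");
        System.out.println(p1 == p2); // same partition every time: prints true

        // Keyless sends can go to whichever partition is healthy;
        // a simple round-robin stands in for "least loaded" here.
        Map<Integer, Integer> load = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            int p = i % PARTITION_COUNT;
            load.merge(p, 1, Integer::sum);
        }
        System.out.println(load.size()); // spread across all 16 partitions
    }
}
```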
Use the interim EventHub as the way to partition data. That is, in the RE-PARTITION module you are simply replaying the original stream to the INTERIM EVENTHUB - copying one property of the event into EventData.PARTITION_KEY, which was originally empty.
// pseudo-code: RE-PARTITION EVENTS
foreach (var eventData in receivedFromIngestingEventHub)
{
    var newEventData = clone(eventData);
    // the machineId property becomes the PartitionKey on the interim EventHub
    eventHubClient.send(newEventData, eventData.Properties.get("machineId"));
}
What this makes sure is that all EventData with a specific MachineID are available on one and only one EventHubs partition. You do not need to create 1M EventHubs partitions: each partition can hold a practically unlimited number of PartitionKeys. You could host this per-partition logic using EventProcessorHost, or use an Azure Stream Analytics job.
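To make the "one machine, one partition; many keys per partition" point concrete, here is a small simulation. Again, `partitionFor` is a stand-in for the service-side hash, not the real algorithm, and the key count is scaled down from 1M for speed:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RepartitionSketch {
    static final int PARTITION_COUNT = 32;

    // Stand-in for the service-side PartitionKey hash.
    static int partitionFor(String machineId) {
        return Math.floorMod(machineId.hashCode(), PARTITION_COUNT);
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> keysPerPartition = new HashMap<>();
        // 100k machineIds (1M would work the same way) all fit
        // in 32 partitions; each partition holds many keys, and
        // every event for a given machineId maps to exactly one partition.
        for (int machine = 0; machine < 100_000; machine++) {
            String machineId = "machine-" + machine;
            keysPerPartition
                .computeIfAbsent(partitionFor(machineId), p -> new HashSet<>())
                .add(machineId);
        }
        System.out.println(keysPerPartition.size()); // at most 32 partitions used
    }
}
```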
Also, this is your chance to filter and produce an optimal stream, one that is consumable by the downstream processing pipeline.
In the BLOB PUMP module (your downstream processing pipeline), when you consume events from a specific partition of the INTERIM EVENTHUB, you are now guaranteed to have all events from a specific machineId on that partition. Aggregate the events to your required size (2k) per PartitionKey (machineId). The events for a given machine will not arrive contiguously, so you will need to build in-memory aggregation logic for this (using EventProcessorHost or an Azure Stream Analytics job).
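The in-memory aggregation could be sketched like this. `MachineAggregator`, the 2048-byte flush threshold, and the in-memory `flushedChunks` list are hypothetical stand-ins for the blob-write step, not Azure SDK types:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Buffers event payloads per machineId and flushes a blob-sized chunk
// once ~2k bytes have accumulated for that machine.
public class MachineAggregator {
    static final int FLUSH_SIZE_BYTES = 2048;

    private final Map<String, List<byte[]>> buffers = new HashMap<>();
    private final Map<String, Integer> bufferedBytes = new HashMap<>();
    final List<byte[]> flushedChunks = new ArrayList<>(); // stand-in for blob writes

    void onEvent(String machineId, byte[] payload) {
        buffers.computeIfAbsent(machineId, k -> new ArrayList<>()).add(payload);
        int size = bufferedBytes.merge(machineId, payload.length, Integer::sum);
        if (size >= FLUSH_SIZE_BYTES) {
            flush(machineId);
        }
    }

    private void flush(String machineId) {
        List<byte[]> chunks = buffers.remove(machineId);
        int size = bufferedBytes.remove(machineId);
        byte[] blob = new byte[size];
        int offset = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, blob, offset, c.length);
            offset += c.length;
        }
        flushedChunks.add(blob); // in real code: write the chunk to blob storage
    }

    public static void main(String[] args) {
        MachineAggregator agg = new MachineAggregator();
        for (int i = 0; i < 10; i++) {
            agg.onEvent("machine-7", new byte[512]); // ten 512-byte payloads
        }
        // 10 * 512 = 5120 bytes: flushes at 2048 and 4096, 1024 left buffered.
        System.out.println(agg.flushedChunks.size()); // prints 2
    }
}
```

In an EventProcessorHost-based pump, `onEvent` would be called from the per-partition event handler, and a time-based flush would also be needed so a quiet machine's tail data is not held in memory forever.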