
I am passing a high number of events (more than 1,000/second) from multiple sensors to a single event hub. While passing data from the sensors to the event hub I don't have access to the sensor ID, so I can only use one partition, as event ordering is essential. The output of the event hub goes to Stream Analytics, which then saves the data to Cosmos DB.

Event Hub(single partition) -> Stream Analytics -> CosmosDB

The issue is that as the number of requests increases, the latency increases as well. I was thinking of using an intermediate event hub where I could set a partition key.

Event Hub(multiple partition) -> Stream Analytics -> Event Hub with Partition Key -> Stream Analytics -> CosmosDB

My concerns are:

Will event ordering be maintained in the intermediate event hub?

Is there a performance benefit with that architecture?

I also need to update the UI on the website and mobile app. Do I use the Cosmos DB change feed or SignalR as the output of Stream Analytics?
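For illustration, here is a minimal sketch of the change-feed option: a function triggered by the Cosmos DB change feed that fans documents out through the SignalR Service output binding. It assumes the Functions v2 Cosmos DB and SignalR Service extensions; the hub name, lease collection, and `sensorUpdates` client target are hypothetical:

    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.Documents;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.SignalRService;

    public static class BroadcastSensorData
    {
        // Fires when documents are inserted or updated in the monitored
        // container, then pushes them to all connected SignalR clients.
        [FunctionName("BroadcastSensorData")]
        public static async Task Run(
            [CosmosDBTrigger(
                databaseName: "dbname",
                collectionName: "containername",
                ConnectionStringSetting = "CosmosDBConnection",
                LeaseCollectionName = "leases",
                CreateLeaseCollectionIfNotExists = true)] IReadOnlyList<Document> changes,
            [SignalR(HubName = "sensors")] IAsyncCollector<SignalRMessage> signalRMessages)
        {
            foreach (var doc in changes)
            {
                await signalRMessages.AddAsync(new SignalRMessage
                {
                    Target = "sensorUpdates", // client-side handler name (hypothetical)
                    Arguments = new object[] { doc }
                });
            }
        }
    }

Note that the change feed preserves order only within a partition key, which is worth keeping in mind given the ordering requirement above.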

So I tested the system by sending around 200 requests/second, using an Azure Function to send these requests to the event hub.

Function metrics: requests sent from the Azure Function to the event hub

Note: The event hub has 20 partitions and each event was sent with a partition key.
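For reference, a sketch of how a sender can attach a partition key with the Microsoft.Azure.EventHubs SDK; events that share a key are routed to the same partition in send order, so per-sensor ordering is preserved without pinning a single partition. The connection string and payload shape are assumptions:

    using System.Text;
    using System.Threading.Tasks;
    using Microsoft.Azure.EventHubs;

    public static class SensorSender
    {
        // The connection string is assumed to include the EntityPath of the hub.
        private static readonly EventHubClient Client =
            EventHubClient.CreateFromConnectionString("<event-hub-connection-string>");

        public static Task SendAsync(string sensorId, string jsonPayload)
        {
            var eventData = new EventData(Encoding.UTF8.GetBytes(jsonPayload));

            // All events with the same partition key land in the same partition,
            // which is what keeps per-sensor ordering.
            return Client.SendAsync(eventData, partitionKey: sensorId);
        }
    }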

I used another Azure Function to read the data off the event hub. Initially I tested only by logging the data count (without saving the data to Cosmos DB).

Note: I set maxBatchSize to 1 for data ordering. (I am not sure if I need to do this. If I increase this batch size, will I still maintain data ordering?)
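For context, `maxBatchSize` is configured in host.json. Since each trigger invocation receives a batch drawn from a single partition, a larger batch still arrives in partition order. A sketch assuming the Functions v2 host.json schema (the values shown are illustrative, not recommendations):

    {
      "version": "2.0",
      "extensions": {
        "eventHubs": {
          "eventProcessorOptions": {
            "maxBatchSize": 64,
            "prefetchCount": 256
          }
        }
      }
    }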

I can see that this function was able to read the data off the event hub at the same speed at which it was being written.

Function metrics: Azure Function reading the data

However, once I added the code to save the data to the database, performance decreased significantly.

Note: Cosmos DB throughput was set to 15,000 RU/s.

Function metrics: the function only achieved around 20 req/s

I believe there is something wrong with my code. Here is the function I am using:

    [FunctionName("ProcessStreamData")]
    public static async Task Run(
        [EventHubTrigger("eventhub-name", Connection = "EventHubsConnection")] EventData[] podStreamData,
        [CosmosDB(
            databaseName: "dbname",
            collectionName: "containername",
            ConnectionStringSetting = "CosmosDBConnection")] IAsyncCollector<SensorData> PodStreamDataOut,
        ILogger log)
    {
        var exceptions = new List<Exception>();

        foreach (EventData eventData in podStreamData)
        {
            try
            {
                var messageBody = Encoding.UTF8.GetString(eventData.Body.Array, eventData.Body.Offset, eventData.Body.Count);
                var allData = JsonConvert.DeserializeObject<List<SensorData>>(messageBody);

                // One EventData can contain readings from several sensors, so loop
                // over each reading and set a dynamic partition key and TTL for Cosmos DB.
                foreach (SensorData data in allData)
                {
                    data.partitionKey = $"{data.mac}-{DateTime.UtcNow:yyyy-MM}";
                    data.ttl = 60 * 60 * 24 * 60; // 60 days in seconds
                    data.timestamp = DateTime.UtcNow;
                    await PodStreamDataOut.AddAsync(data);
                }
            }
            catch (Exception e)
            {
                exceptions.Add(e);
            }
        }

        // Surface anything that failed so the Functions runtime records the error.
        if (exceptions.Count > 1)
            throw new AggregateException(exceptions);
        if (exceptions.Count == 1)
            throw exceptions[0];
    }
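One likely reason for the slowdown is that each `AddAsync` is awaited one document at a time, so the Cosmos DB writes run strictly sequentially. A hedged sketch of a helper that could replace the inner loop above, queuing the writes and awaiting them together; note this gives up strict write order within the batch, which may be acceptable given the first comment below:

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;

    // Sketch: run the per-document writes concurrently instead of one by one.
    private static Task WriteBatchAsync(
        List<SensorData> allData, IAsyncCollector<SensorData> collector)
    {
        var writeTasks = new List<Task>();

        foreach (SensorData data in allData)
        {
            data.partitionKey = $"{data.mac}-{DateTime.UtcNow:yyyy-MM}";
            data.ttl = 60 * 60 * 24 * 60; // 60 days in seconds
            data.timestamp = DateTime.UtcNow;

            // Queue the write instead of awaiting each one sequentially.
            writeTasks.Add(collector.AddAsync(data));
        }

        // The writes now run concurrently; Cosmos DB throttling (429s) will
        // surface here as exceptions if the provisioned RU/s is exceeded.
        return Task.WhenAll(writeTasks);
    }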
  • Can you please clarify why ordering matters once the data lands in Cosmos DB? Cosmos DB has physical partitions, and ordering across these partitions will not be maintained even if there were just one partition. – Vignesh Chandramohan Dec 16 '19 at 19:40
  • `I used maxBatchSize to 1 for data ordering` will decrease performance for sure. You can leave the default and it will keep ordering, because all events in a batch will be from the same partition. – wolszakp Dec 19 '19 at 14:42
  • @wolszakp With the default maxBatchSize, the maximum number of EventData I got was 10. Is there anything wrong with the code `await PodStreamDataOut.AddAsync(data)`? This line seems to slow down the whole process. – Nabin Shahukhal Dec 19 '19 at 23:22

1 Answer


I am assuming that you need in-order processing of events from a single sensor.

You can send the data with a publisher ID; events from a single sensor will then land in the same partition.

We are using this architecture:

Event Hub (multiple partitions, send with publisher ID) -> Azure Function -> CosmosDB

With this architecture, in-order processing works fine.

You can find the details of how to do it with Functions here: In-order event processing with Azure Functions.

wolszakp
  • Yes, the above architecture works fine for in-order processing. However, when I was doing a load test (with JMeter: 1,000 requests/sec), I saw that data was stored with very high latency. It took more than 5 minutes to store the data in Cosmos DB after all the requests were sent out. – Nabin Shahukhal Dec 16 '19 at 22:03
  • We ran load tests with 10,000 requests/sec, and they passed. However, you need to take into consideration that our function is on the Consumption plan. Another thing is that to reach full speed you need a warm-up. At the highest workload we had 33 Azure Functions (1 per partition) reading from the Event Hub. On the other hand, our solution uses a different storage than Cosmos DB. I would recommend looking at the Cosmos DB pricing plan; I suspect it can be the bottleneck. – wolszakp Dec 17 '19 at 12:29
  • @NabinShahukhal If my answer resolves your issue, please mark it. – wolszakp Dec 18 '19 at 08:39
  • Sorry for responding late. I did some tests and found that the event hub's incoming and outgoing message rates were fast enough, and the Azure Function was able to scale to a higher number of requests. However, while saving data to Cosmos, the function wasn't able to scale as required. I updated my question with the metrics of my test. Please have a look. @wolszakp – Nabin Shahukhal Dec 19 '19 at 06:06