I have been running a Spark Structured Streaming job in Python on Databricks that reads messages from an Azure IoT Hub. However, I noticed that when we receive a large number of frames, the job slows down and we accumulate latency, even though the metrics show that CPU and memory are not being used at 100% of their capacity.
IoT Hub, like Event Hubs, has its own throughput limits based on the provisioned capacity, so you can't read more than X MB/sec or N messages/sec regardless of cluster size.

Also, keep in mind that the EventHubs connector maps Event Hubs partitions 1:1 onto Spark partitions, so if your Event Hubs/IoT Hub has fewer partitions than you have Spark cores, not all cores are used. As an alternative, consider using the Kafka connector to read from Event Hubs/IoT Hub, since it allows more Spark partitions than the hub has physical partitions (see the minPartitions option of the Kafka connector, and the sketch below).
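For reference, here is a minimal sketch of reading the IoT Hub's built-in Event Hubs-compatible endpoint through Spark's Kafka connector. The namespace, hub name, secret scope, and the value 32 are placeholders you'd adapt; the Kafka-compatible endpoint listens on port 9093 and authenticates via SASL PLAIN, using the literal username `$ConnectionString` and the connection string as the password:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder values - substitute your IoT Hub's Event Hubs-compatible
# endpoint, hub name, and connection string (ideally from a secret scope).
bootstrap_servers = "<namespace>.servicebus.windows.net:9093"
topic = "<event-hub-compatible-name>"
connection_string = dbutils.secrets.get("<scope>", "<iothub-conn-str-key>")

# The Event Hubs Kafka endpoint uses SASL PLAIN with the literal user
# name "$ConnectionString" and the connection string as the password.
eh_sasl = (
    "org.apache.kafka.common.security.plain.PlainLoginModule required "
    f'username="$ConnectionString" password="{connection_string}";'
)

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("subscribe", topic)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", eh_sasl)
    # Ask Spark to split the input into at least 32 partitions per
    # micro-batch even if the hub has fewer physical partitions, so
    # that all cores can participate in processing.
    .option("minPartitions", "32")
    .load()
)
```

Note that minPartitions only subdivides the fetched offset ranges on the Spark side; it does not raise the hub's own throughput cap, so the provisioned capacity still limits the total MB/sec you can read.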

Alex Ott