
I am redesigning a real-time prediction pipeline over streaming IoT sensor data. The pipeline ingests sensor samples, structured as (sensor_id, timestamp, sample_index, value), as they are created in the source system, saves them locally, and runs PySpark batch jobs for training algorithms and making predictions.

Currently, sensor data is saved to local files on disk, one file per sensor, and to HDFS for Spark Streaming. The streaming job picks up each microbatch, counts how many samples arrived for each sensor, and decides which sensors have accumulated enough new data to warrant a new prediction. It then maps each sensor row in the RDD to a method that opens the sensor's data file with Python's open(), scans to the last processed sample, picks up the data from that sample onwards plus some history required for the prediction, and runs the prediction job on the Spark cluster. In addition, every fixed number of samples each algorithm requires a refit, which queries a long history from the same data store and also runs on the Spark cluster.
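
For illustration, here is a minimal sketch of that per-sensor flow, written as a Structured Streaming foreachBatch handler for concreteness (the same shape applies to DStreams). DATA_DIR, SAMPLES_THRESHOLD, HISTORY_LEN, the file format and the last_processed store are placeholders I'm assuming, not our actual implementation; the point is that every ready sensor triggers its own open()/scan of a flat file:

```python
# Minimal sketch of the current per-sensor flow (placeholder names, not the real code).
import os

DATA_DIR = "/data/sensors"       # assumed layout: one flat file per sensor
SAMPLES_THRESHOLD = 50           # "enough new samples" (usually 50-100, see comments below)
HISTORY_LEN = 200                # extra history the prediction needs (assumed value)

last_processed = {}              # sensor_id -> last processed sample index (placeholder store)

def read_sensor_window(sensor_id, last_processed_index):
    """Open the sensor's file, skip to the last processed sample, and return
    the required history plus all newer samples as an ordered list."""
    path = os.path.join(DATA_DIR, f"{sensor_id}.csv")
    with open(path) as f:        # this per-sensor open()/scan is the hot spot
        values = [float(line.rstrip().split(",")[-1]) for line in f]
    start = max(0, last_processed_index - HISTORY_LEN)
    return values[start:]

def process_microbatch(batch_df, batch_id):
    # batch_df holds this microbatch's (sensor_id, timestamp, sample_index, value) rows.
    counts = batch_df.groupBy("sensor_id").count()
    ready = counts.filter(counts["count"] >= SAMPLES_THRESHOLD)   # accumulation logic elided
    sensor_rdd = ready.rdd.map(
        lambda row: (row["sensor_id"],
                     read_sensor_window(row["sensor_id"],
                                        last_processed.get(row["sensor_id"], 0))))
    return sensor_rdd            # handed to the existing prediction job on the cluster

# Attached roughly like: stream.writeStream.foreachBatch(process_microbatch).start()
```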

Finally, the RDD that is processed by the prediction job looks like this:

|-----------|-----------------|
| sensor_id | sensor_data     |
|-----------|-----------------|
| SENSOR_0  | [13,52,43,54,5] |
| SENSOR_1  | [22,42,23,3,35] |
| SENSOR_2  | [43,2,53,64,42] |
|-----------|-----------------|
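
In code, that is just a pair RDD of sensor ID and an ordered list of samples (the values below are the illustrative ones from the table; spark is the active SparkSession):

```python
# Illustrative only -- the shape of the RDD handed to the prediction job.
sensor_rdd = spark.sparkContext.parallelize([
    ("SENSOR_0", [13, 52, 43, 54, 5]),
    ("SENSOR_1", [22, 42, 23, 3, 35]),
    ("SENSOR_2", [43, 2, 53, 64, 42]),
])
```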

We are now encountering a problem of scale when monitoring a few hundred thousand sensors. The most costly operation in the process is reading data from files: a few dozen milliseconds of latency per file read accumulates into unmanageable latency for the entire prediction job. Moreover, storing the data as flat files on disk does not scale.

We are looking into changing the storage method to improve performance and scalability. Time-series databases (we tried TimescaleDB and InfluxDB) pose a problem: querying the data for all sensors in one query, when each sensor needs to be read from a different point in time, and then grouping the individual samples into the sensor_data column shown above is very costly, causes a lot of shuffles, and even underperforms the flat-file solution. We are also trying Parquet files, but their write-once behavior makes it difficult to design a data layout that performs well in this case.
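
For reference, this is roughly the pattern we ended up with against a time-series DB (e.g. TimescaleDB read over JDBC): join the samples against a per-sensor cutoff, then rebuild the ordered array with collect_list. The table and column names, JDBC URL, and cutoff values are placeholders, not our real schema, and fetching the extra history before each cutoff is elided; the groupBy/collect_list at the end is where the shuffles come from:

```python
# Rough sketch of the time-series-DB pattern we tried (placeholder schema and URL).
from pyspark.sql import functions as F

# Per-sensor cutoffs: each sensor must be read from its own point (illustrative values).
cutoffs = spark.createDataFrame(
    [("SENSOR_0", 11950), ("SENSOR_1", 20310)],
    ["sensor_id", "last_index"],
)

samples = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://tsdb-host:5432/metrics")  # placeholder
           .option("dbtable", "sensor_samples")                        # placeholder
           .load())

# Join so each sensor starts at its own cutoff, then rebuild the ordered array.
# The groupBy/collect_list step forces a wide shuffle of individual samples.
windowed = (samples
            .join(cutoffs, "sensor_id")
            .where(F.col("sample_index") > F.col("last_index"))
            .groupBy("sensor_id")
            .agg(F.sort_array(F.collect_list(F.struct("sample_index", "value")))
                  .alias("ordered"))
            .select("sensor_id", F.col("ordered.value").alias("sensor_data")))
```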

tl;dr - I am looking for a performant architecture for the following scenario:

  1. Streaming sensor data is ingested in real time.
  2. When a sensor accumulates enough samples, its current + historic data is queried and sent to a prediction job.
  3. Each prediction job handles all sensors that reached the threshold in the last microbatch.
  4. The RDD contains rows of sensor ID and an ordered array of all queried samples.

Comments:
  • How many samples qualify for each microbatch for new predictions? And how long does the prediction process take? Can the prediction process take longer than microbatches? Combined from these 2, can you tell how many files are open at any time? – xenodevil Feb 12 '20 at 08:34
  • 1. The number of samples needed for a new prediction is configurable, usually 50-100. 2. The prediction process _compute_ time is minimal; opening and reading each file takes much longer than computing the prediction on its data. 3. The prediction process must never take longer than a microbatch. 4. We can have tens of thousands of files open at any one time. – Eliaz Feb 12 '20 at 09:16
