How to calculate the sum of values using PySpark with Kafka and Spark Streaming

Question

Currently I receive 4 or more vehicle IoT sensor data records every second, and for simplicity's sake I would like to start by adding up the 4 velocity readings. Most of the code examples I have found produce counts, which I can already do, but how would I simply add 4 or more separate lines of velocity values? Right now, the output shows a 1-second timestamp with the 4 extracted velocity values.


from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import pyspark.sql.functions as sf
from pyspark.sql.functions import udf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

conf = SparkConf().setAppName("rjws-sparkstreams")

#Pauses for Context Load
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")

ssc = StreamingContext(sc, 1)
kafkaStream = KafkaUtils.createStream(ssc, '172.16.10.1:2181', 'spark-streaming', {'vehicle_events':1})



#Presents JSON formatted data
KafkaStream_json = kafkaStream.map(lambda x: json.loads(x[1]))

#Parses the Velocity column of data
velocity_dstream = KafkaStream_json.map(lambda vehicle_events_fast_testdata: vehicle_events_fast_testdata["velocity"])
velocity_readings = velocity_dstream.countByValue()
top_reads = velocity_readings.transform(lambda rdd:sc.parallelize(rdd.take(50)))

ssc.start()
ssc.awaitTermination()

I have attempted adding the following code too:

total = 0
def velParse(vehicle_events_fast_testdata):
    total = sum(vehicle_events_fast_testdata["velocity"]) + (total)
    return vehicle_events_fast_testdata["velocity"]

velocity_dstream = KafkaStream_json.map(lambda vehicle_events_fast_testdata: velParse(vehicle_events_fast_testdata))

However, this does not calculate the sum of the velocity readings; it fails with an error stating that the item is not iterable. Thanks.
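For reference, a likely reading of the "not iterable" error is that each record's "velocity" is a single number, so calling sum() on it fails; the sum has to be taken across the records of a batch instead. The sketch below reproduces this in plain Python, without Spark, under the assumption that "velocity" is numeric. The sample payloads and the parse_velocity helper are hypothetical. On the stream itself, the analogous per-batch step would presumably be velocity_dstream.reduce(lambda a, b: a + b) followed by pprint().

```python
import json
from functools import reduce

# Hypothetical sample: four Kafka message payloads from one 1-second batch.
batch = [
    '{"vehicle_id": "a", "velocity": 10.5}',
    '{"vehicle_id": "b", "velocity": 20.0}',
    '{"vehicle_id": "c", "velocity": 30.25}',
    '{"vehicle_id": "d", "velocity": 39.25}',
]


def parse_velocity(raw):
    """Return the velocity of one JSON record as a float (assumed numeric)."""
    return float(json.loads(raw)["velocity"])


velocities = [parse_velocity(message) for message in batch]

# Calling sum() on a single reading fails: a number is not iterable.
try:
    sum(velocities[0])
except TypeError as exc:
    error_message = str(exc)

# Combine the readings pairwise, the same way a reduce over one batch would.
batch_total = reduce(lambda a, b: a + b, velocities)
print(batch_total)
```

A module-level total updated inside a map function would not accumulate on a cluster either, since the function runs on the executors rather than the driver, which is why a reduce over the batch is used here.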
