In the snippet below I try to transform a DStream of temperatures (received from Kafka) into a pandas DataFrame.
from pyspark.sql import Row

def main_process(time, dStream):
    print("========= %s =========" % str(time))
    try:
        # Get the singleton instance of SparkSession
        spark = getSparkSessionInstance(dStream.context.getConf())
        # Convert RDD[String] to RDD[Row] to DataFrame
        rowRdd = dStream.map(lambda t: Row(Temperatures=t))
        df = spark.createDataFrame(rowRdd)
        df.show()
        print("The mean is: %m" % df.mean())  # <-- this line never prints the mean
    except Exception as e:
        print("Error: %s" % str(e))
As is, the mean is never calculated, which I suppose is because "df" is not a pandas DataFrame (?).
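If it helps, my understanding from the API docs is that a Spark DataFrame has no pandas-style .mean(); an aggregate would be expressed through agg() instead, roughly like the sketch below (the column name Temperatures comes from my Row above, and I cast because the values arrive from Kafka as strings):

from pyspark.sql import functions as F

# Aggregate on the Spark DataFrame itself, no pandas involved;
# collect the single-row result back to the driver to print it
mean_row = df.agg(F.mean(F.col("Temperatures").cast("double")).alias("mean_temp")).collect()[0]
print("The mean is: %s" % mean_row["mean_temp"])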
I tried using df = spark.createDataFrame(df.toPandas()) following the relevant documentation, but the interpreter doesn't recognize "toPandas()" and the transformation never takes place.
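What I was actually trying to express is something like the sketch below: convert the Spark DataFrame to a local pandas DataFrame on the driver and let pandas compute the mean (this assumes pandas is installed where the driver runs):

# Convert to a local pandas DataFrame on the driver, then use pandas' own
# mean(); cast to float first since the values arrive as strings
pdf = df.toPandas()
print("The mean is: %s" % pdf["Temperatures"].astype(float).mean())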
Am I on the right path, and if so, how should I apply the transformation?
Or is my approach wrong altogether, and do I need to handle the DStream in a different way?
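For context, main_process is applied per batch via foreachRDD, roughly like this (the topic name and broker address are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="TemperatureStream")
ssc = StreamingContext(sc, 10)  # 10-second batches

kafkaStream = KafkaUtils.createDirectStream(
    ssc, ["temperatures"], {"metadata.broker.list": "localhost:9092"})

# Kafka messages arrive as (key, value) pairs; keep only the value.
# foreachRDD passes (batch time, RDD) to a two-argument function, so
# inside main_process the "dStream" parameter is actually an RDD.
kafkaStream.map(lambda kv: kv[1]).foreachRDD(main_process)

ssc.start()
ssc.awaitTermination()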
Thank you in advance!