
In the snippet below I am trying to transform a DStream of temperatures (received from Kafka) into a pandas DataFrame.

from pyspark.sql import Row

def main_process(time, dStream):
    print("========= %s =========" % str(time))

    try:
        # Get the singleton instance of SparkSession
        spark = getSparkSessionInstance(dStream.context.getConf())

        # Convert RDD[String] to RDD[Row] to DataFrame
        rowRdd = dStream.map(lambda t: Row(Temperatures=t))

        df = spark.createDataFrame(rowRdd)

        df.show()

        # this line never actually prints a mean
        print("The mean is: %s" % df.mean())
    except Exception:
        pass

As it stands, the mean is never calculated, which I suppose is because "df" is not a pandas DataFrame (?).
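
From what I understand, the Spark DataFrame API itself should be able to do this aggregation without pandas; a minimal sketch of what I mean, assuming the values arriving from Kafka are strings and need a cast first:

    from pyspark.sql import functions as F

    # aggregate on the Spark side; values from Kafka arrive as strings,
    # so cast to double before averaging
    df.select(F.mean(df["Temperatures"].cast("double")).alias("mean_temp")).show()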

I tried using df = spark.createDataFrame(df.toPandas()) according to the relevant documentation, but the interpreter doesn't recognize "toPandas()" and the transformation never occurs.
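
For reference, my understanding of the documented route is roughly the sketch below, though I'm aware toPandas() collects every row to the driver:

    # converts the (distributed) Spark DataFrame into a local pandas DataFrame
    pdf = df.toPandas()
    print("The mean is: %s" % pdf["Temperatures"].astype(float).mean())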

Am I on the right path, and if so, how should I apply the transformation?

Or maybe my approach is wrong and I must handle the DStream in a different way?
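
In case the surrounding wiring matters, main_process is attached to the stream in the usual foreachRDD fashion, roughly like this (kafkaStream and ssc are placeholder names for my Kafka DStream and StreamingContext):

    # run main_process once per micro-batch;
    # foreachRDD passes (batch time, rdd) to a two-argument function
    kafkaStream.foreachRDD(main_process)

    ssc.start()
    ssc.awaitTermination()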

Thank you in advance!

HappyCane
  • Not a good idea. Why do you want that? –  Oct 23 '16 at 12:43
  • I need to feed the data from the flow to certain algorithms that employ relatively complex mathematical functions. So my first thought was to transform the DStream into something more flexible such as a dataframe. Should I stick with the RDD? – HappyCane Oct 23 '16 at 13:37
  • Spark `DataFrame` is OK (distributed). Pandas `DataFrame` is typically bad (not distributed, local to driver). –  Oct 23 '16 at 14:09
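
A minimal sketch of the distributed route the last comment points at: instead of collecting into pandas, wrap the per-value math in a UDF so it runs on the executors (complex_math below is a hypothetical stand-in for the algorithms mentioned above):

    from math import log
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # hypothetical stand-in for the "relatively complex mathematical functions"
    def complex_math(t):
        return log(float(t) + 1.0)

    # the UDF is applied row by row on the executors;
    # nothing is collected to the driver
    complex_math_udf = F.udf(complex_math, DoubleType())
    df = df.withColumn("Transformed", complex_math_udf(df["Temperatures"]))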

0 Answers