I am trying to translate to PySpark the Spark (Scala) implementation discussed in this blog:
However, I am running into problems because some of the methods on a Spark DataFrame either aren't available in PySpark or need extra conversions to work. I am specifically having trouble with this part:
var data_stream_cleaned = data_stream
.selectExpr("CAST(value AS STRING) as string_value")
.as[String]
.map(x => (x.split(";"))) //wrapped array
.map(x => tweet(x(0), x(1), x(2), x(3), x(4), x(5)))
.selectExpr( "cast(id as long) id", "CAST(created_at as timestamp) created_at", "cast(followers_count as int) followers_count", "location", "cast(favorite_count as int) favorite_count", "cast(retweet_count as int) retweet_count")
.toDF()
.filter(col("created_at").gt(current_date())) // kafka will retain data for last 24 hours, this is needed because we are using complete mode as output
.groupBy("location")
.agg(count("id"), sum("followers_count"), sum("favorite_count"), sum("retweet_count"))
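For reference, this is roughly the shape I am aiming for in PySpark using only the DataFrame API (just a sketch; the column positions and names such as followers_count simply mirror the tweet fields in the Scala code above, and I haven't been able to run it end to end yet):
from pyspark.sql import functions as F

# split the Kafka value on ";" and pull the fields out by position,
# mirroring the tweet(...) case class from the Scala version
parts = F.split(F.col("string_value"), ";")

ds = (data_stream
      .selectExpr("CAST(value AS STRING) as string_value")
      .withColumn("id", parts.getItem(0).cast("long"))
      .withColumn("created_at", parts.getItem(1).cast("timestamp"))
      .withColumn("followers_count", parts.getItem(2).cast("int"))
      .withColumn("location", parts.getItem(3))
      .withColumn("favorite_count", parts.getItem(4).cast("int"))
      .withColumn("retweet_count", parts.getItem(5).cast("int"))
      .filter(F.col("created_at") > F.current_date())
      .groupBy("location")
      .agg(F.count("id"), F.sum("followers_count"),
           F.sum("favorite_count"), F.sum("retweet_count")))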
How would you go about making this work? I have successfully connected to a Kafka stream. I'm just trying to aggregate the data so that I can load it to Redshift.
This is what I have so far:
ds = data_stream.selectExpr("CAST(value AS STRING) as string_value").rdd.map(lambda x: x.split(";"))
I get an error saying:
Queries with streaming sources must be executed with writeStream.start()
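In case it matters, this is roughly how I intend to start the query once the transformation works (a sketch; I'm using the console sink just for testing, and complete mode because of the aggregation):
query = (ds
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()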
What could be wrong? I'm not trying to query the data, just transform it. Any help would be greatly appreciated!