I am trying to translate to PySpark the Spark (Scala) implementation discussed in this blog:
However, I am running into problems because some of the methods on a Spark DataFrame either aren't available in PySpark or need extra conversions to work. I am specifically having trouble with this part:
var data_stream_cleaned = data_stream
.selectExpr("CAST(value AS STRING) as string_value")
.as[String]
.map(x => (x.split(";"))) //wrapped array
.map(x => tweet(x(0), x(1), x(2), x(3), x(4), x(5)))
.selectExpr( "cast(id as long) id", "CAST(created_at as timestamp) created_at", "cast(followers_count as int) followers_count", "location", "cast(favorite_count as int) favorite_count", "cast(retweet_count as int) retweet_count")
.toDF()
.filter(col("created_at").gt(current_date())) // kafka will retain data for last 24 hours, this is needed because we are using complete mode as output
.groupBy("location")
.agg(count("id"), sum("followers_count"), sum("favorite_count"), sum("retweet_count"))
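For reference, this is roughly the shape I am aiming for in PySpark using only the DataFrame API (just a sketch; the column positions and names such as followers_count simply mirror the tweet fields in the Scala code above, and I haven't been able to run it end to end yet):
from pyspark.sql import functions as F

# split the Kafka value on ";" and pull the fields out by position,
# mirroring the tweet(...) case class from the Scala version
parts = F.split(F.col("string_value"), ";")

ds = (data_stream
      .selectExpr("CAST(value AS STRING) as string_value")
      .withColumn("id", parts.getItem(0).cast("long"))
      .withColumn("created_at", parts.getItem(1).cast("timestamp"))
      .withColumn("followers_count", parts.getItem(2).cast("int"))
      .withColumn("location", parts.getItem(3))
      .withColumn("favorite_count", parts.getItem(4).cast("int"))
      .withColumn("retweet_count", parts.getItem(5).cast("int"))
      .filter(F.col("created_at") > F.current_date())
      .groupBy("location")
      .agg(F.count("id"), F.sum("followers_count"),
           F.sum("favorite_count"), F.sum("retweet_count")))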
How would you go about making this work? I have successfully connected to a Kafka stream. I'm just trying to aggregate the data so that I can load it to Redshift.
This is what I have so far:
ds = data_stream.selectExpr("CAST(value AS STRING) as string_value").rdd.map(lambda x: x.split(";"))
I get an error saying:
Queries with streaming sources must be executed with writeStream.start()
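In case it matters, this is roughly how I intend to start the query once the transformation works (a sketch; I'm using the console sink just for testing, and complete mode because of the aggregation):
query = (ds
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()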
What could be wrong? I'm not trying to query the data, just transform it. Any help would be greatly appreciated!