I am new to Structured Streaming and to PySpark. I have to use Spark Streaming, preferably the Structured Streaming engine: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html. Using the streaming dataset provided, I need to apply the exponentially decaying window approach to keep smoothed counts of occurring events, inspect the stream at one-second intervals (to check which events occurred), and display the 5 most frequent events at a user-defined interval of t seconds.
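Just so the goal is clear, this is what I understand by "exponentially decayed counts" (a plain-Python sketch of the idea only; the decay rate alpha and all names here are mine, not from the assignment):

import math

# My understanding of exponentially decayed counts (my own sketch, not from the assignment):
# on every inspection of the stream, existing counts are multiplied by a factor < 1,
# and each event seen since the last inspection adds 1 to its count.
alpha = 0.1  # decay rate per second; a value I would choose myself

def decay_counts(counts, new_events, dt_seconds):
    """Decay all counts by exp(-alpha * dt_seconds), then add the newly observed events."""
    factor = math.exp(-alpha * dt_seconds)
    counts = {event: value * factor for event, value in counts.items()}
    for event in new_events:
        counts[event] = counts.get(event, 0.0) + 1.0
    return counts

If my understanding of the technique itself is wrong, please correct me on that too.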
Here's a brief preview of the streaming dataset provided, called stream-data.csv:
| timestamp | event |
|---|---|
| 2020-01-01 00:00:01.325 | A |
| 2020-01-01 00:00:01.817 | D |
| 2020-01-01 00:00:02.547 | C |
| 2020-01-01 00:00:04.548 | A |
| 2020-01-01 00:00:05.624 | A |
| 2020-01-01 00:00:07.239 | B |
| 2020-01-01 00:00:07.690 | E |
... (it has about 10 million rows)
Also, it is stated that I may use the provided code (‘simple socket server.py’) as a starting point to generate the data to be consumed by the streaming engine. Here is that script:
import socket
import time
# Define the host and port
HOST = 'localhost'
PORT = 9999
# Create a socket object
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Bind the socket to the host and port
server_socket.bind((HOST, PORT))
# Listen for incoming connections
server_socket.listen(1)
print('Server is listening on {}:{}'.format(HOST, PORT))
# Accept a connection from a client
client_socket, client_address = server_socket.accept()
print('Accepted connection from {}:{}'.format(client_address[0], client_address[1]))
# Send data to client
while True:
    data = input()
    client_socket.send((data + "\n").encode("utf-8"))
# Close the connection
client_socket.close()
server_socket.close()
I changed the server so that it reads the dataset I'm supposed to stream, like this:
import socket
import time
# Define the host and port
HOST = 'localhost'
PORT = 9999
# Create a socket object
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Bind the socket to the host and port
server_socket.bind((HOST, PORT))
# Listen for incoming connections
server_socket.listen(1)
print('Server is listening on {}:{}'.format(HOST, PORT))
# Accept a connection from a client
client_socket, client_address = server_socket.accept()
print('Accepted connection from {}:{}'.format(client_address[0], client_address[1]))
# Open the dataset file
with open('stream-data.csv', 'r') as f:
    # Send data to the client, one line at a time
    for line in f:
        # each line read from the file already ends with "\n", so strip it before re-adding one
        client_socket.send((line.rstrip("\n") + "\n").encode("utf-8"))
        time.sleep(0.01)  # delay to simulate streaming data
# Close the connection
client_socket.close()
server_socket.close()
Now I need to move to the client side to solve the rest of the exercise. This is what I did:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number, from_csv, window, exp, log, sum, desc, to_timestamp, lit, from_unixtime, unix_timestamp
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
# Create a Spark session
spark = SparkSession.builder.getOrCreate()
# Define schema of the data
schema = "timestamp STRING, event STRING"
# Read stream from the socket
df = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()
# Parse the CSV data
df = df.selectExpr("CAST(value AS STRING)").select(from_csv("value", schema).alias("data")).select("data.*")
# Convert the timestamp
df = df.withColumn("timestamp",
to_timestamp(col("timestamp"),
"yyyy-MM-dd HH:mm:ss.SSS"))
# Convert the timestamp to string in 'yyyy/MM/dd HH:mm:ss' format
df = df.withColumn("timestamp_string", from_unixtime(unix_timestamp(col("timestamp")),'yyyy/MM/dd HH:mm:ss'))
# Define the window
window = Window.orderBy(col('timestamp').cast('long')).rangeBetween(Window.unboundedPreceding, 0)
# Define the decay constant
decay_constant = 10**-9
# Calculate the decayed value for each row
df = df.withColumn("decayed_value", exp(log(lit(1.0 - decay_constant)) * col("timestamp").cast('long')))
# Calculate the decayed sum
df = df.withColumn("decay_sum", sum("decayed_value").over(window))
# Group by event and sum the decayed values
df_grouped = df.groupBy("event").agg(sum("decayed_value").alias("total_decayed_value"))
# Sort the events by the total decayed value and get the top 5 events
df_top_events = df_grouped.orderBy(desc("total_decayed_value")).limit(5)
# Define the output query
query = df_top_events.writeStream.outputMode("complete").format("console").trigger(processingTime='1 second').start()
# Wait for the termination of the query
query.awaitTermination()
However, I'm getting an "AnalysisException" error:
"Non-time-based windows are not supported on streaming DataFrames/Datasets;\nWindow [sum(decayed_value#17) windowspecdefinition..."
I also don't know whether this is the best approach overall. As I said, I'm very new to both PySpark and Structured Streaming, and I haven't found a way to solve this even though I've been trying for weeks. Can someone please help me? I would really appreciate it!