
I am new to Structured Streaming and to PySpark. For an exercise I have to use Spark Streaming, preferably the Structured Streaming engine: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html. Using the streaming dataset provided, I need to apply the exponentially decaying window approach to keep smoothed counts of the occurring events: inspect the stream at one-second intervals (to check which events occurred) and display the 5 most frequent events at (user-defined) intervals of t seconds.
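
To check my own understanding of what "exponentially decaying counts" should produce, I wrote a tiny plain-Python (non-streaming) sketch; the 0.99 per-second decay factor is just a value I picked for illustration, it isn't given in the exercise:

from collections import defaultdict

decay_factor = 0.99          # per-second decay factor, chosen arbitrarily for this sketch
counts = defaultdict(float)  # smoothed count per event
last_ts = None               # timestamp (in seconds) of the previously seen event

def update(ts, event):
    """Decay all counts by the elapsed time, then add 1 for the new event."""
    global last_ts
    if last_ts is not None:
        factor = decay_factor ** (ts - last_ts)
        for k in counts:
            counts[k] *= factor
    counts[event] += 1.0
    last_ts = ts

# Feed a few of the sample rows (using only the seconds part of the timestamps)
for ts, ev in [(1.325, "A"), (1.817, "D"), (2.547, "C"), (4.548, "A"), (5.624, "A")]:
    update(ts, ev)

# Top 5 events by smoothed count
print(sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:5])

Is this roughly the behaviour I should be reproducing with Structured Streaming?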

Here's a brief preview of the streaming dataset provided, called stream-data.csv:

timestamp event
2020-01-01 00:00:01.325 A
2020-01-01 00:00:01.817 D
2020-01-01 00:00:02.547 C
2020-01-01 00:00:04.548 A
2020-01-01 00:00:05.624 A
2020-01-01 00:00:07.239 B
2020-01-01 00:00:07.690 E

... (it has about 10 million rows)

Also, it is stated that I may use the code provided ('simple socket server.py') as a starting point to generate the data to be consumed by the streaming engine. Here's simple_socket.py:


import socket
import time

# Define the host and port
HOST = 'localhost'
PORT = 9999

# Create a socket object
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Bind the socket to the host and port
server_socket.bind((HOST, PORT))

# Listen for incoming connections
server_socket.listen(1)
print('Server is listening on {}:{}'.format(HOST, PORT))

# Accept a connection from a client
client_socket, client_address = server_socket.accept()
print('Accepted connection from {}:{}'.format(client_address[0], client_address[1]))

# Send data to client
while True:
    data = input()
    client_socket.send((data + "\n").encode("utf-8"))


# Close the connection
client_socket.close()
server_socket.close()

The provided server just forwards whatever is typed on its stdin to the connected client, so I changed it to read the dataset I'm supposed to stream instead:

import socket
import time

# Define the host and port
HOST = 'localhost'
PORT = 9999

# Create a socket object
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Bind the socket to the host and port
server_socket.bind((HOST, PORT))

# Listen for incoming connections
server_socket.listen(1)
print('Server is listening on {}:{}'.format(HOST, PORT))

# Accept a connection from a client
client_socket, client_address = server_socket.accept()
print('Accepted connection from {}:{}'.format(client_address[0], client_address[1]))

# Open the dataset file
with open('stream-data.csv', 'r') as f:
    next(f)  # skip the header row so it isn't parsed as an event
    # Send data to client
    for line in f:
        client_socket.send(line.encode("utf-8"))  # each line already ends with "\n"
        time.sleep(0.01)  # delay to simulate streaming data

# Close the connection
client_socket.close()
server_socket.close()
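
Before pointing Spark at the socket, I sanity-check that the server is actually emitting rows with a throwaway client like this (not part of the assignment, just for debugging):

import socket

# Connect to the running server and print the first 5 lines it sends
with socket.create_connection(("localhost", 9999)) as s:
    buf = b""
    while buf.count(b"\n") < 5:
        chunk = s.recv(1024)
        if not chunk:  # server closed the connection
            break
        buf += chunk
    for line in buf.decode("utf-8").splitlines()[:5]:
        print(line)

This prints the first rows of stream-data.csv as expected, so the server side seems to work.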

Now I need to work on the consumer side, i.e. the Spark job, to solve the rest of the exercise. This is what I have:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number, from_csv, window, exp, log, sum, desc, to_timestamp, lit, from_unixtime, unix_timestamp
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Create a Spark session
spark = SparkSession.builder.getOrCreate()

# Define schema of the data
schema = "timestamp STRING, event STRING"

# Read stream from the socket
df = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Parse the CSV data
df = df.selectExpr("CAST(value AS STRING)").select(from_csv("value", schema).alias("data")).select("data.*")

# Convert the timestamp
df = df.withColumn("timestamp", 
                   to_timestamp(col("timestamp"), 
                                 "yyyy-MM-dd HH:mm:ss.SSS"))

# Convert the timestamp to string in 'yyyy/MM/dd HH:mm:ss' format
df = df.withColumn("timestamp_string", from_unixtime(unix_timestamp(col("timestamp")),'yyyy/MM/dd HH:mm:ss'))

# Define the window
window = Window.orderBy(col('timestamp').cast('long')).rangeBetween(Window.unboundedPreceding, 0)

# Define the decay constant
decay_constant = 10**-9

# Calculate the decayed value for each row
df = df.withColumn("decayed_value", exp(log(lit(1.0 - decay_constant)) * col("timestamp").cast('long')))

# Calculate the decayed sum
df = df.withColumn("decay_sum", sum("decayed_value").over(window))

# Group by event and sum the decayed values
df_grouped = df.groupBy("event").agg(sum("decayed_value").alias("total_decayed_value"))

# Sort the events by the total decayed value and get the top 5 events
df_top_events = df_grouped.orderBy(desc("total_decayed_value")).limit(5)

# Define the output query
query = df_top_events.writeStream.outputMode("complete").format("console").trigger(processingTime='1 second').start()

# Wait for the termination of the query
query.awaitTermination()

However, I'm getting an "AnalysisException" error:
"Non-time-based windows are not supported on streaming DataFrames/Datasets;\nWindow [sum(decayed_value#17) windowspecdefinition..."

I also don't know whether this is even the best approach. As I said, I'm very new to both PySpark and Structured Streaming, and I haven't found a way around this error; I've been trying for weeks now. Could someone please help me with this? I would really appreciate it!
