I would like to create a streaming pipeline which uses a WebSocket connection to ingest real-time trading events and then publish them to a Pub/Sub topic.
My original idea was to use a Cloud Function that is triggered frequently and publishes the payloads to Pub/Sub, after which a subscriber ingests the data into BigQuery. The problem with this approach is that Cloud Functions does not support WebSocket libraries. See this StackOverflow post for details.
As an alternative, I am considering building my pipeline in a container and deploying it either as a Cloud Run instance or on Kubernetes, similar to how it is done in this repository. What I am struggling with is how triggering will work: since data is constantly flowing, the pipeline needs to run continuously once deployed. This is how my event-listener (main.py) script looks:
import ssl
import time

import simplejson as json
import websocket
from google.api_core.exceptions import NotFound
from google.cloud import pubsub_v1
from google.pubsub_v1.types import Encoding
# Project and topic configuration (placeholder values)
project_id = "project_id"
topic_id = "pubsub_topic_id"

# Initialize the publisher client, point it at a regional endpoint and resolve the topic path
publisher_options = pubsub_v1.types.PublisherOptions(enable_message_ordering=False)
client_options = {"api_endpoint": "europe-west4-pubsub.googleapis.com:443"}
publisher_client = pubsub_v1.PublisherClient(publisher_options=publisher_options, client_options=client_options)
topic_path = publisher_client.topic_path(project_id, topic_id)
topic = publisher_client.get_topic(request={"topic": topic_path})
encoding = topic.schema_settings.encoding  # Schema encoding configured on the topic
# Establish connection to WebSocket stream
ws = websocket.WebSocket(sslopt={"cert_reqs": ssl.CERT_NONE})
ws.connect("wss://crypto.financialmodelingprep.com/")
# Define login, subscribe and unsubscribe events for WebSocket listener
login = {
    "event": "login",
    "data": {
        "apiKey": "api_key",
    },
}
subscribe = {
    "event": "subscribe",
    "data": {
        "ticker": "btcusd",
    },
}
unsubscribe = {
    "event": "unsubscribe",
    "data": {
        "ticker": "btcusd",
    },
}
ws.send(json.dumps(login)) # Send login request
time.sleep(1) # Wait for 1 second to prevent timeout
ws.send(json.dumps(subscribe)) # Subscribe listener
time.sleep(10) # Stream data for 10 seconds
ws.send(json.dumps(unsubscribe)) # Unsubscribe listener
# -- Function which fixes the improper UNIX timestamp (milliseconds -> seconds) -- #
def filterts(dct):
    if len(dct) > 3:  # Only price records carry a timestamp; event notifications are shorter
        dct["t"] = round(dct["t"] / 1000)
    return dct
while True:
    record = filterts(json.loads(ws.recv()))
    if len(record) <= 3:  # Skip event notifications (i.e., subscribed, unsubscribed)
        continue
    if encoding != Encoding.JSON:
        print(f"No JSON encoding specified in {topic_path}. Abort.")
        break
    try:
        data = json.dumps(record).encode("utf-8")  # Prepare the record for publishing
        print(f"Preparing a JSON-encoded message:\n{record}")
        future = publisher_client.publish(topic_path, data)  # Publish the JSON payload to the topic
        print(f"Published message ID: {future.result()}")
    except NotFound:
        print(f"{topic_id} not found.")
As you can see, I can configure how long events are streamed via the time.sleep(n) calls. The issue is that after one hour of listening the connection times out and the script exits with an error. How can I set up my app so that it automatically re-executes once the one-hour threshold is reached?
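For context, the workaround I have in mind is a small supervisor loop that restarts the listener whenever it dies. This is only a sketch: `listen_once` is a hypothetical wrapper around the connect/subscribe/publish logic above, and the retry parameters are placeholders.

```python
import time
import traceback


def run_with_restarts(listen_once, max_restarts=None, delay_seconds=5):
    """Keep re-running listen_once() so that the hourly WebSocket
    timeout triggers a fresh connection instead of a crash.

    listen_once   -- callable wrapping the connect/subscribe/publish logic
    max_restarts  -- give up after this many consecutive failures (None = retry forever)
    delay_seconds -- back-off before reconnecting
    """
    restarts = 0
    while True:
        try:
            listen_once()
            return  # the listener returned cleanly, so stop
        except Exception:
            traceback.print_exc()
            restarts += 1
            if max_restarts is not None and restarts > max_restarts:
                raise  # too many consecutive failures, surface the error
            time.sleep(delay_seconds)  # back off before reconnecting
```

The container's entrypoint would then just call `run_with_restarts(listen_once)`, so Cloud Run or Kubernetes would only need to keep a single always-on container alive rather than trigger anything. I am not sure whether this is the idiomatic approach for Cloud Run, hence the question.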
Please let me know if I need to clarify anything further.