My goal is to send/produce a txt file from my Windows PC to a container running Kafka, which is then consumed by PySpark (running in another container).
I'm using docker-compose, where I define a custom network and several containers: spark-master, two Spark workers, ZooKeeper and Kafka.
I had several problems with Kafka, Spark and Python version compatibility, so I decided to use the latest Bitnami image for each of them.
I created a topic called 'demo' and from the Kafka container I'm trying to send text to it (to be consumed by Spark) using:

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic demo
When I execute the Python code from the spark-master container with "spark-submit mycode.py", I get this error message:
Traceback (most recent call last):
  File "/src/structuredKafkaSpark.py", line 12, in <module>
    df = spark \
  File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 469, in load
  File "/opt/bitnami/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 196, in deco
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
I don't understand what I'm doing wrong; I'm following the Spark guide, link.
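From the error, I assume the "deployment section" means submitting the job together with the Kafka integration package, i.e. something along these lines (the version string is my guess, and not knowing which version matches the Bitnami 'latest' images is part of my confusion):

# Guessed version; it presumably has to match the Spark/Scala version inside bitnami/spark:latest
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 mycode.py

Is that what is missing, and if so, how do I pick the right version for these images?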
My final goal is to send a txt file from my PC to Kafka to then be consumed by Spark where I will do some map() transformations.
docker-compose file:
version: "3.7"
networks:
  datapipeline:
    driver: bridge

services:
  spark-master:
    build:
      context: ./spark
      dockerfile: ./Dockerfile
    container_name: "spark-master"
    environment:
      - SPARK_MODE=master
      - SPARK_LOCAL_IP=spark-master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - "7077:7077"
      - "8080:8080"
    volumes:
      - ./src:/src
      - ./data:/data
      - ./output:/output
    networks:
      - datapipeline

  spark-worker-1:
    image: docker.io/bitnami/spark:latest
    container_name: "spark-worker-1"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no

  spark-worker-2:
    image: docker.io/bitnami/spark:latest
    container_name: "spark-worker-2"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no

  # ----------------- #
  #   Apache Kafka    #
  # ----------------- #
  zookeeper:
    image: docker.io/bitnami/zookeeper:latest
    container_name: "zookeeper"
    ports:
      - "2181:2181"
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
    networks:
      - datapipeline

  kafka:
    image: docker.io/bitnami/kafka:latest
    container_name: "kafka"
    ports:
      - "9092:9092"
    environment:
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
    depends_on:
      - zookeeper
    volumes:
      - ./producer:/producer
    networks:
      - datapipeline
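For completeness, this is roughly how I bring everything up and create the 'demo' topic (the exact kafka-topics.sh flags are from memory, so treat them as approximate):

docker-compose up -d
# Run inside the Kafka container; Bitnami keeps the scripts under /opt/bitnami/kafka/bin
docker exec -it kafka /opt/bitnami/kafka/bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic demo --partitions 1 --replication-factor 1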
Python code run in spark-master:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()

# Create DataFrame representing the stream of input lines from Kafka
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "demo") \
    .load()

# Cast the Kafka key/value columns from binary to string
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Split the lines into words
words = df.select(
    explode(
        split(df.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()