My goal is to send/produce a text file from my Windows PC to a container running Kafka, to then be consumed by PySpark (running in another container). I'm using docker-compose, where I define a custom network and several containers: spark-master, two workers, ZooKeeper and Kafka. I had several problems with Kafka, Spark and Python version compatibility, so I decided to use the latest Bitnami image for each of them. I created a topic called 'demo' and from the Kafka container I'm trying to send text to Spark using: "bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic demo"
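
For completeness, the topic was created with the standard kafka-topics.sh script, and the idea is to pipe the text file into the console producer rather than typing lines by hand, roughly like this (input.txt under the mounted /producer folder is just a placeholder name):

# Run inside the Kafka container: create the topic
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic demo --partitions 1 --replication-factor 1

# Pipe a text file into the topic instead of typing messages interactively
cat /producer/input.txt | bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic demo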

When I run the Python code from the spark-master container with "spark-submit mycode.py", I get this error message:

Traceback (most recent call last):
  File "/src/structuredKafkaSpark.py", line 12, in <module>
    df = spark \
  File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 469, in load
  File "/opt/bitnami/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 196, in deco
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".

I don't understand what I'm doing wrong; I'm following the Spark guide, link.

My final goal is to send a text file from my PC to Kafka, to then be consumed by Spark, where I will do some map() transformations.

docker-compose file:

version: "3.7"

networks:
  datapipeline:
    driver: bridge

services:
  spark-master:
    build:
      context: ./spark
      dockerfile: ./Dockerfile
    container_name: "spark-master"
    environment:
      - SPARK_MODE=master
      - SPARK_LOCAL_IP=spark-master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - "7077:7077"
      - "8080:8080"
    volumes:
      - ./src:/src
      - ./data:/data
      - ./output:/output
    networks:
      - datapipeline

  spark-worker-1:
    image: docker.io/bitnami/spark:latest
    container_name: "spark-worker-1"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no


  spark-worker-2:
    image: docker.io/bitnami/spark:latest
    container_name: "spark-worker-2"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no



  # ----------------- #
  # Apache Kafka      #
  # ----------------- #
  zookeeper:
    image: docker.io/bitnami/zookeeper:latest
    container_name: "zookeeper"
    ports:
      - "2181:2181"
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
    networks:
      - datapipeline

  kafka:
    image: docker.io/bitnami/kafka:latest
    container_name: "kafka"
    ports:
      - "9092:9092"
    environment:
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
    depends_on:
      - zookeeper
    volumes:
      - ./producer:/producer
    networks:
      - datapipeline
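
For completeness, I bring the stack up and get a shell in the Kafka container with the usual Docker commands (container names as defined above):

# Start all services in the background
docker-compose up -d

# Open a shell in the Kafka container to run the topic/producer commands
docker exec -it kafka bash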

Python code run in spark-master:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .config("spark.driver.host", "localhost")\
    .getOrCreate()


# Create DataFrame representing the stream of input lines from kafka
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "kafka:9092") \
  .option("subscribe", "demo") \
  .load()

# Cast the binary key/value columns from Kafka to strings
lines = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

 # Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()
1 Answer

I'm following the Spark guide, link

The Spark container doesn't include the spark-sql-kafka-0-10 dependency. You need to add it, as that link says... For example, using spark.jars.packages

Make note of the first two versions (the Scala and Spark versions). If they don't match your cluster, you will still get errors.
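
If you're not sure what those versions are, spark-submit --version run inside the spark-master container prints both the Spark version and the Scala version it was built against:

# Quick check from inside the spark-master container
spark-submit --version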

from pyspark.sql import SparkSession

# TODO: Ensure these are correct for your cluster
scala_version = '2.12'
spark_version = '3.2.1'

# Maven coordinates that Spark will download at startup
packages = [
    f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}',
    'org.apache.kafka:kafka-clients:3.2.3'
]

spark = SparkSession.builder\
    .master("spark://spark-master:7077")\
    .appName("kafka-example")\
    .config("spark.jars.packages", ",".join(packages))\
    .getOrCreate()

Ref. https://github.com/OneCricketeer/docker-stacks/blob/master/hadoop-spark/spark-notebooks/kafka-sql.ipynb

Otherwise, you need to run spark-submit --packages '...' mycode.py with the same comma-separated list.
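
A sketch of that command, assuming the Scala 2.12 / Spark 3.2.1 versions used above (adjust them to match your installation):

# Run from the spark-master container; versions must match your Spark/Scala build
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,org.apache.kafka:kafka-clients:3.2.3 \
  mycode.py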
