
When we try to stream data from an SSL-enabled Kafka topic, we hit the error below. Can you please help us with this issue?

19/11/07 13:26:54 INFO ConsumerFetcherManager: [ConsumerFetcherManager-1573151189884] Added fetcher for partitions ArrayBuffer()
19/11/07 13:26:54 WARN ConsumerFetcherManager$LeaderFinderThread: [spark-streaming-consumer_dvtcbddc101.corp.cox.com-1573151189725-d40a510f-leader-finder-thread], Failed to find leader for Set([inst_monitor_status_test,2], [inst_monitor_status_test,0], [inst_monitor_status_test,1])
java.lang.NullPointerException
        at org.apache.kafka.common.utils.Utils.formatAddress(Utils.java:408)
        at kafka.cluster.Broker.connectionString(Broker.scala:62)
        at kafka.client.ClientUtils$$anonfun$fetchTopicMetadata$5.apply(ClientUtils.scala:89)
        at kafka.client.ClientUtils$$anonfun$fetchTopicMetadata$5.apply(ClientUtils.scala:89)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:89)
        at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)

PySpark code:

from __future__ import print_function

import sys
import json
from operator import add

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from kafka import SimpleProducer, KafkaClient
from kafka import KafkaProducer


def handler(message):
    # Collect each micro-batch RDD to the driver and print its records
    records = message.collect()
    for record in records:
        print(record)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_wordcount.py <zk> <topic>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 10)

    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
    counts.pprint()
    kvs.foreachRDD(handler)

    ssc.start()
    ssc.awaitTermination()

Spark-submit command:

/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 dsstream2.py host:2181 inst_monitor_status_test

  • `at kafka.cluster.Broker.connectionString` ... Sounds like you didn't get the right address for the cluster. If you `print(zkQuorum)`, do you get the right address? Also, do you really need Spark? You seem to have `from kafka import KafkaProducer` already, which is a native Python library – OneCricketeer Nov 07 '19 at 23:01
  • Plus, you seem to be missing any SSL related settings for Spark – OneCricketeer Nov 07 '19 at 23:04
  • Thanks for your information. zkQuorum has the right address, but I am still not sure how to pass the SSL-related settings to Spark. Could you please let me know if you have an idea on this? Sample code would be great!! – Karthikeyan Rasipalayam Durai Nov 08 '19 at 15:22
  • Did you see this? https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html#ssl--tls – OneCricketeer Nov 08 '19 at 18:20
  • Based on the above link, it seems the SSL-related settings for Spark can be incorporated only in Scala or Java code. I don't think we can pass the SSL-related information through kafkaParams in PySpark code. Could you please let me know of any way to handle SSL in PySpark when connecting to a Kerberized cluster? – Karthikeyan Rasipalayam Durai Nov 08 '19 at 21:45
  • PySpark Kafka libraries all use the same JVM libraries as Java/Scala, so you should be able to pass the same options in Python maps (see the sketch after this comment thread). The problem might be that you would want a Zookeeper SASL_SSL connection as well – OneCricketeer Nov 08 '19 at 21:52
  • You might get more focused answers from Cloudera support. https://community.cloudera.com/t5/Community-Articles/Notable-Pits-for-Spark-Streaming-Reading-Kafka-with-Kerberos/tac-p/247055 – OneCricketeer Nov 08 '19 at 21:54
  • Thanks for all your inputs. Please find below how we have handled SSL. – Karthikeyan Rasipalayam Durai Nov 14 '19 at 15:42
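As the comments note, the Kafka client settings are plain string key/value pairs, so in PySpark they can be gathered in an ordinary dict and passed as a map. A minimal sketch with the Structured Streaming reader, using placeholder broker, topic, and truststore values that mirror the answer below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SslKafkaSketch").getOrCreate()

# All Kafka client settings are forwarded as "kafka."-prefixed string
# options, so they can live in a plain Python dict (placeholder values):
kafka_options = {
    "kafka.bootstrap.servers": "kafka_server:9093",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.kerberos.service.name": "kafka",
    "kafka.ssl.truststore.location": "/home/path/kafka_trust.jks",
    "kafka.ssl.truststore.password": "password_rd",
}

df = spark.readStream \
    .format("kafka") \
    .options(**kafka_options) \
    .option("subscribe", "inst_monitor_status_test") \
    .load()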

1 Answer

Thanks for your inputs. I passed the SSL parameters in the following way and it is working as expected.

from pyspark.sql import SparkSession

#  Spark session (Structured Streaming does not need a StreamingContext) :

spark = SparkSession.builder.appName('PythonStreamingDirectKafkaWordCount').getOrCreate()

#  Kafka Topic Details :

KAFKA_TOPIC_NAME_CONS = "topic_name"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'kafka_server:9093'

#  Creating  readstream DataFrame :

df = spark.readStream \
     .format("kafka") \
     .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
     .option("subscribe", KAFKA_TOPIC_NAME_CONS) \
     .option("startingOffsets", "earliest") \
     .option("kafka.security.protocol","SASL_SSL")\
     .option("kafka.client.id" ,"Clinet_id")\
     .option("kafka.sasl.kerberos.service.name","kafka")\
     .option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
     .option("kafka.ssl.truststore.password", "password_rd") \
     .option("kafka.sasl.kerberos.keytab","/home/path.keytab") \
     .option("kafka.sasl.kerberos.principal","path") \
     .load()

df1 = df.selectExpr("CAST(value AS STRING)")

#  Creating  Writestream DataFrame :

query = df1.writeStream \
   .format("csv") \
   .option("path", "target_directory") \
   .option("checkpointLocation", "chkpint_directory") \
   .outputMode("append") \
   .start()

#  Block until the streaming query terminates (replaces the earlier,
#  unused StreamingContext's awaitTermination) :
query.awaitTermination()
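For reference, a job like this can be launched the same way as the original, with a single 0-10 SQL connector on the classpath. The script name here is a placeholder, and the connector version should match your Spark build (note the question's command mixed 2.1.0 and 2.3.0 of the same artifact):

/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 structured_kafka_stream.py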