I have a Kafka-PySpark structured-streaming job that reads from a topic. The Kafka configuration, including startingOffsets, comes from a kafka.yml file:
checkpointLocation: "/user/aiman/checkpoint/kafka_checkpoint/test_topic"
kafka.bootstrap.servers: "kafka.server.com:9093"
subscribe: "TEST_TOPIC"
startingOffsets: {"TEST_TOPIC": {"0":-2}}
kafka.security.protocol: "SSL"
kafka.ssl.keystore.location: "kafka.keystore.uat.jks"
kafka.ssl.keystore.password: "abc123"
kafka.ssl.key.password: "abc123"
kafka.ssl.truststore.type: "JKS"
kafka.ssl.truststore.location: "kafka.truststore.uat.jks"
kafka.ssl.truststore.password: "abc123"
kafka.ssl.enabled.protocols: "TLSv1"
kafka.ssl.endpoint.identification.algorithm: ""
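For reference, I build the options dict from the YAML and hand it to the Kafka source roughly like this (a simplified sketch, not the full job; the SSL settings are omitted). As I understand it, startingOffsets has to reach Spark as a string: either the literal earliest/latest, or a JSON string of per-partition offsets.

```python
import json

# Simplified reconstruction of the options loaded from kafka.yml
# (SSL settings omitted; variable names are placeholders).
conf = {
    "kafka.bootstrap.servers": "kafka.server.com:9093",
    "subscribe": "TEST_TOPIC",
    # The per-partition offsets dict is serialized to a JSON *string*,
    # since Spark expects a string value for this option.
    "startingOffsets": json.dumps({"TEST_TOPIC": {"0": -2}}),
}

# df = (spark.readStream
#         .format("kafka")
#         .options(**conf)
#         .load())
```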
When I set the offset to:
startingOffsets: "earliest"
I get the following error:
Traceback (most recent call last):
File "/opt/cloudera/parcels/IMMUTA/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/cloudera/parcels/IMMUTA/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o193.load.
: java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got "earliest"
at org.apache.spark.sql.kafka010.JsonUtils$.partitionOffsets(JsonUtils.scala:74)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.getKafkaOffsetRangeLimit(KafkaSourceProvider.scala:485)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.createMicroBatchReader(KafkaSourceProvider.scala:128)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.createMicroBatchReader(KafkaSourceProvider.scala:48)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:183)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Similarly, when I set the offset to:
startingOffsets: "latest"
I get the error:
Traceback (most recent call last):
File "/opt/cloudera/parcels/IMMUTA/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/cloudera/parcels/IMMUTA/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o193.load.
: java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got "latest"
at org.apache.spark.sql.kafka010.JsonUtils$.partitionOffsets(JsonUtils.scala:74)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.getKafkaOffsetRangeLimit(KafkaSourceProvider.scala:485)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.createMicroBatchReader(KafkaSourceProvider.scala:128)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.createMicroBatchReader(KafkaSourceProvider.scala:48)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:183)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
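For completeness, the shape the exception asks for can be built with plain json. As I understand the Spark docs, -2 and -1 are per-partition sentinels meaning "earliest" and "latest" respectively:

```python
import json

# The shape the exception expects: topic -> partition -> offset.
# Per the Spark Kafka integration docs, -2 means "earliest" and
# -1 means "latest" for that partition.
offsets = {"topicA": {"0": 23, "1": -1}, "topicB": {"0": -2}}
as_option = json.dumps(offsets, separators=(",", ":"))
# as_option == '{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}'
```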
And when I set the offset to an empty partition map:
startingOffsets: {"TEST_TOPIC": {}}
I get the error:
pyspark.sql.utils.StreamingQueryException: 'Failed to construct kafka consumer
=== Streaming Query ===
Identifier: [id = a8799b99-2040-49f1-a155-58f64bbef78d, runId = 600b728d-bfe5-41fc-90d1-40f6bbc5d35c]
Current Committed Offsets: {}
Current Available Offsets: {}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:
Project [cast(key#7 as string) AS key#21, cast(value#8 as string) AS value#22, offset#11L, timestamp#12, partition#10]
+- StreamingExecutionRelation KafkaV2[Subscribe["TEST_TOPIC"]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]
'
How do I make this work when the topic may contain no data yet (the error shows "Current Committed Offsets: {}" and "Current Available Offsets: {}")?