
My goal is to take data from non-file sources (e.g. generated within a program or sent through an API) and send it to a Spark stream. To accomplish this, I'm sending the data through a Python-based KafkaProducer:

$ bin/zookeeper-server-start.sh config/zookeeper.properties &
$ bin/kafka-server-start.sh config/server.properties &
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic my-topic
$ python 
Python 3.6.1 | Anaconda custom (64-bit)
>>> from kafka import KafkaProducer
>>> import time
>>> producer = KafkaProducer(bootstrap_servers='localhost:9092', value_serializer=lambda v: json.dumps(v).encode('utf-8'))
>>> producer.send(topic='my-topic', value='MESSAGE ACKNOWLEDGED', timestamp_ms=time.time())
>>> producer.close()
>>> exit()

My issue is that nothing shows up when I check the topic with the console consumer script:

$ bin/kafka-console-consumer.sh --bootstrap-server localhost:2181 --topic my-topic
^C$

Is something missing or wrong here? I'm new to Spark/Kafka/messaging systems, so anything will help. The Kafka version is 0.11.0.0 (Scala 2.11), and no changes were made to the config files.

user2361174

2 Answers


If you start a consumer after sending messages to a topic, the consumer may skip those messages, because it sets its offset (which you can think of as the "starting point" to read from) to the end of the topic. To change that behavior, try adding the --from-beginning option:

$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-topic --from-beginning
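
If you'd rather check from Python, the rough kafka-python equivalent of --from-beginning is auto_offset_reset='earliest'. A minimal sketch (the consumer_timeout_ms value is only there so the loop exits instead of blocking forever):

from kafka import KafkaConsumer

# With no committed offset, 'earliest' starts reading from the beginning of the topic
consumer = KafkaConsumer('my-topic',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)  # give up after 5s with no messages
for record in consumer:
    print(record.value)
consumer.close()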

You can also try kafkacat, which (imho) is more convenient than Kafka's console consumer and producer. Reading messages from Kafka with kafkacat can be done with the following command:

kafkacat -C -b 'localhost:9092' -o beginning -e -D '\n' -t 'my-topic'

Hope this helps.

xscratt
  • I added `--from-beginning` but the result was the same. I also installed kafkacat, redid my steps, and ran your command, but it still wasn't finding the message. – user2361174 Jul 10 '17 at 18:52
  • @user2361174 Just checked your example, and it seems the producer doesn't send anything because of `timestamp_ms=time.time()` – if you turn on debug logging, the following message appears in the log: `DEBUG:kafka.producer.kafka:Exception occurred during message send: `. Probably `time.time()` returns a timestamp in a format the producer doesn't expect... So removing that option should do the trick, i.e. `producer.send(topic='my-topic', value='MESSAGE ACKNOWLEDGED')` (the current timestamp will be used by default; see the sketch after these comments). – xscratt Jul 10 '17 at 21:01
  • I got rid of the timestamp, but nothing is showing up in the consumer or with the kafkacat command. I've saved the command-line output here: https://raw.githubusercontent.com/dretta/spark/master/kafka.log – user2361174 Jul 10 '17 at 21:59
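
On the timestamp point from the comments: kafka-python documents timestamp_ms as an integer of epoch milliseconds, while time.time() returns a float in seconds. If you want to keep an explicit timestamp rather than drop the argument, a sketch of the fix:

import time

producer.send(topic='my-topic',
              value='MESSAGE ACKNOWLEDGED',
              timestamp_ms=int(time.time() * 1000))  # convert float seconds to integer milliseconds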

I found the issue: the value_serializer silently breaks because I didn't import the json module into the interpreter. There are two solutions. One is to simply import the module, in which case you'll get "MESSAGE ACKNOWLEDGED" (with quotation marks) back. The other is to remove value_serializer altogether and convert the value string being sent on the next line into a byte string (i.e. b'MESSAGE ACKNOWLEDGED' for Python 3), so the message comes back without quotation marks.
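
For reference, a corrected session for the first solution might look like this (a sketch; the flush() call isn't strictly required, since close() also waits for pending sends, but it makes the delivery explicit):

import json  # the missing import that silently broke value_serializer
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         value_serializer=lambda v: json.dumps(v).encode('utf-8'))
producer.send(topic='my-topic', value='MESSAGE ACKNOWLEDGED')  # default timestamp is used
producer.flush()  # block until the send actually completes
producer.close()

And the second solution, without the serializer:

producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send(topic='my-topic', value=b'MESSAGE ACKNOWLEDGED')  # raw bytes, no JSON quoting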

I also switched Kafka to version 0.10.2.1 (Scala 2.11), since nothing in the kafka-python documentation confirms compatibility with version 0.11.0.0.

user2361174