SparkStreaming, RabbitMQ and MQTT in python using pika

Question

Just to make things tricky, I'd like to consume messages from a RabbitMQ queue. I know there is an MQTT plugin for RabbitMQ (https://www.rabbitmq.com/mqtt.html).

However I cannot seem to make an example work where Spark consumes a message that has been produced from pika.

For example, I am using the simple wordcount.py program here (https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html) to see if I can read messages from a producer written in the following way:

import sys
import pika

def sendJson(payload):
  # Connect to the local broker over AMQP (pika's default port is 5672)
  connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
  channel = connection.channel()

  # Declare the queue and bind it to the existing exchange
  channel.queue_declare(queue='analytics', durable=True)
  channel.queue_bind(exchange='analytics_exchange',
                     queue='analytics')

  channel.basic_publish(exchange='analytics_exchange', routing_key='analytics', body=payload)
  connection.close()

if __name__ == "__main__":
  # Publish the contents of the file given on the command line
  with open(sys.argv[1], 'r') as json_file:
    sendJson(json_file.read())

The Spark Streaming consumer is the following:

import sys
import operator

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.mqtt import MQTTUtils

sc = SparkContext(appName="SS")
sc.setLogLevel("ERROR")
ssc = StreamingContext(sc, 1)
ssc.checkpoint("checkpoint")
#ssc.setLogLevel("ERROR")


#RabbitMQ

"""EXCHANGE = 'analytics_exchange'
EXCHANGE_TYPE = 'direct'
QUEUE = 'analytics'
ROUTING_KEY = 'analytics'
RESPONSE_ROUTING_KEY = 'analytics-response'
"""


brokerUrl = "localhost:5672" # "tcp://iot.eclipse.org:1883"
topic = "analytics"

mqttStream = MQTTUtils.createStream(ssc, brokerUrl, topic)
#dummy functions - nothing interesting...
words = mqttStream.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

wordCounts.pprint()
ssc.start()
ssc.awaitTermination()

However unlike the simple wordcount example, I cannot get this to work and get the following error:

16/06/16 17:41:35 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 8)
java.lang.NullPointerException
    at org.eclipse.paho.client.mqttv3.MqttConnectOptions.validateURI(MqttConnectOptions.java:457)
    at org.eclipse.paho.client.mqttv3.MqttAsyncClient.<init>(MqttAsyncClient.java:273)

So my questions are: what settings should MQTTUtils.createStream(ssc, brokerUrl, topic) use to listen to the queue, are there any fuller examples, and how do these settings map onto those of RabbitMQ?

I am running my consumer code with: ./bin/spark-submit ../../bb/code/skunkworks/sparkMQTTRabbit.py

I have updated the producer code as follows with TCP parameters as suggested by one comment:

url_location = 'tcp://localhost'
url = os.environ.get('', url_location)
params = pika.URLParameters(url)
connection = pika.BlockingConnection(params)

and the spark streaming as:

brokerUrl = "tcp://127.0.0.1:5672"
topic = "#" #all messages

mqttStream = MQTTUtils.createStream(ssc, brokerUrl, topic)
records = mqttStream.flatMap(lambda line: json.loads(line))
count = records.map(lambda rec: len(rec))
total = count.reduce(lambda a, b: a + b)
total.pprint()

zero323 · Accepted Answer · 2016-07-03T21:53:19.073

It looks like you are using the wrong port number. Assuming that:

you have a local instance of RabbitMQ running with default settings, you've enabled the MQTT plugin (rabbitmq-plugins enable rabbitmq_mqtt), and you've restarted the RabbitMQ server
you've included spark-streaming-mqtt when executing spark-submit / pyspark (either with packages or jars / driver-class-path)

you can connect using TCP with tcp://localhost:1883. Also remember that the MQTT plugin uses the amq.topic exchange by default.
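
Since the producer speaks AMQP and the Spark receiver speaks MQTT, the routing key and the topic have to line up. The sketch below illustrates the translation as I understand it from the RabbitMQ MQTT plugin documentation: MQTT levels separated by "/" correspond to AMQP words separated by ".", and the MQTT wildcard "+" corresponds to AMQP "*". The function names are hypothetical helpers, not part of pika or the plugin, and edge cases such as topics that already contain dots are ignored.

```python
def mqtt_topic_to_routing_key(topic):
    """Translate an MQTT topic or topic filter into AMQP routing-key form.

    Hypothetical helper for illustration only; it mirrors the documented
    separator and wildcard translation, not the plugin's actual code.
    """
    words = []
    for level in topic.split("/"):
        # MQTT "+" (one level) corresponds to AMQP "*" (one word);
        # "#" (multi-level) is spelled the same way on both sides.
        words.append("*" if level == "+" else level)
    return ".".join(words)


def routing_key_to_mqtt_topic(routing_key):
    """Inverse translation: AMQP routing key or binding pattern to MQTT form."""
    levels = ("+" if word == "*" else word for word in routing_key.split("."))
    return "/".join(levels)


if __name__ == "__main__":
    for topic in ("hello", "sensors/+/temp", "sensors/#"):
        print(topic, "->", mqtt_topic_to_routing_key(topic))
```

A single-level name such as hello is the same on both sides, which is why the quick start below can use it unchanged as both routing key and topic.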

Quick start:

create a Dockerfile with the following content:

FROM rabbitmq:3-management

RUN rabbitmq-plugins enable rabbitmq_mqtt

build Docker image:
```
docker build -t rabbit_mqtt .
```

start a container from the image and wait until the server is ready:

docker run -p 15672:15672 -p 5672:5672 -p 1883:1883 rabbit_mqtt

create producer.py with the following content:

import pika
import time 


connection = pika.BlockingConnection(pika.ConnectionParameters(
    host='localhost'))
channel = connection.channel()
channel.exchange_declare(exchange='amq.topic',
                         exchange_type='topic',  # the older `type=` keyword was removed in pika 1.0
                         durable=True)

for i in range(1000):
    channel.basic_publish(
        exchange='amq.topic',  # amq.topic as exchange
        routing_key='hello',   # Routing key used by producer
        body='Hello World {0}'.format(i)
    )
    time.sleep(3)

connection.close()

start producer
```
python producer.py
```
and visit the management console at http://127.0.0.1:15672/#/exchanges/%2F/amq.topic to see that messages are received.

create consumer.py with the following content:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.mqtt import MQTTUtils

sc = SparkContext()
ssc = StreamingContext(sc, 10)

mqttStream = MQTTUtils.createStream(
    ssc, 
    "tcp://localhost:1883",  # Note both port number and protocol
    "hello"                  # The same routing key as used by producer
)
mqttStream.count().pprint()
ssc.start()
ssc.awaitTermination()
ssc.stop()

download dependencies (adjust Scala version to the one used to build Spark and Spark version):
```
mvn dependency:get -Dartifact=org.apache.spark:spark-streaming-mqtt_2.11:1.6.1
```
make sure SPARK_HOME and PYTHONPATH point to the correct directories.

submit consumer.py with (adjust versions as before):

spark-submit --packages org.apache.spark:spark-streaming-mqtt_2.11:1.6.1 consumer.py

If you followed all the steps, you should see a count of the received Hello World messages for each batch in the Spark output.

Thanks. I'll take a look. Can this work with direct as well as topic? — disruptive, Jul 04 '16 at 16:34
MQTT plugin [can be configured](https://www.rabbitmq.com/mqtt.html#config) to use a different exchange, but as far as I can tell this is it. The MQTT protocol is not much richer than that anyway. — zero323, Jul 04 '16 at 21:49
Is there a way to configure this without Docker, for example using the .config file? I have tried with the default settings in https://www.rabbitmq.com/mqtt.html, but this does not work at all. With no settings, my Spark listener can connect with the following: =INFO REPORT==== 5-Jul-2016::11:52:08 === accepting MQTT connection <0.321.0> (127.0.0.1:47868 -> 127.0.0.1:1883). But how do I make the produced messages map onto this port? — disruptive, Jul 05 '16 at 09:58
Docker is not crucial here, but I don't really understand the question. A port is not a property of a message. It is a global property of the server. If topics and exchange match, there should be no reason for any issues. What do you mean by "it doesn't work"? When you check the RabbitMQ UI, do you see bindings from the producer? How about the consumer? Do the routing keys match? — zero323, Jul 05 '16 at 10:57
I had tried using the standard message queue we use. I have now tried using the topic without a queue, and this seems to work right out of the box. However, I didn't use Docker. — disruptive, Jul 05 '16 at 15:05
This answer is great. Now we are trying a more refined solution. @zero323 however, the typical use case we have is a direct exchange with a queue bound to this exchange. Sadly, I have not seen any way to make this work. Any suggestions on how we might use the direct approach? — disruptive, Jul 26 '16 at 09:23

score 2 · Answer 2 · answered Jun 25 '16 at 09:55

From the MqttAsyncClient Javadoc, the server URI must have one of the following schemes: tcp://, ssl://, or local://. You need to change your brokerUrl above to have one of these schemes.

For more information, here's a link to the source for MqttAsyncClient:

https://github.com/eclipse/paho.mqtt.java/blob/master/org.eclipse.paho.client.mqttv3/src/main/java/org/eclipse/paho/client/mqttv3/MqttAsyncClient.java#L272
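
As a rough illustration, the scheme requirement can be checked on the Python side before the URL is handed to MQTTUtils.createStream. This is only a sketch: check_broker_url is a hypothetical helper, not part of Spark or Paho, and the list of schemes is simply the one quoted from the Javadoc above. A valid scheme does not guarantee that the port is the one the broker's MQTT listener actually uses.

```python
# Schemes listed in the MqttAsyncClient Javadoc quoted above.
SUPPORTED_SCHEMES = ("tcp", "ssl", "local")


def check_broker_url(broker_url):
    """Return broker_url unchanged if it starts with a supported scheme.

    Hypothetical pre-flight check; raises ValueError otherwise, which is
    easier to diagnose than a NullPointerException inside the receiver.
    """
    scheme, separator, rest = broker_url.partition("://")
    if not separator or scheme not in SUPPORTED_SCHEMES or not rest:
        raise ValueError(
            "broker URL %r must look like <scheme>://<address> with scheme in %s"
            % (broker_url, ", ".join(SUPPORTED_SCHEMES))
        )
    return broker_url


if __name__ == "__main__":
    print(check_broker_url("tcp://localhost:1883"))
    try:
        check_broker_url("localhost:5672")  # the value from the question
    except ValueError as err:
        print("rejected:", err)
```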

ck1

I attempted to change the producer to use tcp instead of http, however I found that I now get a connection issue of the following: ERROR ReceiverSupervisorImpl: Stopped receiver with error: Connection lost (32109) - java.net.SocketException: Connection reset – disruptive Jun 29 '16 at 10:54
