
As the Spark docs say, Kafka is supported as a streaming data source, but I use ZeroMQ, and there is no ZeroMQUtils. So how can I use it? And more generally, what about other MQs? I am totally new to Spark and Spark Streaming, so I am sorry if the question is stupid. Could anyone give me a solution? Thanks. BTW, I use Python.

Update: I finally did it in Java with a custom Receiver. Below is my solution:

import java.util.List;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;
import org.zeromq.ZMQ;

public class ZeroMQReceiver<T> extends Receiver<T> {

    private static final ObjectMapper mapper = new ObjectMapper();

    public ZeroMQReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        // Start the thread that receives data over a connection
        new Thread(this::receive).start();
    }

    @Override
    public void onStop() {
        // There is nothing much to do, as the thread calling receive()
        // is designed to stop by itself once isStopped() returns true
    }

    /** Create a socket connection and receive data until the receiver is stopped. */
    private void receive() {
        String message;

        try {
            ZMQ.Context context = ZMQ.context(1);
            ZMQ.Socket subscriber = context.socket(ZMQ.SUB);
            subscriber.connect("tcp://ip:port");
            subscriber.subscribe("".getBytes()); // empty prefix = subscribe to everything

            // Until stopped or the connection breaks, keep reading
            while (!isStopped() && (message = subscriber.recvStr()) != null) {
                // Note: because of type erasure, List<T> deserializes its elements
                // as Maps here; bind a concrete type if you need real POJOs.
                List<T> results = mapper.readValue(message,
                        new TypeReference<List<T>>() {});
                for (T item : results) {
                    store(item);
                }
            }
            // Restart in an attempt to connect again when the server is active again
            restart("Trying to connect again");
        } catch (Throwable t) {
            // Restart if there is any other error
            restart("Error receiving data", t);
        }
    }
}
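The lifecycle contract the comments above rely on (the worker thread exits on its own once the stop flag is set, so onStop() has nothing to do) can be sketched without any Spark or ZeroMQ dependency. This is a minimal, hypothetical stand-in, not Spark's actual Receiver class:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal stand-in for the Receiver lifecycle: the receive loop runs on its
// own thread and exits by itself once the stop flag is set.
class StoppableReceiver {
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final List<String> store = new ArrayList<>();
    private Thread worker;

    void onStart() {
        worker = new Thread(this::receive);
        worker.start();
    }

    void onStop() throws InterruptedException {
        stopped.set(true);  // the loop notices this and terminates on its own
        worker.join();      // wait for the worker to finish cleanly
    }

    private void receive() {
        int i = 0;
        while (!stopped.get()) {
            // Simulate blocking on a socket for the next message.
            try { Thread.sleep(1); } catch (InterruptedException e) { return; }
            synchronized (store) { store.add("msg-" + i++); }
        }
    }

    int received() {
        synchronized (store) { return store.size(); }
    }
}

public class Demo {
    public static void main(String[] args) throws Exception {
        StoppableReceiver r = new StoppableReceiver();
        r.onStart();
        Thread.sleep(100);
        r.onStop();
        System.out.println(r.received() > 0);
    }
}
```

The same shutdown pattern is why the real receiver checks isStopped() on every iteration: a blocked recvStr() call is the only place it can hang.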

1 Answer


I assume you are talking about Structured Streaming.

I am not familiar with ZeroMQ, but an important requirement for Spark Structured Streaming sources is replayability (to ensure fault tolerance), which, if I understand correctly, ZeroMQ doesn't deliver out of the box.

A practical approach would be to buffer the data, either in Kafka (using the KafkaSource) or as files in a directory (local FS/NFS, HDFS, S3) read via the FileSource. Cf. the Spark docs. If you use the FileSource, make sure not to append to an existing file in the FileSource's input directory; instead, move files into the directory atomically.
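The "move atomically" part can be done with java.nio: write the complete file into a staging directory first, then move it into the watched directory in one step, so the FileSource never sees a half-written file. A minimal sketch (the directory names and file name are illustrative, not anything Spark requires):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicDrop {
    public static void main(String[] args) throws IOException {
        // Stand-ins for a staging directory and the directory Spark watches.
        Path stagingDir = Files.createTempDirectory("staging");
        Path inputDir = Files.createTempDirectory("spark-input");

        // 1. Write the complete file outside the watched directory.
        Path tmp = stagingDir.resolve("batch-0001.json");
        Files.write(tmp, "[{\"id\": 1}]".getBytes());

        // 2. Move it into the watched directory atomically. ATOMIC_MOVE
        //    requires source and target to be on the same filesystem.
        Files.move(tmp, inputDir.resolve(tmp.getFileName()),
                StandardCopyOption.ATOMIC_MOVE);

        System.out.println(Files.exists(inputDir.resolve("batch-0001.json")));
    }
}
```

On a POSIX filesystem this is a rename, so readers see either no file or the whole file, never a partial one.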
