
I have the following Spark job:

from __future__ import print_function

import os
import sys
import time
from random import random
from operator import add
from pyspark.streaming import StreamingContext
from pyspark import SparkContext, SparkConf
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SQLContext, Row
from pyspark_cassandra import streaming, CassandraSparkContext

if __name__ == "__main__":

    conf = SparkConf().setAppName("PySpark Cassandra Test")
    sc = CassandraSparkContext(conf=conf)
    stream = StreamingContext(sc, 2)

    rdd = sc.cassandraTable("keyspace2", "users").collect()
    # print(rdd)
    stream.start()
    stream.awaitTermination()
    sc.stop() 

When I run this, it gives me the following error:

ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute

The shell command I run:

./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.2.4 examples/src/main/python/test/reading-cassandra.py

Comparing this with a Spark Streaming + Kafka job, the above code is missing a line like:

kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", {'topic':1})

where `createStream` is what actually registers the stream. For Cassandra, I can't see anything like this in the docs. How do I start the streaming between Spark Streaming and Cassandra?

Versions:

Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
HackCode
  • You want to stream from Cassandra to Spark? I don't think that's supported at the moment. Saving streaming data *to* cassandra is supported: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md – maasg Jan 26 '16 at 17:21
  • Yup, I want to stream from CASSANDRA to SPARK. I thought I was pretty close with the script that I wrote, I just needed to register an operation with the stream, that is "createStream" perhaps. I know how to stream from spark to cassandra. – HackCode Jan 26 '16 at 17:26
  • Do you want to stream the whole table (`cassandraTable("keyspace2","users")`) every time interval? – maasg Jan 26 '16 at 17:48
  • @maasg Yes, basically for data analysis. – HackCode Jan 26 '16 at 18:06

1 Answer


To create a DStream out of a Cassandra table, you can use a ConstantInputDStream, providing the RDD created from the Cassandra table as input. This results in the RDD being materialized on each DStream batch interval.

Be warned that large tables or tables that continuously grow in size will negatively impact performance of your Streaming job.

See also: Reading from Cassandra using Spark Streaming for an example.
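In Scala (the Python API has no `ConstantInputDStream` binding, as noted in the comments below), a minimal sketch using the spark-cassandra-connector might look like the following. The keyspace and table names are taken from the question; the `foreachRDD` body is just an illustrative output operation — registering one is what avoids the "No output operations registered" error:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import com.datastax.spark.connector._  // adds sc.cassandraTable

object CassandraToStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Cassandra to Spark Streaming")
    val sc   = new SparkContext(conf)
    val ssc  = new StreamingContext(sc, Seconds(2))

    // RDD backed by the Cassandra table
    val usersRdd = sc.cassandraTable("keyspace2", "users")

    // Re-materialize the same RDD on every batch interval
    val usersStream = new ConstantInputDStream(ssc, usersRdd)

    // Register an output operation so the context has something to execute
    usersStream.foreachRDD(rdd => println(rdd.count()))

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that because the RDD is re-evaluated each interval, every batch re-reads the table, which is why table size matters for the warning above.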

maasg
  • thanks for the answer, but I tried searching for its implementation with pyspark but couldn't find any. Is it supported with python? – HackCode Jan 27 '16 at 10:27
  • @HackCode after checking the Python API, it looks like `ConstantInputDStream` does not exist for the Python bindings: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#module-pyspark.streaming – maasg Jan 27 '16 at 16:44
  • @HackCode Did you ever find a solution for this? If `ConstantInputDStream` does not exist in the Python API, how can PySpark Streaming work with Cassandra? – user2361174 Jul 12 '17 at 21:14