
So my project flow is Kafka -> Spark Streaming -> HBase.

Now I want to read the data back from HBase: a second job should go over the table created by the previous job, do some aggregation, and store the result in another table, in a different column format.

Kafka -> Spark Streaming (2ms) -> HBase -> Spark Streaming (10ms) -> HBase

Now I don't know how to read data from HBase using Spark Streaming. I found the Cloudera Labs project SparkOnHBase (http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/), but I can't figure out how to get an input DStream for stream processing from HBase. Please provide any pointers or library links, if there are any, that would help me do this.

2 Answers


You can create a DStream from a queue of RDDs with StreamingContext.queueStream:

import java.util.LinkedList;
import java.util.Queue;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import com.cloudera.spark.hbase.JavaHBaseContext;
import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext(conf);
org.apache.hadoop.conf.Configuration hconf = HBaseConfiguration.create();
JavaHBaseContext jhbc = new JavaHBaseContext(sc, hconf);

// Scan the table written by the first job
Scan scan1 = new Scan();
scan1.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, tableName.getBytes());

// Create an RDD over the scan; the identity function keeps the raw (rowkey, row) pairs
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> rdd = jhbc.hbaseRDD(tableName, scan1,
        new Function<Tuple2<ImmutableBytesWritable, Result>, Tuple2<ImmutableBytesWritable, Result>>() {
            @Override
            public Tuple2<ImmutableBytesWritable, Result> call(
                    Tuple2<ImmutableBytesWritable, Result> tuple) throws Exception {
                return tuple;
            }
        });

// Create the streaming context (a batch interval is required; adjust to your needs)
JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(1));

// Feed the RDD through a queue to obtain a DStream
Queue<JavaRDD<Tuple2<ImmutableBytesWritable, Result>>> queue = new LinkedList<>();
queue.add(rdd);

JavaDStream<Tuple2<ImmutableBytesWritable, Result>> stream = ssc.queueStream(queue);
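
From here you can finish the use case in the question: aggregate each micro-batch and write the result to a second table. Below is a minimal sketch of that step, continuing the snippet above and using SparkOnHBase's bulkPut. It assumes Java 8 lambdas and Spark 1.6+; the table name "aggTable", the column family "cf", the qualifier "count", and the per-batch row count are all made-up placeholders for your real aggregation, and you should check bulkPut's exact signature against your SparkOnHBase version. Extra imports needed: org.apache.hadoop.hbase.client.Put, org.apache.hadoop.hbase.util.Bytes, java.util.Collections.

// Sketch only: aggregate each micro-batch and write the aggregate back to HBase.
// "aggTable", "cf" and "count" are hypothetical names, not part of the answer above.
stream.foreachRDD(batchRdd -> {
    long count = batchRdd.count(); // toy aggregation: rows in this batch
    JavaRDD<Long> result = sc.parallelize(Collections.singletonList(count));
    jhbc.bulkPut(result, "aggTable",
            (Long c) -> {
                Put put = new Put(Bytes.toBytes("batch-" + System.currentTimeMillis()));
                // Put.add is the HBase 0.98 API (the SparkOnHBase era); use addColumn on 1.x+
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(c));
                return put;
            },
            true); // autoFlush
});

ssc.start();
ssc.awaitTermination();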

PS: You could just use plain Spark (without streaming) for that.
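
To make that PS concrete: the queueStream above wraps a single one-shot scan, so once the queue drains the stream just runs dry, and a plain batch job does the same work with less machinery. A minimal sketch, reusing the rdd built above, with the count again standing in for the real aggregation:

// Plain Spark, no streaming: operate on the scan RDD directly.
// No queue and no StreamingContext needed; run this as a batch job
// scheduled after the ingest job.
long rows = rdd.count(); // stand-in for the real aggregation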

miroB
  • Yeah, I just wanted my Spark job to chase the first streaming job. (I guess plain Spark is easier than this.) I will try this out, but I am using Scala and wonder if the same strategy will work for Scala as well... Thanks though! – Dhruvrajsinh R Parmar Jul 26 '16 at 20:57

Splice Machine (open source) has a demo showing Spark Streaming in action.

http://community.splicemachine.com/category/tutorials/data-ingestion-streaming/

Here is sample code for this use case.

https://github.com/splicemachine/splice-community-sample-code/tree/master/tutorial-kafka-spark-streaming

John Leach