
I have built a prototype application with Spark Streaming in Java that uses HyperLogLog to estimate the number of distinct users in a simulated click stream.

Let me briefly sketch my solution. First, I create a stream with KafkaUtils:
JavaPairReceiverInputDStream<String, String> directKafkaStream = KafkaUtils.createStream(streamingContext, ZOOKEEPER_ADDRESS, ZOOKEEPER_GROUP, topics);
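(For context, streamingContext and topics are set up roughly like this; the app name, batch interval, topic name, and receiver-thread count are placeholders:)

// Sketch of the setup around createStream; ZOOKEEPER_ADDRESS and
// ZOOKEEPER_GROUP are constants in my actual configuration.
SparkConf conf = new SparkConf().setAppName("hll-clickstream");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(10));
// Topic name -> number of receiver threads consuming it.
Map<String, Integer> topics = Collections.singletonMap("clicks", 1);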

From there I create a stream that contains only the required field, fullvisitorid:
JavaDStream<String> fullvisitorids = directKafkaStream.map(line -> line._2().split(",")[0]);

To maintain global state (my HyperLogLog object), the only way I found was the updateStateByKey or mapWithState methods. Both seem to require a key-value pair, but in my use case I don't need a key.

So I decided to use a "dummy key":
JavaPairDStream<String, String> keyed = fullvisitorids.mapToPair(value -> new Tuple2<String, String>("key", value));
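Roughly, the stateful part then looks like this (a sketch: stream-lib's HyperLogLog stands in for my HLL object, log2m = 12 is an arbitrary choice, and the Optional import assumes Spark 2.x; updateStateByKey also needs a checkpoint directory set on the context):

import org.apache.spark.api.java.Optional;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

// Fold each batch's values for the dummy key into the single HyperLogLog state.
JavaPairDStream<String, HyperLogLog> state = keyed.updateStateByKey(
    (List<String> ids, Optional<HyperLogLog> current) -> {
        HyperLogLog hll = current.orElse(new HyperLogLog(12));
        for (String id : ids) {
            hll.offer(id);
        }
        return Optional.of(hll);
    });

// Print the running estimate for each batch.
state.mapValues(HyperLogLog::cardinality).print();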

But now my questions:
a) How does Spark parallelize transformations with updateStateByKey or mapWithState on this stream, which only has a single key? Or how does it partition the RDDs over the cluster?

b) Is there a better solution for my problem than adding a dummy key that serves no purpose at all?

JayKay

1 Answer


a) The stream will not be parallelized if you use the hash partitioner with a single key: every record maps to the same partition, so all state updates run in one task. Either define your own partitioner or don't use a single key.
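To illustrate, with a HashPartitioner every record sharing one key lands in the same partition:

org.apache.spark.Partitioner partitioner = new org.apache.spark.HashPartitioner(8);
// Same index for every element of the stream, so one task does all the work.
int index = partitioner.getPartition("key");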

b) The solution is to not use updateStateByKey, which is not intended for global state. Instead, keep a single global HLL object, e.g. from Algebird (here is a Gist that demonstrates how this might look).
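Since the question is in Java, here is a rough sketch of that idea using stream-lib's HyperLogLog in place of Algebird's HLL (the log2m value, the log line, and the Spark 2.x-style mapPartitions signature are my assumptions):

import java.util.Collections;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

// One global HLL held on the driver; each batch is condensed to one partial
// HLL per partition, and the partials are merged into the global object.
HyperLogLog global = new HyperLogLog(12);

fullvisitorids.foreachRDD(rdd -> {
    if (rdd.isEmpty()) return;
    HyperLogLog batch = rdd.mapPartitions(ids -> {
        HyperLogLog partial = new HyperLogLog(12);
        ids.forEachRemaining(partial::offer);
        return Collections.singletonList(partial).iterator();
    }).reduce((a, b) -> { a.addAll(b); return a; });
    global.addAll(batch); // executed on the driver
    System.out.println("Distinct visitors so far: " + global.cardinality());
});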

Marius Soutier