2

How/is it possible to generate a random number or obtain system time for each time a batch is run with Spark Streaming?

I have two functions which process a batch of messages: 1 - First processes the Key, creates a file (csv) and writes headers 2 - Second processes each of the messages and adds the data to the csv

I wish to store the files for each batch in separate folders:

/output/folderBatch1/file1.csv, file2.csv, etc.csv
/output/folderBatch2/file1.csv, file2.csv, etc.csv
/output/folderBatch3/file1.csv, file2.csv, etc.csv

How can I create a variable, even just a simple counter that Spark Streaming can use?

The code below gets the system time but because it's 'plain Java' it gets executed just once and is the same value on each run of the batch.

JavaPairInputDStream<String, byte[]> messages;
messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        byte[].class,
        StringDecoder.class,
        DefaultDecoder.class,
        kafkaParams,
        topicsSet
);

/**
 * Declare what computation needs to be done
 */
JavaPairDStream<String, Iterable<byte[]>> groupedMessages = messages.groupByKey();

String time = Long.toString(System.currentTimeMillis());        //this is only ever run once and is the same value for each batch!

groupedMessages.map(new WriteHeaders(time)).print();

groupedMessages.map(new ProcessMessages(time)).print();

Thank you, KA.

Yuval Itzchakov
  • 146,575
  • 32
  • 257
  • 321
Ken Alton
  • 686
  • 1
  • 9
  • 21
  • Why not pass `System.currentTimeMillis` directly to `groupedMessage.map(..)`? – Yuval Itzchakov Oct 25 '16 at 12:16
  • The value needs to be the same for both map functions per batch. If it's not the same value I'll have csv files with header and no data and data files with no header. – Ken Alton Oct 25 '16 at 12:20
  • For starters, you can consider joining those two operations into a single `map` call. That way you can locally share the timestamp. Other than that, you can create a tuple: `Tuple2>` which captures the timestamp. – Yuval Itzchakov Oct 25 '16 at 12:22
  • I've tried: Tuple2 timeStamp = new Tuple2<>(Long.toString(System.currentTimeMillis()), new byte[]{}); but of course, this just gets executed once - how can I write this in the 'Spark Streaming' way? – Ken Alton Oct 25 '16 at 12:29

1 Answers1

1

You can add the timestamp via an additional map call to and flow it along. This means that instead of a value of type Iterable<byte[]>, you'll have a value of Tuple2<Long, Iterable<byte[]>):

JavaDStream<Tuple2<String, Tuple2<Long, Iterable<byte[]>>>> groupedWithTimeStamp = 
  groupedMessages
    .map((Function<Tuple2<String, Iterable<byte[]>>, 
      Tuple2<String, Tuple2<Long, Iterable<byte[]>>>>) kvp -> 
        new Tuple2<>(kvp._1, new Tuple2<>(System.currentTimeMillis(), kvp._2)));

And now you have the timestamp with you in each map going on from now forward, and you can access it via:

groupedWithTimeStamp.map(value -> value._2._1); // This will access the timestamp.
Yuval Itzchakov
  • 146,575
  • 32
  • 257
  • 321