How can I generate a random number or obtain the system time for each batch that runs in Spark Streaming?
I have two functions which process a batch of messages:
1 - The first processes the key, creates a CSV file and writes the headers.
2 - The second processes each of the messages and appends its data to the CSV file.
I wish to store the files for each batch in separate folders:
/output/folderBatch1/file1.csv, file2.csv, etc.csv
/output/folderBatch2/file1.csv, file2.csv, etc.csv
/output/folderBatch3/file1.csv, file2.csv, etc.csv
How can I create a variable, even just a simple counter that Spark Streaming can use?
The code below gets the system time, but because it is plain Java it is executed only once, when the job is set up, so the value is the same for every batch.
JavaPairInputDStream<String, byte[]> messages;
messages = KafkaUtils.createDirectStream(
    jssc,
    String.class,
    byte[].class,
    StringDecoder.class,
    DefaultDecoder.class,
    kafkaParams,
    topicsSet
);

/**
 * Declare what computation needs to be done
 */
JavaPairDStream<String, Iterable<byte[]>> groupedMessages = messages.groupByKey();

String time = Long.toString(System.currentTimeMillis()); // this is only ever run once and is the same value for each batch!
groupedMessages.map(new WriteHeaders(time)).print();
groupedMessages.map(new ProcessMessages(time)).print();
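
To make the behaviour in the snippet above concrete without needing a Spark cluster, here is a minimal plain-Java sketch. It is not Spark code: `BatchFolderSketch`, `runBatches` and the hard-coded batch times are hypothetical stand-ins for the streaming scheduler. It only contrasts a value computed once at set-up time with a value derived inside a per-batch callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongFunction;

public class BatchFolderSketch {

    // Hypothetical stand-in for a streaming scheduler:
    // calls perBatch once for every batch time and collects the results.
    static List<String> runBatches(long[] batchTimesMs, LongFunction<String> perBatch) {
        List<String> folders = new ArrayList<>();
        for (long t : batchTimesMs) {
            folders.add(perBatch.apply(t));
        }
        return folders;
    }

    public static void main(String[] args) {
        // Made-up batch times in milliseconds, for illustration only.
        long[] batchTimes = {1000L, 2000L, 3000L};

        // Evaluated once, while the job is being set up (the situation in the question).
        String fixed = "/output/folderBatch" + System.currentTimeMillis();
        List<String> same = runBatches(batchTimes, t -> fixed);

        // Evaluated inside the per-batch callback, from the batch time passed in.
        List<String> distinct = runBatches(batchTimes, t -> "/output/folderBatch" + t);

        System.out.println(same);     // three identical folder names
        System.out.println(distinct); // one folder name per batch
    }
}
```

In Spark Streaming itself, the comparable hook would presumably be a per-batch callback. As far as I know, the Java API has `foreachRDD` and `transform` overloads whose function also receives the batch `Time`, but that should be checked against the Spark version in use.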
Thank you, KA.