
I'm pretty new to Spark Streaming and I need some basic clarification on a point I couldn't fully work out from the documentation.

The use case is that I have a set of files containing dumped EVENTS, and each event already has a TIMESTAMP field inside it.

At the moment I'm loading these files, extracting all the events into a JavaRDD, and I would like to pass them to Spark Streaming in order to collect some stats based on the TIMESTAMP (a sort of replay).

My question is whether it is possible to process these events using the EVENT TIMESTAMP as the temporal reference instead of the actual machine time (sorry for the silly question).

If it is possible, will plain Spark Streaming be enough, or will I need to switch to Structured Streaming?

I found a similar question here: Aggregate data based on timestamp in JavaDStream of spark streaming

Thanks in advance

Sokrates
  • I think there are two ways to interpret this. Do you want to process in the same order as dictated by the TIMESTAMP field, or do you want to also preserve the inter-arrival wait times between each event? – ImDarrenG Feb 14 '17 at 09:50
  • Cool, and are you performing aggregations (count by minute, for example), or processing each event individually and in TIMESTAMP order? Sorry for all the questions, it's just that the subtlety of your use case is gonna make a big difference to the approach. – ImDarrenG Feb 14 '17 at 11:07
  • Hi, I might need to work with both scenarios. I mean I have to build some aggregated stats based on a temporal window, but for calculating some of these counters I would need to check for a specific event arrival order based on the timestamp. – Sokrates Feb 14 '17 at 12:44

2 Answers


TL;DR

Yes, you could use either Spark Streaming or Structured Streaming, but I wouldn't if I were you.

Detailed answer

Sorry, no simple answer to this one. Spark Streaming might be better for the per-event processing if you need to individually examine each event. Structured Streaming will be a nicer way to perform aggregations and any processing where per-event work isn't necessary.

However, there is a whole bunch of complexity in your requirements, and how much of it you address depends on the cost of inaccuracy in the streaming job's output.

Spark Streaming makes no guarantee that events will be processed in any kind of order. To impose ordering, you will need to set up a window in which to do your processing, and that window has to hold enough data to capture your temporal ordering accurately while keeping the risk of out-of-order processing at an acceptable level.
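
For illustration, here is a minimal Scala sketch of buffering a DStream window and sorting each windowed batch by the event timestamp before per-event processing; the file path, the CSV-like record layout, and the window sizes are assumptions, not part of the question:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("event-replay").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical source: text files dropped into a directory, one record per line,
// with the event timestamp in the first comma-separated field.
val events = ssc.textFileStream("/path/to/dump")
  .map { line =>
    val fields = line.split(",")
    (fields(0).toLong, line) // (event timestamp, raw record)
  }

// Buffer a generously sized window and sort each windowed batch by event timestamp
// before applying per-event logic; this reduces, but does not eliminate, out-of-order risk.
events
  .window(Minutes(10), Minutes(1))
  .foreachRDD { rdd =>
    rdd.sortByKey().foreachPartition { iter =>
      iter.foreach { case (ts, record) => /* per-event processing here */ }
    }
  }

ssc.start()
ssc.awaitTermination()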

You'll need to give these points some thought:

  • If a batch fails and is retried, how will that affect your counters?
  • If events arrive late, will you ignore them, re-process the whole affected window, or update the output? If the latter, how can you guarantee the update is done safely? (See the watermark sketch after this list.)
  • Will you minimise risk of corruption by keeping hold of a large window of events, or accept any inaccuracies that may arise from a smaller window?
  • Will the partitioning of events cause complexity in the order that they are processed?
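
On the late-arrival point, Structured Streaming's watermarking is the usual way to bound how late an event may be and still update its window. A minimal sketch, assuming JSON event files and a column literally named "timestamp" (both assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

val spark = SparkSession.builder.appName("late-events").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("timestamp", TimestampType),
  StructField("payload", StringType)))

val events = spark.readStream
  .schema(schema)
  .json("/path/to/dump") // hypothetical source directory

val counts = events
  .withWatermark("timestamp", "10 minutes") // events more than 10 minutes late are dropped
  .groupBy(window($"timestamp", "1 minute"))
  .count()

// "update" output mode re-emits a window's count when late (but not too late) events arrive.
counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
  .awaitTermination()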

My opinion is that, unless you have relaxed constraints over accuracy, Spark is not the right tool for the job.

I hope that helps in some way.

ImDarrenG

It is easy to do aggregations based on event time with Spark SQL (in either batch or Structured Streaming). You just need to group by a time window over your timestamp column. For example, the following will bucket your data into 1-minute intervals and give you the count for each bucket.

import org.apache.spark.sql.functions.window

df.groupBy(window($"timestamp", "1 minute") as 'time)
  .count()
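
As a concrete batch usage example, since the events are already dumped to files; the JSON format, the path, and the column name "timestamp" are assumptions about your data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("event-time-counts").getOrCreate()
import spark.implicits._

// Read the dumped event files and make sure the timestamp column has timestamp type.
val df = spark.read.json("/path/to/dump")
  .withColumn("timestamp", $"timestamp".cast("timestamp"))

df.groupBy(window($"timestamp", "1 minute") as 'time)
  .count()
  .orderBy($"time.start")
  .show(false)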
Michael Armbrust