
I am trying to sum a continuous stream of numbers from a file using Hazelcast Jet:

pipe
    .drawFrom(Sources.fileWatcher(<dir>))
    .map(s->Integer.parseInt(s))
    .addTimestamps()
    .window(WindowDefinition.sliding(10000,1000))
    .aggregate(AggregateOperations.summingDouble(x->x))
    .drainTo(Sinks.logger());

A few questions:

  1. It doesn't give the expected output. My expectation is that as soon as a new number appears in the file, it is added to the existing sum.
  2. Why do I need the window and addTimestamps() calls to do this? I just need the sum of an infinite stream.
  3. How can we achieve fault tolerance, i.e. if the server restarts, will it save the aggregated result and, when it comes back up, continue aggregating from the last computed sum?
  4. If the server is down while a few numbers arrive in the file, will it resume reading from the point where it went down once it comes back up, or will it miss the numbers written while it was down and only read the ones that arrive after it is up again?

1 Answer


Answer to Q1 & Q2: You're looking for rollingAggregate; you don't need timestamps or windows.

pipe
    .drawFrom(Sources.fileWatcher(<dir>))
    .rollingAggregate(AggregateOperations.summingDouble(Double::parseDouble))
    .drainTo(Sinks.logger());

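For context, here is a minimal sketch (not part of the original answer) of how such a pipeline could be built and submitted, assuming the same Jet Pipeline API as the snippets above; the directory path and variable names are placeholders:

    // Sketch only: classes are from com.hazelcast.jet.*; "/path/to/dir" is a placeholder.
    JetInstance jet = Jet.newJetInstance();   // or Jet.newJetClient() to connect to a running cluster
    Pipeline pipe = Pipeline.create();
    pipe.drawFrom(Sources.fileWatcher("/path/to/dir"))
        .rollingAggregate(AggregateOperations.summingDouble(Double::parseDouble))
        .drainTo(Sinks.logger());
    jet.newJob(pipe).join();                  // blocks while the streaming job runs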
Answer to Q3 & Q4: the fileWatcher source isn't fault tolerant. The reason is that it reads local files, and when a member dies, those local files won't be available anyway. When the job restarts, it will start reading from the current position and will miss the numbers added while the job was down.
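If you switch to a replayable, fault-tolerant source (such as Kafka, discussed in the comments below), the piece you would add is a processing guarantee on the job config. A minimal sketch, reusing the hypothetical jet and pipe variables from the sketch above:

    // Sketch only: snapshotting helps only when the source can replay data;
    // fileWatcher cannot, so this mainly applies to sources like Kafka.
    JobConfig config = new JobConfig()
            .setProcessingGuarantee(ProcessingGuarantee.EXACTLY_ONCE)
            .setSnapshotIntervalMillis(10_000);
    jet.newJob(pipe, config).join();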

Also, since you use a global aggregation, data from all files will be routed to a single cluster member and the other members will be idle.
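One way to spread that work across members (a hedged sketch, not from the answer; the key function is a made-up example) is to aggregate per key with groupingKey, which partitions the stream:

    // Hypothetical example: bucket numbers by their last digit so the rolling
    // sums are partitioned across the cluster instead of kept on one member.
    pipe
        .drawFrom(Sources.fileWatcher("/path/to/dir"))          // placeholder directory
        .groupingKey(s -> Integer.parseInt(s) % 10)             // made-up partitioning key
        .rollingAggregate(AggregateOperations.summingDouble(Double::parseDouble))
        .drainTo(Sinks.logger());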

  • Is rollingAggregate fault tolerant even when the whole cluster goes down, all nodes at once? Kafka, for example, saves partitions to local disk, and if all Kafka nodes go down it recovers from disk. – Abhishek Jan 22 '19 at 16:20
  • Yes, `rollingAggregate` is fault tolerant. In the sample above, the source isn't. – Oliv Jan 23 '19 at 07:33
  • But as per the documentation it saves the data in an IMap, and when the Jet cluster restarts that IMap is gone, so how will it recover from that? I understand fileWatcher may not be fault tolerant, but even with some other source like a Kafka topic, the map is still gone after the restart. – Abhishek Jan 23 '19 at 07:47
  • It will survive a job restart or a member restart (if you have backups configured, which is the default), but not a cluster restart. If you need the IMap to survive a cluster restart or a cluster crash, you need the commercial HotRestart feature, or do it yourself using a MapStore, or dump the maps to disk before shutting down the cluster. – Oliv Jan 23 '19 at 11:25