Does a Hazelcast Jet stream store data in nodes along with the aggregation?

Question

I am using Hazelcast Jet to aggregate (sum) a stream of data.

The source is Kafka, from which I receive integers, and the Jet stream simply adds each incoming number.

I have a few questions. 1. When it receives each number, along with aggregating it, Jet saves the data in an IMap. How can I access that snapshot?

I want to understand what it is storing in the snapshot: the aggregated result or the raw numbers. – Abhishek Jan 24 '19 at 04:07

score 1 · Answer 1 · answered Jan 23 '19 at 19:03


@Abhishek, Hazelcast Jet takes snapshots only if you configure it to, and it takes them periodically, not with each number. You cannot access that map, and even if you could, the data stored in it uses an internal data structure, so you could not simply view your numbers there.

If you can share what kind of information you're trying to get, along with your job definition if possible, I can help you more.

answered Jan 23 '19 at 19:03

Gokhan Oner


My use case is simple: just add each number coming in the stream (which can be Kafka, a file watcher, or anything). My question is: if Jet always just adds two numbers, the aggregate so far and the current number, then why do we need multiple Jet nodes to add two numbers? Even for a complex calculation, will it just operate on the last computed result and the new number? I wanted to access the map to understand what it is storing. – Abhishek Jan 24 '19 at 04:06
@Abhishek, for your use case you can use a single Jet node. The purpose of having multiple processing nodes is to handle hundreds of thousands of data points per second that need to be processed in groups. Having multiple members speeds up these kinds of calculations and gives you fault tolerance, so even if you lose a member, the calculation just continues. – Gokhan Oner Jan 24 '19 at 04:11
If the data is unbounded, Jet will always receive and process it one item at a time. Does that mean that for unbounded data we need just one node, and only batch data needs multiple nodes? My other question: in the case of streaming data, will Jet store each number in its map, or the aggregated result? – Abhishek Jan 24 '19 at 04:19
You're wrong. When using Kafka as a source, Jet reads data from all nodes and all Kafka partitions, which means that with multiple Kafka partitions and multiple Jet nodes your throughput will increase. It depends on the source type and whether it supports multiple consumers. – Gokhan Oner Jan 24 '19 at 04:51
Does that mean Jet will consume one message at a time from one partition of a Kafka node? Say it consumes from multiple nodes: in practice there will be only a few Kafka nodes and a few partitions, so all in all it will consume, let's say, 1000 messages at once. Do we still not need multiple nodes to process 1000 messages at a time? Sorry, I couldn't find these specific answers in the docs, so I'm asking these questions to clarify my understanding. Apologies if these are lame questions. – Abhishek Jan 24 '19 at 05:00
No. For Kafka, you tell Jet how many bytes it should get while consuming by setting the `message.max.bytes` parameter in the broker properties, which defaults to 1000012 bytes (see http://kafka.apache.org/documentation.html#brokerconfigs). For your use case, you'll most probably be OK with a single Jet node if you only have 1000 messages. Jet is designed for processing millions of messages per second. – Gokhan Oner Jan 24 '19 at 06:41
OK, got you, thanks so much for the clarification. One last question: will Jet store the raw data (in this case each number) in the map along with processing it, or will it store only the aggregated result (regardless of whether snapshots are enabled)? – Abhishek Jan 24 '19 at 06:52
If you enable snapshots, it'll store all internal data for all processors, including sources, aggregations, etc. Please see: https://docs.hazelcast.org/docs/jet/0.7.2/manual/#distributed-snapshot Again, it stores not each data item but the data held in the relevant processor when the snapshot is taken. – Gokhan Oner Jan 24 '19 at 07:09
Snapshots are for fault tolerance. That is, if a member fails, you're able to restart on the remaining members. It's not designed to query the state of the computation. – Oliv Jan 24 '19 at 08:07
When it takes a new snapshot, does it delete the old one, since we now have the latest data? – Abhishek Jan 24 '19 at 08:24
So the Jet cluster doesn't store the raw data, only some internal data. What is the space complexity of that data? Does it depend on the number of items? The idea is to know how much memory my Jet cluster would need: will it be constant, or O(n), where n is the number of messages received so far in a continuous stream of data? – Abhishek Jan 24 '19 at 19:30
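
To make the "aggregated result vs. raw numbers" point from the comments concrete, here is a minimal sketch in plain Java. It is not Jet's API and not its internal snapshot format; `RunningSumSketch`, `SumAccumulator`, `snapshot`, `restore`, and `sumPartitions` are hypothetical names made up for illustration. It assumes only that a sum aggregation behaves like an ordinary accumulator: the state is a single running total, so saving it costs constant space no matter how many numbers have arrived, and partial sums from several partitions can be merged.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration only; not Jet's API or its internal snapshot format. */
final class RunningSumSketch {

    /** Mutable accumulator: the only state a running sum needs. */
    static final class SumAccumulator {
        private long total;

        /** Folds one incoming number into the running total. */
        void add(long value) {
            total += value;
        }

        /** Merges a partial sum computed elsewhere, e.g. for another partition. */
        void combine(SumAccumulator other) {
            total += other.total;
        }

        /** The saved state is just the current total, never the raw inputs. */
        long snapshot() {
            return total;
        }

        /** Rebuilds an accumulator from a previously saved total. */
        static SumAccumulator restore(long saved) {
            SumAccumulator acc = new SumAccumulator();
            acc.total = saved;
            return acc;
        }
    }

    /** Sums each partition independently, then merges the partial results. */
    static long sumPartitions(List<List<Integer>> partitions) {
        SumAccumulator result = new SumAccumulator();
        for (List<Integer> partition : partitions) {
            SumAccumulator partial = new SumAccumulator();
            for (int value : partition) {
                partial.add(value);
            }
            result.combine(partial);
        }
        return result.snapshot();
    }

    public static void main(String[] args) {
        SumAccumulator acc = new SumAccumulator();
        for (int value : new int[] {1, 2, 3}) {
            acc.add(value);
        }
        long saved = acc.snapshot();                      // one long, whatever the item count
        SumAccumulator resumed = SumAccumulator.restore(saved);
        resumed.add(4);                                   // continue from the saved total
        System.out.println(resumed.snapshot());           // prints 10

        List<List<Integer>> partitions =
                Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4));
        System.out.println(sumPartitions(partitions));    // prints 10
    }
}
```

Under that assumption, the memory needed for a plain sum stays constant as the stream grows. Aggregations that must remember more, such as grouping by key or windowing, would hold state proportional to the number of keys or open windows instead.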

