First off, I'm fairly new to Storm/Trident, and I have been struggling with this problem for many hours.
What I have is one Kafka topic with one partition. A producer sends tuples to this topic every x milliseconds. A TransactionalTridentKafkaSpout reads from this topic and some Trident operators process them. The whole topology is running in local mode (remote mode is untested so far).
The main code of the topology is:
    TransactionalTridentKafkaSpout spout = new TransactionalTridentKafkaSpout(spoutConf);
    TridentTopology topology = new TridentTopology();
    Stream inStream = topology.newStream("kafka-spout", spout).parallelismHint(4);

    TridentState state1 = inStream
        .groupBy(new Fields(ID_FIELD))
        .persistentAggregate(new MemoryMapState.Factory(), new Fields(ID_FIELD, FIELD1, FIELD2, FIELD3), new CustomCombinerAgg1(), new Fields(COMB_AGG_1_FIELD))
        .parallelismHint(4);

    state1.newValuesStream()
        .groupBy(new Fields(ID_FIELD))
        .persistentAggregate(new MemoryMapState.Factory(), new Fields(ID_FIELD, COMB_AGG_1_FIELD), new CustomCombinerAgg2(), new Fields(COMB_AGG_2_FIELD))
        .parallelismHint(4);

    state1.newValuesStream()
        .filter(new Fields(ID_FIELD, COMB_AGG_1_FIELD), new CustomBaseFilter1());

    inStream.groupBy(new Fields(ID_FIELD))
        .persistentAggregate(new MemoryMapState.Factory(), new Fields(ID_FIELD, FIELD1, FIELD2), new CustomCombinerAgg3(), new Fields(COMB_AGG_3_FIELD));

    inStream.groupBy(new Fields(ID_FIELD))
        .persistentAggregate(new MemoryMapState.Factory(), new Fields(ID_FIELD, FIELD1, FIELD2, FIELD3), new CustomCombinerAgg4(), new Fields(COMB_AGG_4_FIELD))
        .newValuesStream()
        .filter(new Fields(ID_FIELD, COMB_AGG_4_FIELD), new CustomBaseFilter2());
The problem is that the shorter the producer's message interval, the fewer tuples some of the operators are executed for.
For example, if the producer sends 200 tuples at an interval of 100 ms, every operator correctly handles all 200 tuples. If the interval is set to 20 ms, however, the operators are executed for only the following numbers of tuples (for example):
CustomCombinerAgg1: 200
CustomCombinerAgg2: 50
CustomBaseFilter1: 50
CustomCombinerAgg3: 150
CustomCombinerAgg4: 180
CustomBaseFilter2: 60
As far as I understand, (Transactional) Trident guarantees exactly-once processing, and a new batch of tuples should only be fetched from the spout once the previous one has been fully processed. That does not seem to be the case here. It rather looks like the first operator, CustomCombinerAgg1, dictates the speed, and the downstream operators then can't handle all the tuples in the given time?
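To make my suspicion concrete, here is a minimal plain-Java sketch (not Trident code, and only a guess at what might be going on). It assumes that tuples are grouped into batches covering a fixed time window and that a downstream operator runs once per distinct ID per batch rather than once per tuple. The class name, the method name, the 80 ms window and the round-robin ID assignment are all made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class BatchCountSketch {

    /**
     * Counts how often a downstream operator would run under the assumed model:
     * one invocation per distinct key per batch.
     *
     * tuples         number of tuples the producer sends
     * intervalMs     time between two produced tuples
     * batchWindowMs  assumed time span covered by one batch (hypothetical)
     * distinctKeys   number of distinct ID values, assigned round-robin
     */
    public static int downstreamCalls(int tuples, int intervalMs,
                                      int batchWindowMs, int distinctKeys) {
        int calls = 0;
        long currentBatch = -1;
        Set<Integer> keysInBatch = new HashSet<>();
        for (int i = 0; i < tuples; i++) {
            // Batch index of the i-th tuple, based on its send time.
            long batch = ((long) i * intervalMs) / batchWindowMs;
            if (batch != currentBatch) {
                // The previous batch is complete: one call per key seen in it.
                calls += keysInBatch.size();
                keysInBatch.clear();
                currentBatch = batch;
            }
            keysInBatch.add(i % distinctKeys);
        }
        // Account for the last, still open batch.
        return calls + keysInBatch.size();
    }

    public static void main(String[] args) {
        System.out.println(downstreamCalls(200, 100, 80, 1)); // slow producer
        System.out.println(downstreamCalls(200, 20, 80, 1));  // fast producer
    }
}
```

In this toy model the slow producer puts every tuple into its own batch, so the operator runs 200 times, while the fast producer packs four tuples into each batch, so it runs only 50 times. I don't know whether this is what actually happens in my topology, but it is the kind of effect I am wondering about.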
What I would expect is that every operator is executed for every tuple, and that the next tuple/batch is only fetched once the current one has been processed by all operators. Shouldn't this be the case with Trident? Am I doing something wrong? How can I achieve this behaviour?
How does Trident even know when a tuple has been fully processed? As far as I know, you have to ack() tuples in Storm, but Trident operators have no OutputCollector and hence cannot call ack(). Is my problem tied to this somehow?
Thanks.