- I use `fields grouping` with Storm.
- The problem is that because I have multiple machines and multiple bolts (obviously), tuples move between the machines, and I suspect this reduces my performance drastically.
- Is it possible for a specific field-grouping outcome to be sticky to a specific machine?
- Or, in more detail: for field grouping, could `account1` be sticky with all its bolts to `machine1`, `account2` to `machine3`, `account3` to `machine1`, and so on, so that each `account` has all of its bolts running on a single specific machine?
- Note that once the first `bolt` processes the event, it emits `accountid` in its output tuple. All further `bolts` from that point on have `accountid`, meaning I would want to do `fields grouping` on `accountid` for all further `bolts` in the `topology` (see the wiring sketch below). (added for clarification after seeing the first answer)
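To make the described wiring concrete, here is a minimal sketch in Java, assuming hypothetical spout/bolt classes (`EventSpout`, `ResolveAccountBolt`, `AccountBolt`) around the `accountid` field from the question:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// Sketch only: the spout/bolt classes are hypothetical placeholders.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("eventSpout", new EventSpout(), 3);

// The first bolt resolves the account and adds "accountid" to its output tuple.
builder.setBolt("firstBolt", new ResolveAccountBolt(), 3)
       .shuffleGrouping("eventSpout");

// From here on, every downstream bolt can use fields grouping on "accountid":
// all tuples for the same account reach the same task, but Storm gives no
// guarantee about which machine that task runs on.
builder.setBolt("accountBolt", new AccountBolt(), 3)
       .fieldsGrouping("firstBolt", new Fields("accountid"));
```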

-
http://stackoverflow.com/questions/36368224/is-there-a-way-to-apply-multiple-groupings-in-storm/36374837?noredirect=1#comment60465924_36374837 also discusses the same issue. – user2250246 Apr 05 '16 at 17:40
2 Answers
Assume you have 3 producers P1, P2, P3 and three consumers C1, C2, C3, with 3 machines each hosting a single producer-consumer pair, i.e., P1-C1. Furthermore, assume you have 3 distinct key values `a`, `b`, `c`, and that C1 processes all tuples with key `a`.
In general, tuples with key `a` can be emitted by all three producers. Furthermore, P1 can also emit tuples with key `b` or `c`. Thus, you cannot limit the data transfer to local machines using fields grouping, because you need to re-partition all data.
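This follows from how fields grouping routes tuples: the target task is a function of the key alone, not of where the emitting task runs. An illustrative sketch (not Storm's actual implementation) of the conventional hash-modulo scheme:

```java
// Illustrative only -- not Storm's actual code. Fields grouping derives the
// target task from the key, so every producer on every machine routes key "a"
// to the same consumer task, wherever the scheduler happened to place it.
public class FieldsGroupingSketch {
    static int chooseTask(Object key, int numConsumerTasks) {
        return Math.floorMod(key.hashCode(), numConsumerTasks);
    }

    public static void main(String[] args) {
        // All three producers, on all three machines, agree on the target for "a":
        System.out.println(chooseTask("a", 3));
    }
}
```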
Extension
If you have additional bolts B1 to B3 that consume data from C1 to C3 and use the same fields-grouping key as C1 to C3 (i.e., Bx could exploit the partitioning already established by Cx), you would need to ensure that B1 to B3 are co-located on the same machines as C1 to C3 in order to avoid a re-partitioning. Co-location can be achieved by providing a custom scheduler to Storm. See here for an example: https://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/
However, you cannot use fields grouping to connect C1->B1 etc., because fields grouping is agnostic to operator co-location and to the already partitioned data (it would just re-partition that data again). Instead, you would need to use direct or custom grouping to ensure that all data from C1 is sent to B1, etc. A sketch of such a custom grouping follows below.
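A minimal sketch of what such a custom grouping could look like, assuming Cx and Bx have equal parallelism (the class name is hypothetical; `CustomStreamGrouping` is Storm's interface). It forwards each source task's output to the target task with the same index, preserving the upstream partitioning; co-location itself still has to come from the scheduler:

```java
import java.util.Collections;
import java.util.List;

import org.apache.storm.generated.GlobalStreamId;
import org.apache.storm.grouping.CustomStreamGrouping;
import org.apache.storm.task.WorkerTopologyContext;

// Hypothetical sketch: route each Cx task's tuples to the Bx task with the
// same index, so the partitioning established upstream is kept as-is.
// Assumes equal parallelism for the source and target components.
public class CoPartitionedGrouping implements CustomStreamGrouping {
    private List<Integer> sourceTasks;
    private List<Integer> targetTasks;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.sourceTasks = context.getComponentTasks(stream.get_componentId());
        this.targetTasks = targetTasks;
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // Route by emitting task, not by tuple contents.
        return Collections.singletonList(
                targetTasks.get(sourceTasks.indexOf(taskId)));
    }
}
```

It would be wired in with `builder.setBolt("B1", new SomeBolt(), 3).customGrouping("C1", new CoPartitionedGrouping());` (component and bolt names hypothetical).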

-
What I thought of is having each and every `(Px,Cx)` pair on all of the machines. In my case, as soon as `C1` emits its results it has the `accountid` in this result, so from then on I can do `fields grouping` on this `accountid`; all further producers and consumers in the topology from that point on would have an `accountid` in their outputs. So I do have an `accountid` emitted in all outputs except the first output, and I was planning to have all `(Px,Cx)` on all hosts. Any chance such a thing would be possible with the `storm` implementation? – Jas Feb 01 '16 at 15:05
-
I cannot follow... What do you mean by "having each and every of the `(Px,Cx)` pair on all of the machines" ? I extended my answer (hope this covers the second part of your question -- if I understand you correctly) – Matthias J. Sax Feb 01 '16 at 15:21
Can `localOrShuffleGrouping` help? https://github.com/apache/storm/blob/a4f9f8bc5b4ca85de487a0a868e519ddcb94e852/storm-core/src/jvm/org/apache/storm/topology/TopologyBuilder.java#L360
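For reference, a minimal wiring sketch with hypothetical component names: `localOrShuffleGrouping` shuffles among target tasks in the same worker process when any exist, and only goes over the network otherwise:

```java
import org.apache.storm.topology.TopologyBuilder;

// Sketch only: spout/bolt classes and names are hypothetical placeholders.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new MySpout(), 3);
// Prefer bolt tasks in the same worker; fall back to a normal shuffle.
builder.setBolt("bolt", new MyBolt(), 3)
       .localOrShuffleGrouping("spout");
```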

-
Is it deterministic? I mean, for a certain key `account1`, could one tuple reach a `local bolt` while another reaches a `remote bolt`? Or, if it sent something to a `local bolt`, will it keep doing so, and for `account2`, if sent to a `remote bolt`, would it continue sending to that same `remote bolt`? Thanks. – Jas Feb 01 '16 at 17:15